[00:00:04] RoanKattouw and Urbanecm: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T0000).
[00:00:04] eigyan and samwilson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:14] SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[00:00:19] i can deploy today
[00:00:37] but neither eigyan nor samwilson is here?
[00:00:43] thanks urbanecm , I'm here.
[00:00:54] hello samwilson
[00:01:27] hullo :)
[00:01:28] I'm confused, did you somehow read my previous two messages?
[00:01:56] yep, I'm bridged from Matrix, so maybe something's weird?
[00:02:03] I can see all history
[00:02:04] ah
[00:02:09] that's why
[00:02:31] I'm directly connected to IRC, and it shows like "samwilson joined" and then you directly reply to what i said
[00:02:43] (CR) Urbanecm: [C: +2] Enable Disambiguator notifications for French Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/753175 (https://phabricator.wikimedia.org/T293319) (owner: Samwilson)
[00:02:51] oh yeah, that's strange
[00:03:26] (Merged) jenkins-bot: Enable Disambiguator notifications for French Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/753175 (https://phabricator.wikimedia.org/T293319) (owner: Samwilson)
[00:03:30] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[00:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:48] samwilson: your patch is at mwdebug1001, can you check?
[00:04:01] testing now
[00:05:14] urbanecm: yep, all looks good! go for it.
[00:05:17] syncing
[00:05:18] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[00:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 24a26392a3e36aa3a46445eb1f87e808b57b19c8: Enable Disambiguator notifications for French Wikipedia (T293319) (duration: 01m 08s)
[00:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:50] T293319: Rollout plan for disambiguation notifications (wgDisambiguatorNotifications) - https://phabricator.wikimedia.org/T293319
[00:07:16] samwilson: it's live
[00:07:18] anything else?
[00:07:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:30] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[00:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:38] urbanecm: nope, everything looks great. thanks!
[00:09:48] great!
[00:09:57] !log UTC late evening B&C done
[00:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:11:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:09] (PS1) Bartosz Dziewoński: DiscussionTools: Use bullet indentation on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/753192 (https://phabricator.wikimedia.org/T259864)
[00:14:53] MatmaRex: as it's still B&C time technically, happy to do that for you if you want :)
[00:15:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:38] urbanecm: oh, no, we haven't merged the code that implements it yet :D
[00:15:50] ah, okay then :)
[00:19:33] SRE, Data-Engineering, Traffic-Icebox: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (odimitrijevic)
[00:19:34] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[00:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:39] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:21:43] SRE, Data-Engineering, Traffic-Icebox: Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source - https://phabricator.wikimedia.org/T225786 (odimitrijevic)
[00:21:45] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:26:21] SRE, Analytics, Data-Engineering, Event-Platform, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (odimitrijevic)
[00:28:37] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:30:00] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (odimitrijevic)
[00:30:06] SRE, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (odimitrijevic)
[00:30:56] SRE, Data-Engineering: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (odimitrijevic)
[00:34:56] SRE, Analytics, Data-Engineering, Event-Platform, and 3 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (odimitrijevic)
[00:35:04] SRE, Analytics, Data-Engineering, Discovery, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (odimitrijevic)
[00:35:21] SRE, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (odimitrijevic)
[00:37:41] Greetings I am a bit late
[00:43:03] hello eigyan
[00:43:14] hello
[00:43:19] looks like your patch is beta-only, is that right eigyan?
[00:43:24] correct
[00:43:48] (PS4) Urbanecm: wmf-config: Update coverage to 0.5 in gdi-survey on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) (owner: Eigyan)
[00:43:52] (CR) Urbanecm: [C: +2] wmf-config: Update coverage to 0.5 in gdi-survey on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) (owner: Eigyan)
[00:44:05] eigyan: in that case, it will be deployed automatically within 30 minutes from today
[00:44:07] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (wiki_willy)
[00:44:23] Excellent thank you urbanecm
[00:44:31] (Merged) jenkins-bot: wmf-config: Update coverage to 0.5 in gdi-survey on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) (owner: Eigyan)
[00:44:33] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (wiki_willy)
[00:49:50] SRE, Analytics, Data-Engineering, Event-Platform, and 3 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (odimitrijevic)
[00:50:02] SRE, Analytics, Data-Engineering, Discovery, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (odimitrijevic)
[00:50:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:51:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:51:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:01:59] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:00:17] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:47] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:25] (CR) Legoktm: [C: +1] "Having a single version is much nicer. During the initial setup/rollout there were times I'd just update one deployment if e.g. I was only" [deployment-charts] - https://gerrit.wikimedia.org/r/753020 (owner: Giuseppe Lavagetto)
[02:17:13] (CR) Legoktm: [C: +1] shellbox: rationalize version handling, promote to 1.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/753020 (owner: Giuseppe Lavagetto)
[02:18:07] SRE, ops-eqiad, DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (Papaul) @Jclark-ctr looking at the image you shared at https://usercontent.irccloud-cdn.com/file/5YslcsIX/1641945459.JPG i see you are using orange cables to msw2 and not to the console server. We use ora...
[03:04:11] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:14:39] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:28:01] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:11] (PS1) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[05:34:59] (CR) jerkins-bot: [V: -1] nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[06:00:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:00:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:53] SRE, Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (KartikMistry)
[06:02:33] SRE, Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (KartikMistry)
[06:07:41] (PS1) Marostegui: db1169: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753341 (https://phabricator.wikimedia.org/T295965)
[06:08:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169 for Bullseye reimage T295965', diff saved to https://phabricator.wikimedia.org/P18617 and previous config saved to /var/cache/conftool/dbconfig/20220112-060803-marostegui.json
[06:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:07] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[06:08:56] (CR) Marostegui: [C: +2] db1169: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753341 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[06:09:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T297191)', diff saved to https://phabricator.wikimedia.org/P18618 and previous config saved to /var/cache/conftool/dbconfig/20220112-060923-marostegui.json
[06:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:26] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[06:12:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bullseye
[06:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T297191)', diff saved to https://phabricator.wikimedia.org/P18619 and previous config saved to /var/cache/conftool/dbconfig/20220112-062449-marostegui.json
[06:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:53] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[06:36:43] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS bullseye
[06:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bullseye
[06:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P18620 and previous config saved to /var/cache/conftool/dbconfig/20220112-063953-marostegui.json
[06:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:35] (CR) Giuseppe Lavagetto: [C: +2] shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/753020 (owner: Giuseppe Lavagetto)
[06:46:07] (Merged) jenkins-bot: shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/753020 (owner: Giuseppe Lavagetto)
[06:48:51] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply on main
[06:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:53] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: sync on main
[06:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:32] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply on main
[06:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:38] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: sync on main
[06:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:40] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply on main
[06:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:37] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: sync on main
[06:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P18621 and previous config saved to /var/cache/conftool/dbconfig/20220112-065458-marostegui.json
[06:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:22] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply on main
[06:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:52] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: sync on main
[06:55:53] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply on main
[06:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:23] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: sync on main
[06:57:24] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply on main
[06:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:01] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: sync on main
[06:58:02] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply on main
[06:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:41] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: sync on main
[06:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:40] ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Marostegui)
[07:02:02] ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Marostegui) p:Triage→High I am setting this to high as this is a live s1 host and that we need to test Bullseye there to make sure we are ready for it so we can confirm that {T297913} would be unblocked if we can go...
[07:02:37] (PS3) Giuseppe Lavagetto: shellbox-*: promote to new build [deployment-charts] - https://gerrit.wikimedia.org/r/753021 (https://phabricator.wikimedia.org/T292322)
[07:02:57] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS bullseye
[07:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:17] (PS1) Marostegui: mariadb: Move db1128 to s1. [puppet] - https://gerrit.wikimedia.org/r/753343 (https://phabricator.wikimedia.org/T295965)
[07:10:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T297191)', diff saved to https://phabricator.wikimedia.org/P18622 and previous config saved to /var/cache/conftool/dbconfig/20220112-071003-marostegui.json
[07:10:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[07:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[07:10:07] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[07:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:14] (PS2) Marostegui: mariadb: Move db1128 to s1. [puppet] - https://gerrit.wikimedia.org/r/753343 (https://phabricator.wikimedia.org/T295965)
[07:11:40] (CR) Marostegui: [C: +2] mariadb: Move db1128 to s1. [puppet] - https://gerrit.wikimedia.org/r/753343 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[07:13:15] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:18:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:55] (CR) Giuseppe Lavagetto: [C: +2] shellbox-*: promote to new build [deployment-charts] - https://gerrit.wikimedia.org/r/753021 (https://phabricator.wikimedia.org/T292322) (owner: Giuseppe Lavagetto)
[07:25:13] (Merged) jenkins-bot: shellbox-*: promote to new build [deployment-charts] - https://gerrit.wikimedia.org/r/753021 (https://phabricator.wikimedia.org/T292322) (owner: Giuseppe Lavagetto)
[07:28:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[07:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[07:28:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T297191)', diff saved to https://phabricator.wikimedia.org/P18623 and previous config saved to /var/cache/conftool/dbconfig/20220112-072826-marostegui.json
[07:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:29] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[07:28:55] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply on main
[07:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:16] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: sync on main
[07:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:08] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply on main
[07:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:53] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: sync on main
[07:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:58] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply on main
[07:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:08] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: sync on main
[07:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:12] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply on main
[07:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:04] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: sync on main
[07:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:08] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply on main
[07:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:02] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: sync on main
[07:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:33] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:59] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply on main
[07:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:53] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: sync on main
[07:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:28] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply on main
[07:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:10] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: sync on main
[07:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:27] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply on main
[07:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:17] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: sync on main
[07:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:27] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main
[07:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:06] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync on main
[08:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T297191)', diff saved to https://phabricator.wikimedia.org/P18624 and previous config saved to /var/cache/conftool/dbconfig/20220112-080510-marostegui.json
[08:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:15] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[08:13:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:14:31] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:16:01] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P18625 and previous config saved to /var/cache/conftool/dbconfig/20220112-082015-marostegui.json
[08:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:59] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) After the deployment, running the same command as before results in: ` var_dump($result); object(Shellbox\Command\BoxedResult)#675 (4) { ["...
[08:22:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mwdebug1001.eqiad.wmnet
[08:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:37] SRE, ops-eqiad, DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (ayounsi) Thanks John! I see three issues: * msw1-eqiad:et-0/1/0 is still showing up as down, so something is wrong on the path to msw2-eqiad:et-0/1/0, could you check light, etc? * From the picture, ge-0...
[08:26:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mwdebug1001.eqiad.wmnet [08:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mwdebug1002.eqiad.wmnet [08:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mwdebug1002.eqiad.wmnet [08:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:22] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10Antoine_Quhen) Thanks @cmooney My analytics-admin access is working as it should. For example, I can now access an-launcher1002... [08:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P18626 and previous config saved to /var/cache/conftool/dbconfig/20220112-083520-marostegui.json [08:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM miscweb1002.eqiad.wmnet [08:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM miscweb1002.eqiad.wmnet [08:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:11] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [08:50:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T297191)', diff saved to https://phabricator.wikimedia.org/P18627 and previous config saved 
to /var/cache/conftool/dbconfig/20220112-085024-marostegui.json
[08:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:29] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[08:50:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:50:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[08:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[08:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:14] (CR) Elukey: "I think this is a great start, thanks a lot for working on it! Left some comments about possible ideas, lemme know your thoughts." [puppet] - https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: Majavah)
[09:03:10] (PS1) Marostegui: ProductionServices.php: Replace pc1011 with pc1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/753420 (https://phabricator.wikimedia.org/T295965)
[09:04:38] (CR) Majavah: kerberos: manage users with custom puppet type (3 comments) [puppet] - https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: Majavah)
[09:05:16] (CR) Ladsgroup: [C: +1] "double checked the IPs and the cluster config. Looks correct."
[mediawiki-config] - https://gerrit.wikimedia.org/r/753420 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[09:05:32] (CR) Marostegui: [C: +2] ProductionServices.php: Replace pc1011 with pc1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/753420 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[09:05:55] !log Reset replication on pc1014
[09:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:17] (Merged) jenkins-bot: ProductionServices.php: Replace pc1011 with pc1014 [mediawiki-config] - https://gerrit.wikimedia.org/r/753420 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[09:08:10] (PS1) Marostegui: mariadb: Promote pc1014 to pc1 master [puppet] - https://gerrit.wikimedia.org/r/753422 (https://phabricator.wikimedia.org/T295965)
[09:08:23] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to master in pc1 (duration: 01m 08s)
[09:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:50] (CR) Marostegui: [C: +2] mariadb: Promote pc1014 to pc1 master [puppet] - https://gerrit.wikimedia.org/r/753422 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[09:09:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T297191)', diff saved to https://phabricator.wikimedia.org/P18628 and previous config saved to /var/cache/conftool/dbconfig/20220112-090959-marostegui.json
[09:10:01] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:02] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[09:10:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:11:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1011.eqiad.wmnet with OS bullseye
[09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:17] PROBLEM - MariaDB Replica IO: pc1 on pc2011 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1011.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1011.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:16:59] marostegui: ^
[09:20:16] expected
[09:20:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: switch to plain disk storage
[09:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet
with reason: switch to plain disk storage
[09:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:40] !log reverting kubetcd2005 back to "plain" storage
[09:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:09] (CR) Matthias Mullie: [C: +2] Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753060 (https://phabricator.wikimedia.org/T297484) (owner: Cparle)
[09:36:55] RECOVERY - MariaDB Replica IO: pc1 on pc2011 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:37:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1011.eqiad.wmnet with OS bullseye
[09:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:54] (PS1) Elukey: helmfile.d: move eventgate-analytics* to the WMF CA cert bundle [deployment-charts] - https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064)
[09:39:46] (PS1) Marostegui: Revert "mariadb: Promote pc1014 to pc1 master" [puppet] - https://gerrit.wikimedia.org/r/753072
[09:40:09] (PS1) Marostegui: Revert "ProductionServices.php: Replace pc1011 with pc1014" [mediawiki-config] - https://gerrit.wikimedia.org/r/753073
[09:42:19] (Merged) jenkins-bot: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753060 (https://phabricator.wikimedia.org/T297484) (owner: Cparle)
[09:43:46] (CR) Ladsgroup: [C: +1] Revert "ProductionServices.php: Replace pc1011 with pc1014" [mediawiki-config] - https://gerrit.wikimedia.org/r/753073 (owner: Marostegui)
[09:43:57] (PS1) DCausse: sre.wdqs.data-reload: few fixes and cleanups: [cookbooks] - https://gerrit.wikimedia.org/r/753426
[09:46:35] (CR) jerkins-bot: [V: -1]
sre.wdqs.data-reload: few fixes and cleanups: [cookbooks] - https://gerrit.wikimedia.org/r/753426 (owner: DCausse)
[09:46:38] (CR) Jelto: [C: +1] "lgtm now" [puppet] - https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[09:47:16] (PS1) Matthias Mullie: Undo update to the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753427
[09:48:23] (CR) Matthias Mullie: [C: -2] "-2 to prevent it from being merged by accident. Please remove my +2 and merge it if it does need to be deployed!" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753427 (owner: Matthias Mullie)
[09:48:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:28] (PS2) DCausse: sre.wdqs.data-reload: few fixes and cleanups: [cookbooks] - https://gerrit.wikimedia.org/r/753426
[09:49:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:49:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:02] !log jelto@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab-runner1001.eqiad.wmnet
[09:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 -
https://phabricator.wikimedia.org/T294120 (ops-monitoring-bot) VM gitlab-runner1001.eqiad.wmnet rebooted by jelto@cumin1001 with reason: Ganeti Migration
[09:51:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to plain disk storage
[09:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to plain disk storage
[09:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:41] !log reverting kubetcd2006 back to "plain" storage
[09:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab-runner1001.eqiad.wmnet
[09:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gitlab site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:54:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:54:44] !log Decommissioning cassandra instance restbase2009-b via nodetool
[09:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:59] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Replace pc1011 with pc1014" [mediawiki-config] - https://gerrit.wikimedia.org/r/753073 (owner: Marostegui)
[09:55:22] (CR) Marostegui: [C: +2] Revert "mariadb: Promote pc1014 to pc1 master" [puppet] - https://gerrit.wikimedia.org/r/753072 (owner: Marostegui)
[09:55:44] (Merged) jenkins-bot: Revert "ProductionServices.php: Replace pc1011 with pc1014" [mediawiki-config] - https://gerrit.wikimedia.org/r/753073 (owner: Marostegui)
[09:57:00] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Revert: Promote pc1014 to master in pc1 (duration: 01m 07s)
[09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:00] (CR) Ayounsi: [C: -1] "The upstream list is too dynamic to be written down and will generate false positive."
[puppet] - https://gerrit.wikimedia.org/r/753147 (owner: Jbond)
[10:00:36] (CR) Ayounsi: hieradata - cloud: add config for prefies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753117 (owner: Jbond)
[10:00:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[10:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[10:02:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[10:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[10:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:09] !log jelto@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab1001.wikimedia.org
[10:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:15] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (ops-monitoring-bot) VM gitlab1001.wikimedia.org rebooted by jelto@cumin1001 with reason: Ganeti Migration
[10:08:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab1001.wikimedia.org
[10:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:51] (CR) DCausse: [C: -1] sre.wdqs.data-reload: few fixes and cleanups: (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/753426 (owner: DCausse)
[10:10:00] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 -
https://phabricator.wikimedia.org/T294120 (Jelto)
[10:10:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T297191)', diff saved to https://phabricator.wikimedia.org/P18629 and previous config saved to /var/cache/conftool/dbconfig/20220112-101018-marostegui.json
[10:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:23] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[10:14:16] (PS5) Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948)
[10:14:18] (PS6) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948)
[10:14:20] (PS4) Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948)
[10:14:22] (PS4) Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948)
[10:20:47] (PS1) Cparle: Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208)
[10:21:59] (CR) jerkins-bot: [V: -1] Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P18630 and previous config saved to
/var/cache/conftool/dbconfig/20220112-102523-marostegui.json
[10:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:23] (PS2) Cparle: Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208)
[10:26:59] (PS1) Marostegui: db1128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/753430 (https://phabricator.wikimedia.org/T295965)
[10:27:11] (CR) jerkins-bot: [V: -1] Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:28:41] (PS3) Cparle: Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208)
[10:29:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1128 in s1 with minimal weight T295965', diff saved to https://phabricator.wikimedia.org/P18631 and previous config saved to /var/cache/conftool/dbconfig/20220112-102938-marostegui.json
[10:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:44] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[10:30:51] (CR) Marostegui: [C: +2] db1128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/753430 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[10:31:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P18632 and previous config saved to /var/cache/conftool/dbconfig/20220112-103144-marostegui.json
[10:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:05] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply on main
[10:32:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:24] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: sync on main
[10:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:30] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply on main
[10:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:33] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply on main
[10:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:39] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply on main
[10:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:42] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply on main
[10:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:48] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply on main
[10:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:08] SRE, Infrastructure-Foundations: Write a cookbook to align the "master-capable" state of Ganeti nodes - https://phabricator.wikimedia.org/T299034 (MoritzMuehlenhoff)
[10:34:14] (CR) Matthias Mullie: [C: +1] Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:34:57] SRE, Infrastructure-Foundations: Write a cookbook to align the "master-capable" state of Ganeti nodes - https://phabricator.wikimedia.org/T299034 (MoritzMuehlenhoff) p:Triage→Medium
[10:36:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1128 in s1 with minimal weight T295965', diff saved to
https://phabricator.wikimedia.org/P18633 and previous config saved to /var/cache/conftool/dbconfig/20220112-103619-marostegui.json
[10:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:23] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[10:37:57] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: sync on main
[10:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:03] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply on main
[10:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:25] (CR) Jbond: hieradata - cloud: add config for prefies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753117 (owner: Jbond)
[10:39:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: sync on main
[10:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:24] (CR) Cparle: [C: +2] Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:40:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P18634 and previous config saved to /var/cache/conftool/dbconfig/20220112-104028-marostegui.json
[10:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:07] (Merged) jenkins-bot: Modify mediasearch tab order on beta commons for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753429 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:41:51] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply on main
[10:41:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:23] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply on main
[10:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:25] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply on main
[10:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:33] (PS1) Elukey: role::kafka::jumbo::broker: move to fixed uid/gid for kafka [puppet] - https://gerrit.wikimedia.org/r/753432 (https://phabricator.wikimedia.org/T296990)
[10:47:20] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply on main
[10:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:41] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33203/console" [puppet] - https://gerrit.wikimedia.org/r/753432 (https://phabricator.wikimedia.org/T296990) (owner: Elukey)
[10:48:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[10:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[10:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:53] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: sync on main
[10:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on
pinkunicorn
[10:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:35] (PS1) Cparle: Switch audio and video mediasearch tabs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753433 (https://phabricator.wikimedia.org/T284208)
[10:50:03] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply on main
[10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:05] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply on main
[10:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM dbmonitor1002.wikimedia.org
[10:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:35] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply on main
[10:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:42] (CR) Matthias Mullie: [C: +1] Switch audio and video mediasearch tabs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753433 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:52:07] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: sync on main
[10:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:46] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff)
[10:53:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dbmonitor1002.wikimedia.org
[10:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:22] (CR) Ayounsi: bgpalerter: add new class to configure bgpalerter (15 comments) [puppet] - https://gerrit.wikimedia.org/r/753102
(https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[10:54:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[10:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:55] (CR) Cparle: [C: +2] Switch audio and video mediasearch tabs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753433 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:55:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T297191)', diff saved to https://phabricator.wikimedia.org/P18635 and previous config saved to /var/cache/conftool/dbconfig/20220112-105532-marostegui.json
[10:55:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[10:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[10:55:37] (Merged) jenkins-bot: Switch audio and video mediasearch tabs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/753433 (https://phabricator.wikimedia.org/T284208) (owner: Cparle)
[10:55:37] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[10:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T297191)', diff saved to https://phabricator.wikimedia.org/P18636 and previous config saved to /var/cache/conftool/dbconfig/20220112-105540-marostegui.json
[10:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:51] !log marostegui@cumin1001 dbctl commit (dc=all):
'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18637 and previous config saved to /var/cache/conftool/dbconfig/20220112-105650-marostegui.json
[10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:53] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[10:58:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[10:58:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[10:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:11] !log rebalance ganeti/codfw row B (all nodes reimaged to Buster)
[10:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[10:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM dborch1001.wikimedia.org
[11:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dborch1001.wikimedia.org
[11:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T297191)', diff saved to https://phabricator.wikimedia.org/P18638 and previous config saved to /var/cache/conftool/dbconfig/20220112-110508-marostegui.json
[11:05:11] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:12] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[11:06:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:06:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:35] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) >>! In T292322#7603446, @tstarling wrote: > Is the procedure the one documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments...
[11:11:24] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[11:18:35] (PS1) Giuseppe Lavagetto: shellbox-media: bump cpus available, reduce the number of pods [deployment-charts] - https://gerrit.wikimedia.org/r/753435 (https://phabricator.wikimedia.org/T292322)
[11:20:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P18639 and previous config saved to /var/cache/conftool/dbconfig/20220112-112013-marostegui.json
[11:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:05] !log move kafka-jumbo nodes to fixed kafka uid/gid - T296990
[11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:08] T296990: Move kafka-jumbo to a fixed uid/gid - https://phabricator.wikimedia.org/T296990
[11:21:10] (CR) Elukey: [V: +1 C: +2] role::kafka::jumbo::broker: move to fixed uid/gid for kafka [puppet] - https://gerrit.wikimedia.org/r/753432 (https://phabricator.wikimedia.org/T296990) (owner: Elukey)
[11:22:34] (PS1) Kormat: wmfdb/mycnf: Drop unnecessary tuple return for get_* [software/wmfdb] - https://gerrit.wikimedia.org/r/753437
[11:31:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18640 and previous config saved to /var/cache/conftool/dbconfig/20220112-113119-marostegui.json
[11:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:23] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[11:31:48] (CR) Giuseppe Lavagetto: [C: +2] shellbox-media: bump cpus available, reduce the number of pods [deployment-charts] - https://gerrit.wikimedia.org/r/753435 (https://phabricator.wikimedia.org/T292322) (owner:
10Giuseppe Lavagetto) [11:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P18641 and previous config saved to /var/cache/conftool/dbconfig/20220112-113518-marostegui.json [11:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:22] (03Merged) 10jenkins-bot: shellbox-media: bump cpus available, reduce the number of pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/753435 (https://phabricator.wikimedia.org/T292322) (owner: 10Giuseppe Lavagetto) [11:37:34] (03CR) 10Klausman: [C: 03+1] wmfdb/mycnf: Drop unnecessary tuple return for get_* [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753437 (owner: 10Kormat) [11:37:45] (03CR) 10Kormat: [C: 03+2] wmfdb/mycnf: Drop unnecessary tuple return for get_* [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753437 (owner: 10Kormat) [11:39:16] (03Merged) 10jenkins-bot: wmfdb/mycnf: Drop unnecessary tuple return for get_* [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753437 (owner: 10Kormat) [11:41:45] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:54] (03PS7) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:41:56] (03PS5) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [11:41:58] (03PS5) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [11:42:13] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main 
[11:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:42] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync on main [11:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:16] (03PS6) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [11:45:18] (03PS8) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:45:20] (03PS6) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [11:45:22] (03PS6) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [11:45:24] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [11:50:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T297191)', diff saved to https://phabricator.wikimedia.org/P18642 and previous config saved to /var/cache/conftool/dbconfig/20220112-115024-marostegui.json [11:50:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:50:28] T297191: Schema 
change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T297191)', diff saved to https://phabricator.wikimedia.org/P18643 and previous config saved to /var/cache/conftool/dbconfig/20220112-115031-marostegui.json [11:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18644 and previous config saved to /var/cache/conftool/dbconfig/20220112-115259-marostegui.json [11:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:02] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [11:53:52] (03PS1) 10Ladsgroup: Disable flaggedrevs stable template inclusion in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753441 (https://phabricator.wikimedia.org/T226054) [11:54:49] jouncebot: nowandnext [11:54:49] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [11:54:50] In 0 hour(s) and 5 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1200) [11:55:00] (03PS1) 104nn1l2: fawiki: Add extendedmover usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753442 (https://phabricator.wikimedia.org/T299038) [11:58:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM releases1002.eqiad.wmnet [11:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:27] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, 10Patch-For-Review: Support large files in Shellbox - 
https://phabricator.wikimedia.org/T292322 (10Joe) I tried to give more resources to the shellbox container, but that didn't matter much - I guess the shellout we're running is single-threaded... [12:00:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T297191)', diff saved to https://phabricator.wikimedia.org/P18645 and previous config saved to /var/cache/conftool/dbconfig/20220112-120000-marostegui.json [12:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1200). [12:00:05] cormacparle, WMDE-Fisch, and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] hi [12:00:13] * cormacparle waves [12:00:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM releases1002.eqiad.wmnet [12:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:00] @urbanecm is it you running the deployment again today? [12:01:29] i can i guess, if no one else's around [12:01:54] \o [12:02:14] hey WMDE-Fisch [12:03:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doc1001.eqiad.wmnet [12:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:38] cormacparle: are we backporting to wmf.17 only? 
[12:04:52] (03PS1) 10Matthias Mullie: Revert "Revert "Update the way the search interface is set"" [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753083 [12:05:07] I thought we were, but we need the patch Matthias just added too [12:05:17] once the maint script runs [12:05:55] (03CR) 10Urbanecm: [C: 03+2] Allow aliases to be integers in addition to strings [extensions/TemplateData] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/752775 (https://phabricator.wikimedia.org/T298795) (owner: 10Awight) [12:05:59] (03CR) 10Urbanecm: [C: 03+2] fawiki: Add extendedmover usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753442 (https://phabricator.wikimedia.org/T299038) (owner: 104nn1l2) [12:06:15] so we'll still need sync-world after the .16 patch is merged [12:06:22] cormacparle: I'm confused about the maint script and the other backport [12:06:32] sorry, I know it's confusing :/ [12:06:37] a) why does the maint script only need to be backported to wmf.17? [12:06:46] (03Merged) 10jenkins-bot: fawiki: Add extendedmover usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753442 (https://phabricator.wikimedia.org/T299038) (owner: 104nn1l2) [12:06:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doc1001.eqiad.wmnet [12:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] b) likewise, why does the revert of revert need to be backported only in wmf.16? [12:07:26] a) we only need to run the maint script once, and because we're on .17 now we backported a fix for yesterday's problem to .17 [12:07:56] don't you need to run it once for every wiki? [12:08:00] or is once for _any_ wiki fine? [12:08:06] just commons [12:08:09] it's only relevant there [12:08:19] but commons is still on wmf.16? 
[12:08:23] that's what confuses me [12:08:32] will be promoted to .17 later today [12:08:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doc1002.eqiad.wmnet [12:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:47] we want to run the script before it gets promoted [12:09:21] as for question b) the code that we're backporting to .16 (via the revert of the revert) is already on .17 [12:09:28] i see [12:09:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18646 and previous config saved to /var/cache/conftool/dbconfig/20220112-120931-marostegui.json [12:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:35] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [12:10:02] a is still confusing though. The change to the script you scheduled is backported to wmf.17, so you can't run that on commons before it is on wmf.17 (so you'd need to wait after promotion) [12:10:04] or am i missing something? [12:10:31] nn1l2: hi, can you test your patch at mwdebug1001 please? [12:10:33] hmm maybe I'm missing something ... [12:10:37] urbanecm: to answer the same question about the TemplateData patch, the idea is that it's only semi-urgent so going out this week but not immediately would be ideal. And I can deploy this myself after the others, if you'd like. [12:10:49] yeah you're right @urbanecm [12:10:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doc1002.eqiad.wmnet [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:08] awight_: will ping you once I'm all done though :) [12:11:14] urbanecm: ty! [12:11:19] *then [12:11:22] np :) [12:11:37] cormacparle: so...revert? [12:11:53] myself and Matthias are talking this through @urbanecm ... 
sorry for the messing, just give us a few mins [12:12:08] LGTM [12:12:14] thanks nn1l2, syncing [12:12:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:59] (03PS2) 10Matthias Mullie: Revert "Revert "Update the way the search interface is set"" [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753083 [12:13:14] cormacparle: I'm not sure if your message was meant for this channel. Happy to wait though. [12:13:37] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cfe389afce8037121f8e8b672f4fdf2458a068dd: fawiki: Add extendedmover usergroup (T299038) (duration: 01m 08s) [12:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:40] T299038: Create extendedmover usergroup on Farsi Wikipedia - https://phabricator.wikimedia.org/T299038 [12:13:44] nn1l2: your change is live [12:13:47] (03PS3) 10Matthias Mullie: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753083 [12:13:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:13:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:54] Thanks, everything is okay :) [12:15:05] good :) [12:15:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P18647 and previous config saved to /var/cache/conftool/dbconfig/20220112-121505-marostegui.json [12:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:07] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:12] (03CR) 10Cparle: [C: 03+1] Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753083 (owner: 10Matthias Mullie) [12:18:24] (03Abandoned) 10Matthias Mullie: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753083 (owner: 10Matthias Mullie) [12:21:07] @urbanecm it looks like the simplest thing to do is for us to undo the change to .17 that we want to run the maint script for, and then redo it tomorrow when the branches are in a less complex state [12:21:35] we're just a bit unsure about the i18n stuff [12:21:50] so revert the .17 backport Matthias merged earlier today and that's it for today? [12:22:36] that backport was just for the maint script [12:22:43] we'll need to merge this instead https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/753427 [12:23:04] (03Merged) 10jenkins-bot: Allow aliases to be integers in addition to strings [extensions/TemplateData] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/752775 (https://phabricator.wikimedia.org/T298795) (owner: 10Awight) [12:23:05] @urbanecm that code can stay (doesn't affect anything), but we do need to get rid of some other code (already merged in master, assuming maint script would've run) [12:23:23] this patch does that, and would have to get out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/753427 [12:23:37] the only thing is - will the i18n changes for existing .17 code have already been cached for .17 on commons? [12:24:24] matthiasmullie: well, it needs to be either reverted or deployed -- as you merged, but did not deploy (!) 
[12:25:06] it's ok to sync that other patch (it'll be a no-op) along with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/753427 [12:25:14] i see [12:25:23] cormacparle: I'm not sure i understand your question [12:25:31] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:25:32] the i18n cache is global for a wmf branch [12:25:51] ok that means we'll need sync-world :/ [12:25:57] yeah [12:26:01] any i18n changes need sync-world [12:27:14] so, to confirm, we're deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/753427 (and the no-op maint script patch) and that's it? [12:27:18] cormacparle: matthiasmullie ^ [12:27:34] correct [12:27:35] yes, and we're not running the maint script after all [12:27:38] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [12:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18648 and previous config saved to /var/cache/conftool/dbconfig/20220112-122742-marostegui.json [12:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:46] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [12:27:49] okay [12:28:21] so then, I suggest we +2 the patch, hand over to a_wight to do his TD stuff and then continue [12:28:25] does that sound good cormacparle matthiasmullie? 
[12:28:30] sure [12:28:34] (03CR) 10Cparle: [C: 03+2] Undo update to the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753427 (owner: 10Matthias Mullie) [12:28:40] sounds good [12:28:52] awight: in that case, please do your TemplateData stuff :) [12:29:00] please note deploy1002 has an undeployed patch in wmf.17 for MediaSearch [12:29:18] (but as long as you don't do sync-world or otherwise sync the extension's wmf.17 folder, it should be fine) [12:30:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P18649 and previous config saved to /var/cache/conftool/dbconfig/20220112-123010-marostegui.json [12:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:59] (03PS1) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [12:31:10] (03CR) 10Jbond: "follow up change https://gerrit.wikimedia.org/r/c/operations/puppet/+/753445" [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [12:31:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:31:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:33] awight: are you there? 
:) [12:31:40] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [12:32:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:36] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff This is complete, the entire Ganeti cluster in codfw is on Buster. [12:34:40] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) [12:37:11] (03PS2) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [12:37:52] (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: weblight: support debian buster [puppet] - 10https://gerrit.wikimedia.org/r/753446 (https://phabricator.wikimedia.org/T277653) [12:37:54] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [12:40:10] (03PS3) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [12:40:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: grid: weblight: support debian buster [puppet] - 10https://gerrit.wikimedia.org/r/753446 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [12:42:28] urbanecm: alert failure. Yes, I can pick up the last patch, thanks! 
[12:42:41] go ahead awight, waiting on CI now [12:45:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T297191)', diff saved to https://phabricator.wikimedia.org/P18650 and previous config saved to /var/cache/conftool/dbconfig/20220112-124514-marostegui.json [12:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:18] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [12:46:12] (03Merged) 10jenkins-bot: Undo update to the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753427 (owner: 10Matthias Mullie) [12:46:18] WMDE-Fisch: TemplateData change is ready to test on mw1001 [12:46:26] mwdebug1001 [12:46:42] awight: I'll see [12:48:17] seems to work, see https://test.wikipedia.org/wiki/Template:Test [12:48:30] !log removing orphan lint error reports in all wikis (T298782) [12:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:33] T298782: Linter seems to be not cleaning up after page deletion - https://phabricator.wikimedia.org/T298782 [12:48:41] and fails with old code ✅ [12:50:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM failoid1002.eqiad.wmnet [12:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:48] !log awight@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/TemplateData: Backport: [[gerrit:752775|Allow aliases to be integers in addition to strings (T298795)]] (duration: 01m 07s) [12:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:51] T298795: breaking change in templatedata: asserting an alias must be a string - https://phabricator.wikimedia.org/T298795 [12:50:57] awight: ACK [12:51:25] urbanecm: done. Thanks for deploying the more challenging patches :-) -- anything left that I can help with? 
[12:51:34] i don't think so :) [12:51:52] !log EU deployment complete [12:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P18651 and previous config saved to /var/cache/conftool/dbconfig/20220112-125208-marostegui.json [12:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:13] oh hang on awight, EU deployment is not complete [12:52:22] we still have one [12:52:36] !log EU deployment reopened :-) [12:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:40] :)) [12:52:43] cormacparle: thanks for the note! [12:52:49] sorry awight, pinged you in the middle :D [12:52:50] heh, not at all :) [12:53:44] so @urbanecm ... that patch has successfully merged [12:53:53] o, did it already? [12:54:00] let's do it then [12:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist group from s7 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18652 and previous config saved to /var/cache/conftool/dbconfig/20220112-125402-marostegui.json [12:54:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM failoid1002.eqiad.wmnet [12:54:04] yeah just now [12:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:06] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [12:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:26] cormacparle: pulled to mwdebug1001 [12:54:41] checking ... 
[12:55:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18653 and previous config saved to /var/cache/conftool/dbconfig/20220112-125552-root.json [12:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:05] ok looks fine [12:56:08] ... but ... [12:56:17] but? [12:56:25] (03PS1) 10Zfilipin: selenium: Delete all tests [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/753449 (https://phabricator.wikimedia.org/T299047) [12:56:30] commons is still on .16, so I wouldn't be able to see it if it was broken :/ [12:56:45] but that's the best we can hope for atm [12:57:25] so you can go ahead and sync I think [12:57:28] test-commons? [12:57:38] (03CR) 1020after4: [C: 03+2] selenium: Delete all tests [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/753449 (https://phabricator.wikimedia.org/T299047) (owner: 10Zfilipin) [12:57:44] that's on wmf.17 cormacparle [12:57:46] https://test-commons.wikimedia.org/wiki/Special:Version [12:57:50] aha excellent! [12:57:53] checking now [12:57:55] (03CR) 1020after4: [V: 03+2 C: 03+2] selenium: Delete all tests [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/753449 (https://phabricator.wikimedia.org/T299047) (owner: 10Zfilipin) [12:58:58] "The wiki is scheduled to be closed and deleted in December 2019" :/ [12:59:00] ok works fine @urbanecm [12:59:14] haha yes @taavi ... lucky for us it hasn't been deleted yet! [12:59:38] taavi: well, we never truly _deleted_ a wiki [12:59:58] anyway, syncing [13:00:28] ok great - will take approx 20mins for sync-world to finish so I can check on test-commons, right? 
[13:00:35] yeah [13:00:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM debmonitor1002.eqiad.wmnet [13:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:39] start and beginning is logged here [13:00:47] I'll just do it, since no important wiki is on wmf.17 now [13:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18654 and previous config saved to /var/cache/conftool/dbconfig/20220112-130050-marostegui.json [13:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:53] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [13:01:14] (it'll throw a bit, but not much, as test-commons has very little traffic) [13:01:35] !log urbanecm@deploy1002 Started scap: 4b1e241: Undo update to the way the search interface is set [13:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:15] (03PS1) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:04:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM debmonitor1002.eqiad.wmnet [13:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:13] let me know once you all are done. 
I have a series of deployments 😈 [13:08:49] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1003.eqiad.wmnet [13:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:53] Amir1: ack [13:10:32] (03PS6) 10BBlack: drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [13:10:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18655 and previous config saved to /var/cache/conftool/dbconfig/20220112-131056-root.json [13:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1003.eqiad.wmnet [13:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:23] (03PS7) 10BBlack: drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) [13:12:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [13:14:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard1002.eqiad.wmnet [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:06] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1004.eqiad.wmnet [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard1002.eqiad.wmnet 
[13:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:19] @urbanecm the i18n messages relevant to us seem to have been updated now [13:20:54] !log urbanecm@deploy1002 Finished scap: 4b1e241: Undo update to the way the search interface is set (duration: 19m 19s) [13:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:05] cormacparle: yeah, it was literally about to finish [13:21:08] So we're done now i think [13:21:17] cormacparle: can you confirm before I hand the control over? [13:23:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1004.eqiad.wmnet [13:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:20] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10cmooney) a:03cmooney Hi Michael. That should be no problem. To confirm the key via a separately authenticated channel I sent you a message on Slack, please respond to my... 
[13:24:35] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10cmooney) p:05Triage→03Medium [13:25:20] @urbanecm yes all good [13:25:27] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10elukey) [13:25:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18656 and previous config saved to /var/cache/conftool/dbconfig/20220112-132600-root.json [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:14] Great cormacparle, thanks [13:26:16] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10cmooney) 05Open→03Resolved Super thanks for confirming! 
And other problems just let me know :) [13:26:19] @Amir1: floor is yours [13:26:59] Thanks [13:27:25] (03PS2) 10Ladsgroup: Disable flaggedrevs stable template inclusion in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753441 (https://phabricator.wikimedia.org/T226054) [13:27:35] (03CR) 10Ladsgroup: [C: 03+2] Disable flaggedrevs stable template inclusion in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753441 (https://phabricator.wikimedia.org/T226054) (owner: 10Ladsgroup) [13:28:18] (03Merged) 10jenkins-bot: Disable flaggedrevs stable template inclusion in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753441 (https://phabricator.wikimedia.org/T226054) (owner: 10Ladsgroup) [13:30:26] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753441|Disable flaggedrevs stable template inclusion in ruwikisource (T226054)]] (duration: 01m 08s) [13:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:29] T226054: FlaggedRevs disabled for NS "Wikisource", "MediaWiki" and "File" but it still requiring reviews for transcludes - https://phabricator.wikimedia.org/T226054 [13:30:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:26] (03PS4) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [13:31:28] (03PS2) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:31:30] (03PS1) 10Jbond: O:bgpalerter: add rpki as a class paramter [puppet] - 10https://gerrit.wikimedia.org/r/753452 (https://phabricator.wikimedia.org/T230600) [13:32:17] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: Add email alerting 
and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:32:39] (03PS1) 10MSantos: maps: disable every OSM cron to perform re-import of data [puppet] - 10https://gerrit.wikimedia.org/r/753453 [13:32:49] (03CR) 10jerkins-bot: [V: 04-1] O:bgpalerter: add rpki as a class paramter [puppet] - 10https://gerrit.wikimedia.org/r/753452 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:34:06] (03PS2) 10MSantos: maps: disable every OSM cron to perform re-import of data [puppet] - 10https://gerrit.wikimedia.org/r/753453 (https://phabricator.wikimedia.org/T299049) [13:34:32] (03CR) 10Jgiannelos: [C: 03+1] maps: disable every OSM cron to perform re-import of data [puppet] - 10https://gerrit.wikimedia.org/r/753453 (https://phabricator.wikimedia.org/T299049) (owner: 10MSantos) [13:35:51] (03PS3) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:37:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:37:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:46] (03PS2) 10Jbond: O:bgpalerter: add rpki as a class paramter [puppet] - 10https://gerrit.wikimedia.org/r/753452 (https://phabricator.wikimedia.org/T230600) [13:40:34] (03PS5) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [13:40:42] (03PS4) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:41:04] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18657 and previous config saved to /var/cache/conftool/dbconfig/20220112-134103-root.json [13:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T297191)', diff saved to https://phabricator.wikimedia.org/P18658 and previous config saved to /var/cache/conftool/dbconfig/20220112-134620-marostegui.json [13:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:23] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [13:47:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T297191)', diff saved to https://phabricator.wikimedia.org/P18659 and previous config saved to /var/cache/conftool/dbconfig/20220112-134727-marostegui.json [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:11] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
START helmfile.d/services/mwdebug: apply on pinkunicorn [13:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:45] (03PS1) 10Cathal Mooney: Add LDAP-only user Ntsako Maphophe to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/753455 (https://phabricator.wikimedia.org/T298868) [13:50:39] (03CR) 10Cathal Mooney: [C: 03+2] Add LDAP-only user Ntsako Maphophe to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/753455 (https://phabricator.wikimedia.org/T298868) (owner: 10Cathal Mooney) [13:51:24] (03PS5) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:51:32] (03CR) 10Jbond: [C: 03+2] O:bgpalerter: add rpki as a class paramter [puppet] - 10https://gerrit.wikimedia.org/r/753452 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:51:49] (03CR) 10Ottomata: "To keep things consistent, can we do this for all eventgate services?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [13:52:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:52:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] (03CR) 10Elukey: helmfile.d: move eventgate-analytics* to the WMF CA cert bundle (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [13:57:03] (03PS6) 10Jbond: O:rpkivalidator: 
add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:57:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33206/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:58:27] (03PS6) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [13:58:34] (03PS7) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18661 and previous config saved to /var/cache/conftool/dbconfig/20220112-135858-marostegui.json [13:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:26] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [13:59:58] (03PS4) 10Ladsgroup: Merge db-codfw.php and db-eqiad.php into db-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [14:02:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf1001.eqiad.wmnet [14:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P18662 and previous config saved to /var/cache/conftool/dbconfig/20220112-140232-marostegui.json [14:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to cn=wmf and cn=ops for 
Nmaphophe - https://phabricator.wikimedia.org/T298868 (10cmooney) @ntsako I have added your LDAP account to the 'WMF' group as requested. I believe this is all that is required for access to the systems you list, and... [14:05:28] (03PS2) 10Elukey: helmfile.d: move eventgate* to the WMF CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) [14:06:05] (03CR) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [14:06:07] (03PS8) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [14:06:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:07:42] (03PS1) 10Jbond: bgpalerter: update hiera [puppet] - 10https://gerrit.wikimedia.org/r/753460 [14:07:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf1001.eqiad.wmnet [14:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:03] 10SRE, 10Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054 (10ema) [14:08:09] (03CR) 10Jbond: [C: 03+2] bgpalerter: update hiera [puppet] - 10https://gerrit.wikimedia.org/r/753460 (owner: 10Jbond) [14:08:17] 10SRE, 10Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054 (10ema) p:05Triage→03Low [14:08:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] bgpalerter: update hiera [puppet] - 10https://gerrit.wikimedia.org/r/753460 (owner: 10Jbond) [14:08:25] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to cn=wmf and cn=ops for 
Nmaphophe - https://phabricator.wikimedia.org/T298868 (10ntsako) Hi @cmooney Thanks for adding me. Please note that the changes might not have synced yet. I checked using this url: https://ldap.toolforge.org/user/ntsa... [14:08:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:09:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf1002.eqiad.wmnet [14:09:41] (03CR) 10Ladsgroup: [C: 03+2] Merge db-codfw.php and db-eqiad.php into db-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [14:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:49] (03PS1) 10Elukey: custom_deploy.d: improve istio ingress gateway's config for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) [14:10:26] marostegui: deploying db-*.php merge now [14:10:48] (03PS7) 10Jbond: bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) [14:10:54] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Make varnish-frontend-restart work on Beta Cluster - https://phabricator.wikimedia.org/T299054 (10Majavah) [14:11:32] (03Merged) 10jenkins-bot: Merge db-codfw.php and db-eqiad.php into db-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [14:11:44] Amir1: ok! 
[14:11:46] (03CR) 10Jbond: [C: 03+2] bgpalerter: Add email alerting and tweek default config [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [14:12:06] (03PS2) 10Elukey: custom_deploy.d: improve istio ingress gateway's config for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) [14:12:15] (03PS1) 10Cathal Mooney: Add Marco Fossati to LDAP WMF Group [puppet] - 10https://gerrit.wikimedia.org/r/753463 (https://phabricator.wikimedia.org/T298766) [14:13:28] (03CR) 10Elukey: "Tested on ml-serve-eqiad:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:13:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:17] !log ladsgroup@deploy1002 Synchronized wmf-config/db-production.php: Config: [[gerrit:702421|Merge db-codfw.php and db-eqiad.php into db-production.php (T260297)]], Part I (duration: 01m 07s) [14:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:21] T260297: db-eqiad and db-codfw sectionsByLoad can get out of sync - https://phabricator.wikimedia.org/T260297 [14:14:32] (03PS2) 10Majavah: P:mw::maintenance: add centralauth group purge job [puppet] - 10https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) [14:15:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf1002.eqiad.wmnet [14:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:36] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:702421|Merge db-codfw.php and db-eqiad.php into db-production.php (T260297)]], Part II (duration: 01m 08s) [14:15:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:41] (03CR) 10Majavah: P:mw::maintenance: add centralauth group purge job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/752341 (https://phabricator.wikimedia.org/T153815) (owner: 10Majavah) [14:17:11] !log ladsgroup@deploy1002 Synchronized wmf-config: Config: [[gerrit:702421|Merge db-codfw.php and db-eqiad.php into db-production.php (T260297)]], Part III (duration: 01m 07s) [14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:36] > Warning: require_once(/srv/mediawiki/wmf-config/db-eqiad.php): failed to open stream: No such file or directory in /srv/mediawiki/docroot/noc/db.php on line 63 [14:17:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P18663 and previous config saved to /var/cache/conftool/dbconfig/20220112-141736-marostegui.json [14:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:40] https://noc.wikimedia.org/db.php [14:17:43] :/ [14:17:44] (03CR) 10Cathal Mooney: [C: 03+2] Add Marco Fossati to LDAP WMF Group [puppet] - 10https://gerrit.wikimedia.org/r/753463 (https://phabricator.wikimedia.org/T298766) (owner: 10Cathal Mooney) [14:17:46] I fix that [14:18:51] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:19:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:19:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:46] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type 
pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [14:20:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:01] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) @MoritzMuehlenhoff This is a good point; will discuss further with my team today. [14:21:04] (03CR) 10Elukey: "At this point it could be useful to have the same config for the cluster-local gateway, will try and update in case." [deployment-charts] - 10https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:22:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: switch to DRBD disk storage [14:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: switch to DRBD disk storage [14:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:25] !log switch kubestagetcd1004 to DRBD (needed to be able to shuffle instances around for the Ganeti buster update) [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:26] (03CR) 10Ottomata: "Both if possible? Doing it later is fine too." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [14:24:29] (03PS3) 10Elukey: custom_deploy.d: improve istio ingress gateways' config for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) [14:24:55] (03CR) 10Ottomata: helmfile.d: move eventgate* to the WMF CA cert bundle (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [14:25:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:10] (03PS3) 10Elukey: helmfile.d: move eventgate* to the WMF CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) [14:26:19] (03CR) 10Elukey: helmfile.d: move eventgate* to the WMF CA cert bundle (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [14:26:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:26:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:42] (03CR) 10Jelto: [V: 03+1 C: 03+2] "Just to be sure I tested in WMCS what happens if this change is applied to a existing machine with deployment_server::kubernetes role." 
[puppet] - 10https://gerrit.wikimedia.org/r/753026 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:27:45] (03PS1) 10Ladsgroup: docroot: Clean up db.php after db-$dc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753464 (https://phabricator.wikimedia.org/T260297) [14:27:53] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:28:25] (03CR) 10jerkins-bot: [V: 04-1] docroot: Clean up db.php after db-$dc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753464 (https://phabricator.wikimedia.org/T260297) (owner: 10Ladsgroup) [14:29:29] (03Abandoned) 10Ladsgroup: docroot: Clean up db.php after db-$dc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753464 (https://phabricator.wikimedia.org/T260297) (owner: 10Ladsgroup) [14:30:16] (03PS8) 10Jbond: profile::installserver::proxy: update squid template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [14:30:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:27] (03CR) 10Jbond: profile::installserver::proxy: update squid template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:30:29] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:30:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow1002.eqiad.wmnet [14:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T297191)', diff 
saved to https://phabricator.wikimedia.org/P18664 and previous config saved to /var/cache/conftool/dbconfig/20220112-143241-marostegui.json [14:32:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:32:45] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:32:45] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [14:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:59] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depooling db1163 (T297191)', diff saved to https://phabricator.wikimedia.org/P18665 and previous config saved to /var/cache/conftool/dbconfig/20220112-143258-marostegui.json [14:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:07] 7 [14:33:10] nope [14:33:40] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, 10Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) p:05Triage→03High [14:36:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T297191)', diff saved to https://phabricator.wikimedia.org/P18666 and previous config saved to /var/cache/conftool/dbconfig/20220112-143606-marostegui.json [14:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow1002.eqiad.wmnet [14:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:30] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging [14:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:34] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production [14:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:41] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: sync on staging [14:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [14:39:40] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10Jclark-ctr) @ papaul i had not finished green cables and are mostly going same 
direction and was hoping i could use that temp so possibly would not block others from starting [14:40:02] (03CR) 10Ottomata: [C: 03+1] helmfile.d: move eventgate* to the WMF CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [14:40:11] !log remove helm2 from deployment_server T251305 https://gerrit.wikimedia.org/r/c/operations/puppet/+/753026 [14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:15] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 [14:41:46] (03PS1) 10Kormat: wmfdb/mycnf: Allow Cnf.pymysql_conn_args to take kwargs [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753466 [14:42:11] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main [14:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:29] (03PS5) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [14:42:33] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: helm-repo-update.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:47] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync on main [14:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:03] ^ I'll take a look at deploy1002, should be because of helm2 removal [14:44:55] (03CR) 10Hnowlan: [C: 03+2] maps: disable every OSM cron to perform re-import of data [puppet] - 10https://gerrit.wikimedia.org/r/753453 (https://phabricator.wikimedia.org/T299049) (owner: 10MSantos) [14:44:59] (03CR) 10Klausman: [C: 03+1] wmfdb/mycnf: Allow Cnf.pymysql_conn_args to take kwargs [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753466 (owner: 10Kormat) [14:45:16] 
(03CR) 10Kormat: [C: 03+2] wmfdb/mycnf: Allow Cnf.pymysql_conn_args to take kwargs [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753466 (owner: 10Kormat) [14:46:40] (03CR) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:47:05] (03Merged) 10jenkins-bot: wmfdb/mycnf: Allow Cnf.pymysql_conn_args to take kwargs [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753466 (owner: 10Kormat) [14:49:19] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:51:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P18667 and previous config saved to /var/cache/conftool/dbconfig/20220112-145111-marostegui.json [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:19] (03PS1) 10David Caro: wmcs: Added README [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753469 [14:52:41] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (10cmooney) @mfossati I have added you to the required LDAP group now. Can you test your access and advise if it is working? Thanks. 
[14:54:03] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main [14:54:06] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply on main [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:42] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main [14:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:14] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync on main [14:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: Added README [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753469 (owner: 10David Caro) [14:59:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1005.eqiad.wmnet with reason: switch to DRBD disk storage [14:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1005.eqiad.wmnet with reason: switch to DRBD disk storage [14:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:41] !log switch kubestagetcd1005 to DRBD (needed to be able to shuffle instances around for the Ganeti buster update) [14:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:21] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:44] (03CR) 10David Caro: [C: 03+2] wmcs: Added README [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753469 (owner: 10David Caro) [15:03:53] PROBLEM - Check systemd state 
on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-repo-update.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:19] ^ I'll take a look
[15:06:00] (PS1) ArielGlenn: Partial revert of I1a691f01cd82e60bf41207d32501edb4b9835e37 to unbreak dumps [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753085 (https://phabricator.wikimedia.org/T299020)
[15:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P18668 and previous config saved to /var/cache/conftool/dbconfig/20220112-150616-marostegui.json
[15:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:55] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: imposm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:23] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:22] ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: imposm.service Hnowlan Disabled for planet reimport https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:09] !log stop kafka* on kafka-main1001 to allow dcops maintenance (nic/bios upgrades) - T298867
[15:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:14] T298867: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867
[15:14:23] (PS1) Cparle: Revert "Undo update to the way the search interface is set" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753487
[15:18:43] SRE, LDAP-Access-Requests: Grant Access to cn=wmf and cn=ops for Nmaphophe -
https://phabricator.wikimedia.org/T298868 (ntsako) Hi @cmooney, Please note that I get the below error message when I try to access the below: "Service access denied due to missing privileges." - https://superset.wikimedia.org...
[15:19:47] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[15:20:59] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) I removed `helm2` from `deploy1001` and `deploy2001` by merging https://gerrit.wikimedia.org/r/753026. I tested the removal before on WMCS and a temporary pontoon setup before (se...
[15:21:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T297191)', diff saved to https://phabricator.wikimedia.org/P18669 and previous config saved to /var/cache/conftool/dbconfig/20220112-152121-marostegui.json
[15:21:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[15:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[15:21:25] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[15:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[15:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[15:21:29] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:32] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[15:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T297191)', diff saved to https://phabricator.wikimedia.org/P18670 and previous config saved to /var/cache/conftool/dbconfig/20220112-152133-marostegui.json
[15:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T297191)', diff saved to https://phabricator.wikimedia.org/P18671 and previous config saved to /var/cache/conftool/dbconfig/20220112-152240-marostegui.json
[15:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch
[15:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:40] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch
[15:31:29] (CR) TChin: [C: +1] Partial revert of I1a691f01cd82e60bf41207d32501edb4b9835e37 to unbreak dumps [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753085 (https://phabricator.wikimedia.org/T299020) (owner: ArielGlenn)
[15:32:57] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:35:01] (CR) Klausman: [C: +1] custom_deploy.d: improve istio ingress gateways' config for ml-serve [deployment-charts] -
https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[15:35:13] (CR) JMeybohm: [C: +1] "Looks sane to me" [deployment-charts] - https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[15:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P18672 and previous config saved to /var/cache/conftool/dbconfig/20220112-153745-marostegui.json
[15:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:51] (PS5) Jbond: kerberos: manage users with custom puppet type [puppet] - https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: Majavah)
[15:38:53] (PS1) Jbond: admin: make admin::kerberos_users more generic [puppet] - https://gerrit.wikimedia.org/r/753479
[15:40:33] SRE, ops-eqiad, DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (Papaul) @Jclark-ctr thanks make sense
[15:52:25] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) a:Joe I ran the command locally (I think!) on mwmaint1002, and it took a comparable time to what it took calling shellbox - apparently th...
[15:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P18673 and previous config saved to /var/cache/conftool/dbconfig/20220112-155250-marostegui.json
[15:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:20] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch
[15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:26] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*...
[15:56:27] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main
[15:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:00] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync on main
[15:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:54] SRE-Access-Requests: Requesting access to RESOURCE for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (ntsako)
[16:02:11] !log stop kafka* on kafka-main1002 to allow dcops maintenance (nic/bios upgrades) - T298867
[16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:15] T298867: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867
[16:05:34] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (wiki_willy) a:Cmjohnson
[16:05:46] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025
(wiki_willy) Hi @Cmjohnson - just a heads up, this one is high priority. Thanks, Willy
[16:05:51] PROBLEM - DPKG on gitlab2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[16:06:13] (CR) JMeybohm: [C: +1] custom_deploy.d: improve istio ingress gateways' config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[16:07:08] (PS2) Cparle: Revert "Undo update to the way the search interface is set" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753487
[16:07:31] SRE, ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (Cmjohnson) Open→Resolved The disk has been replaced and is rebuilding. Resolving this task
[16:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T297191)', diff saved to https://phabricator.wikimedia.org/P18674 and previous config saved to /var/cache/conftool/dbconfig/20220112-160755-marostegui.json
[16:07:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:07:59] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[16:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T297191)', diff saved to https://phabricator.wikimedia.org/P18675 and previous config saved to
/var/cache/conftool/dbconfig/20220112-160802-marostegui.json
[16:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:09:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T297191)', diff saved to https://phabricator.wikimedia.org/P18676 and previous config saved to /var/cache/conftool/dbconfig/20220112-160910-marostegui.json
[16:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:27] (PS1) Alexandros Kosiaris: Depool poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/753506 (https://phabricator.wikimedia.org/T294120)
[16:10:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:11:09] (CR) jerkins-bot: [V: -1] Depool poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/753506 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[16:13:38] (PS2) Alexandros Kosiaris: Depool poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/753506 (https://phabricator.wikimedia.org/T294120)
[16:13:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[16:19:22] (CR) Alexandros Kosiaris: [C: +2] Depool poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/753506 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[16:19:28] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Cmjohnson) @Marostegui at first glance the settings are correct but it's definitely stuck in a weird boot process.
I am updating firmware first and will go from there
[16:19:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1006.eqiad.wmnet with reason: switch to DRBD disk storage
[16:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1006.eqiad.wmnet with reason: switch to DRBD disk storage
[16:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:02] (Merged) jenkins-bot: Depool poolcounter1004 [mediawiki-config] - https://gerrit.wikimedia.org/r/753506 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[16:20:10] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Marostegui) Thank you @Cmjohnson - once it is able to boot up, I can take it from there and attempt a reimage.
[16:20:41] !log switch kubestagetcd1006 to DRBD (needed to be able to shuffle instances around for the Ganeti buster update)
[16:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:19] (CR) Jbond: "See inline for comments."
[puppet] - https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: Majavah)
[16:23:48] (CR) Daniel Kinzler: [C: +1] Partial revert of I1a691f01cd82e60bf41207d32501edb4b9835e37 to unbreak dumps [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753085 (https://phabricator.wikimedia.org/T299020) (owner: ArielGlenn)
[16:23:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[16:24:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P18677 and previous config saved to /var/cache/conftool/dbconfig/20220112-162414-marostegui.json
[16:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:24] !log stop kafka* on kafka-main1003 to allow dcops maintenance (nic/bios upgrades) - T298867
[16:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:27] T298867: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867
[16:25:39] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 16s)
[16:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:31:36] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM poolcounter1004.eqiad.wmnet
[16:31:36] !log mwdebug-deploy@deploy1002
helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:24] (CR) Elukey: [C: +2] custom_deploy.d: improve istio ingress gateways' config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/753461 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[16:34:03] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:35:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mx1001.wikimedia.org
[16:35:25] (PS1) Alexandros Kosiaris: Repool poolcounter1004, depool poolcounter1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/753511 (https://phabricator.wikimedia.org/T294120)
[16:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:59] (CR) Jbond: [V: +1] "thanks see inline" [puppet] - https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: Jbond)
[16:36:46] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM poolcounter1004.eqiad.wmnet
[16:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:53] RECOVERY - DPKG on gitlab2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[16:38:12] (PS2) Jbond: admin: make admin::kerberos_users more generic [puppet] - https://gerrit.wikimedia.org/r/753479
[16:39:08] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33208/console" [puppet] - https://gerrit.wikimedia.org/r/753479 (owner: Jbond)
[16:39:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mx1001.wikimedia.org
[16:39:09] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff)
[16:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P18678 and previous config saved to /var/cache/conftool/dbconfig/20220112-163919-marostegui.json
[16:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:31] !log elukey@prometheus1003:~$ sudo apt-get remove linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64 linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64
[16:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:37] (CR) Alexandros Kosiaris: [C: +2] Repool poolcounter1004, depool poolcounter1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/753511 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[16:39:47] (CR) Jbond: [V: +1] "ready" [puppet] - https://gerrit.wikimedia.org/r/753479 (owner: Jbond)
[16:40:51] (Merged) jenkins-bot: Repool poolcounter1004, depool poolcounter1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/753511 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[16:40:57] (PS3) Sharvaniharan: Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818)
[16:40:58] !log elukey@prometheus1004:~$ sudo apt-get remove linux-image-4.9.0-8-amd64
linux-image-4.9.0-9-amd64 linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64
[16:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:15] SRE-Access-Requests: Requesting access to RESOURCE for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (cmooney) @CMacholan can I get your thumbs up for this one also? thanks.
[16:41:39] SRE, ops-eqiad: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867 (Cmjohnson) Open→Resolved all 3 servers have been updated.
[16:42:41] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[16:43:44] (PS3) Sharvaniharan: Add event stream config for ios.notification_interaction [mediawiki-config] - https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920)
[16:44:35] !log elukey@prometheus2003:~$ sudo apt-get remove linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64 linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64
[16:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:21] RECOVERY - Disk space on prometheus2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops
[16:45:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:30] !log elukey@prometheus2004:~$ sudo apt-get remove linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64 linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64
[16:45:31] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:13] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:46:21] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 21s)
[16:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[16:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:31] RECOVERY - Disk space on prometheus1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops
[16:47:31] RECOVERY - Disk space on prometheus1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[16:47:31] RECOVERY - Disk space on prometheus2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops
[16:47:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[16:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:51] (PS1) Cathal Mooney: Change SSH pubkey for WMDE user migr [puppet] - https://gerrit.wikimedia.org/r/753512 (https://phabricator.wikimedia.org/T269610)
[16:48:00] (CR) JHathaway: [C: +2]
profile::installserver::proxy: update squid template [puppet] - https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[16:48:18] oops, wrong button, gets me everytime :(
[16:48:30] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
[16:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:38] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[16:49:17] (CR) Cathal Mooney: [C: +2] Change SSH pubkey for WMDE user migr [puppet] - https://gerrit.wikimedia.org/r/753512 (https://phabricator.wikimedia.org/T269610) (owner: Cathal Mooney)
[16:53:19] !log Decommissioning cassandra instance restbase2009-c via nodetool
[16:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[16:54:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[16:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:32] !log akosiaris@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM poolcounter1005.eqiad.wmnet
[16:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T297191)', diff saved to https://phabricator.wikimedia.org/P18680 and previous config saved to /var/cache/conftool/dbconfig/20220112-165434-marostegui.json
[16:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
[16:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:39] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[16:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T297191)', diff saved to https://phabricator.wikimedia.org/P18681 and previous config saved to /var/cache/conftool/dbconfig/20220112-165542-marostegui.json
[16:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:49] (PS2) Clare Ming: Add new vector skin key to RelatedArticlesFooterAllowedSkins. [mediawiki-config] - https://gerrit.wikimedia.org/r/753187 (https://phabricator.wikimedia.org/T298916)
[16:58:16] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM poolcounter1005.eqiad.wmnet
[16:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:00] (PS1) Alexandros Kosiaris: Repool poolcounter1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/753519 (https://phabricator.wikimedia.org/T294120)
[16:59:09] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Cmjohnson) Open→Resolved @Marostegui The server was hung up during POST in the memory collection process, I ended up removing all the DIMM"s with the exception of A1 and B1 and the server booted properly, u...
[16:59:25] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (cmooney) This has been updated. @Michael please test and advise if you have any problems. Thanks.
[16:59:40] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Marostegui) Thanks Chris, going to try a reimage then! I will let you know how it goes
[17:00:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bullseye
[17:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:29] (CR) Alexandros Kosiaris: [C: +2] Repool poolcounter1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/753519 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[17:02:12] (Merged) jenkins-bot: Repool poolcounter1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/753519 (https://phabricator.wikimedia.org/T294120) (owner: Alexandros Kosiaris)
[17:05:19] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (akosiaris)
[17:06:14] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 21s)
[17:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[17:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[17:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[17:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[17:10:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:30] SRE, ops-eqiad: db1169 reimage/idrac failure - https://phabricator.wikimedia.org/T299025 (Marostegui) @Cmjohnson the host got reimaged fine. Thank you for fixing this so fast!
[17:10:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P18682 and previous config saved to /var/cache/conftool/dbconfig/20220112-171047-marostegui.json
[17:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:02] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH)
[17:22:40] (CR) Ppchelko: [C: +1] "I think this will work." [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[17:24:51] SRE-Access-Requests: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (MNovotny_WMF)
[17:25:36] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:46] !log dancy@deploy1002 Started scap: testing
[17:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P18683 and previous config saved to /var/cache/conftool/dbconfig/20220112-172551-marostegui.json
[17:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:12] !log dancy@deploy1002 Started scap: testing
[17:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1169.eqiad.wmnet with OS
bullseye [17:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:17] (03CR) 10Hnowlan: "Two questions but mostly lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/753111 (https://phabricator.wikimedia.org/T298246) (owner: 10Jgiannelos) [17:33:51] <_joe_> !lgo deploying scap 4.1.1 to the mediawiki canaries T298986 [17:33:52] T298986: Deploy Scap version 4.1.1 - https://phabricator.wikimedia.org/T298986 [17:33:59] (03PS5) 10Dylsss: Dumps: Clarify licensing for Wikidata and update various links [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) [17:34:39] <_joe_> !log deploying scap 4.1.1 to the mediawiki canaries T298986 [17:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:18] <_joe_> !log deploying scap 4.1.1 to the restbase canaries T298986 [17:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T297191)', diff saved to https://phabricator.wikimedia.org/P18684 and previous config saved to /var/cache/conftool/dbconfig/20220112-174056-marostegui.json [17:40:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [17:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [17:41:00] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [17:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T297191)', diff saved to 
https://phabricator.wikimedia.org/P18685 and previous config saved to /var/cache/conftool/dbconfig/20220112-174103-marostegui.json [17:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10herron) [17:42:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T297191)', diff saved to https://phabricator.wikimedia.org/P18686 and previous config saved to /var/cache/conftool/dbconfig/20220112-174211-marostegui.json [17:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:23] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (10CMacholan) @cmooney approved as well. Thank you and sorry for the delay! [17:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P18687 and previous config saved to /var/cache/conftool/dbconfig/20220112-175715-marostegui.json [17:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P18688 and previous config saved to /var/cache/conftool/dbconfig/20220112-181220-marostegui.json [18:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:15] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:27:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T297191)', diff saved to https://phabricator.wikimedia.org/P18689 and previous config 
saved to /var/cache/conftool/dbconfig/20220112-182725-marostegui.json [18:27:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:27:29] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [18:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [18:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [18:27:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [18:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [18:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with 
reason: Maintenance [18:27:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T297191)', diff saved to https://phabricator.wikimedia.org/P18690 and previous config saved to /var/cache/conftool/dbconfig/20220112-182806-marostegui.json [18:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T297191)', diff saved to https://phabricator.wikimedia.org/P18691 and previous config saved to /var/cache/conftool/dbconfig/20220112-182913-marostegui.json [18:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:28] (03PS4) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [18:33:58] (03CR) 10jenkins-bot: [V: 04-1] assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [18:35:13] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10RKemper) @wiki_willy Yes, please go ahead and ignore/resolve [18:37:19] (03CR) 10Dzahn: [C: 03+2] "thanks, Jelto. 
deploying first codfw-only" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:40:57] !log phab1001 - temp disabling puppet - deployed firewall change on phab2001 - debugging - no impact [18:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:19] well that was nice [18:41:19] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:43:02] (03CR) 10Dzahn: "it removed the 10_ssh_public rule completely when testing first on inactive server..." [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:43:17] RoanKattouw or urbanecm, whoever is running the backport window, I'm first up with a mw core change to wmf17; should I +2 now so that it is merged by the start of the window? [18:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P18692 and previous config saved to /var/cache/conftool/dbconfig/20220112-184418-marostegui.json [18:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:21] apergos: yeah, it's usually fine to +2 a backport before the window, as long as there's no other mediawiki deployments happening at the same time [18:45:05] there's a config change scheduled second in the window [18:45:09] that's all I see [18:46:03] sorry, meant no other mediawiki deployments when you +2 (before the backport window) [18:46:13] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:46:29] jouncebot: next [18:46:29] In 0 hour(s) and 13 minute(s): UTC 
evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1900) [18:46:30] In 0 hour(s) and 13 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1900) [18:46:43] eh, the pybal alert was caused by me then [18:46:48] but nothing to worry [18:48:36] we're in between windows right now, there should be nothing happening [18:48:39] I'll go ahead then [18:48:39] ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn firewall issue on phab2001 - git-ssh https://wikitech.wikimedia.org/wiki/PyBal [18:49:02] (03CR) 10ArielGlenn: [C: 03+2] Partial revert of I1a691f01cd82e60bf41207d32501edb4b9835e37 to unbreak dumps [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753085 (https://phabricator.wikimedia.org/T299020) (owner: 10ArielGlenn) [18:49:58] (03PS1) 10Dzahn: Revert "phabricator: move vcs firewall rules to profile" [puppet] - 10https://gerrit.wikimedia.org/r/753494 [18:51:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [18:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:25] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 268 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:52:42] (03CR) 10Dzahn: [C: 03+2] Revert "phabricator: move vcs firewall rules to profile" [puppet] - 10https://gerrit.wikimedia.org/r/753494 (owner: 10Dzahn) [18:55:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet [18:55:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:37] jouncebot: now [18:55:38] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [18:55:42] jouncebot: next [18:55:42] In 0 hour(s) and 4 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1900) [18:55:42] In 0 hour(s) and 4 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1900) [18:56:44] ACKNOWLEDGEMENT - DNS on mw1376.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.2.135 daniel_zahn need DRAC firmware upgrade https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:56:44] ACKNOWLEDGEMENT - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn need DRAC firmware upgrade https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:56:44] ACKNOWLEDGEMENT - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn need DRAC firmware upgrade https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:58:15] pybal alert should recover soon [18:58:30] fatals alert caused by deployment presumably [18:58:37] unrelated things happening [18:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P18693 and previous config saved to /var/cache/conftool/dbconfig/20220112-185923-marostegui.json [18:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:27] RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:00:05] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC evening backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1900). [19:00:05] apergos and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:05] dduvall and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T1900). [19:00:14] here [19:00:23] o/ [19:00:58] cjming: hey, do you wish to lead the B&C today? :) [19:01:09] (I'm around if needed, but I prefer not to drive) [19:01:46] urbanecm: sure - thanks for being on standby - hopefully it all goes smoothly [19:01:58] (03PS1) 10Dzahn: Revert "Revert "phabricator: move vcs firewall rules to profile"" [puppet] - 10https://gerrit.wikimedia.org/r/753495 [19:02:00] thanks cjming. Shout if i'm needed :) [19:02:09] will do [19:05:06] still waiting for zuul to finish up on mine [19:05:11] apergos: thanks for +2ing - waiting for merge [19:05:16] !log [mwmaint1002:~] $ sudo systemctl status mediawiki_job_updatequerypages_mostlinked_s3@13.service (running fine but had failed for unknown reason last time it was supposed to run automatically) [19:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:32] another 2 minutes in theory [19:05:48] 🤞 [19:06:17] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:33] !log imported jenkins 2.319.2 to thirdparty/ci for buster-wikimedia [19:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:37] ^ hashar [19:06:44] awesome [19:07:54] ACKNOWLEDGEMENT - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-02-10 08:02:21 +0000 
(expires in 28 days) daniel_zahn known issue with cert monitoring non-LE vs LE, based on geo,etc https://phabricator.wikimedia.org/tag/phabricator/ [19:08:25] (03Merged) 10jenkins-bot: Partial revert of I1a691f01cd82e60bf41207d32501edb4b9835e37 to unbreak dumps [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753085 (https://phabricator.wikimedia.org/T299020) (owner: 10ArielGlenn) [19:08:50] merged [19:08:57] cjming: ^ [19:09:04] ACKNOWLEDGEMENT - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn db servers with disabled notifications and one more DRAC firmware upgrade https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:18] yup - rebasing now [19:09:26] !log Upgraded releases Jenkins from 2.319.1 to 2.319.2 # T298691 [19:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:30] T298691: 2022-01-12 Jenkins security advisory pre-announcement - https://phabricator.wikimedia.org/T298691 [19:10:13] apergos: can you test on mwdebug1001? 
[19:10:18] doing [19:11:36] I only tested that basic functionality is working (Special:Export); I can't test the maintenance script that was broken over there, but I have tested that with this patch from a dumps host [19:11:38] thumbs up [19:11:55] cool - syncing now [19:11:58] ty [19:12:10] !log mirror1001 - CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service - T286898 [19:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:14] T286898: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 [19:13:04] ACKNOWLEDGEMENT - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service daniel_zahn https://phabricator.wikimedia.org/T286898 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:29] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.17/includes/export/WikiExporter.php: Backport: [[gerrit:753085|Partial revert of I1a691f01cd82e60bf41207d32501edb4b9835e37 to unbreak dumps (T299020)]] (duration: 01m 22s) [19:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:33] T299020: Exception from dumps on group0 wikis after MediaWiki deployment - https://phabricator.wikimedia.org/T299020 [19:13:43] apergos: should be live! [19:14:09] (03CR) 10Clare Ming: [C: 03+2] Add new vector skin key to RelatedArticlesFooterAllowedSkins. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/753187 (https://phabricator.wikimedia.org/T298916) (owner: 10Clare Ming) [19:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T297191)', diff saved to https://phabricator.wikimedia.org/P18694 and previous config saved to /var/cache/conftool/dbconfig/20220112-191428-marostegui.json [19:14:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:14:34] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [19:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T297191)', diff saved to https://phabricator.wikimedia.org/P18695 and previous config saved to /var/cache/conftool/dbconfig/20220112-191436-marostegui.json [19:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:38] !log elastic10180 - one power supply seemingly failed - see icinga IPMI alert - [Status = Critical, PS Redundancy = Critical] T294805 [19:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:43] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [19:14:52] still looks fine live on group0 wikis [19:15:11] I'll be around pingable for another 45 minutes in case something goes wrong; I can't imagine it would [19:15:13] you're up next [19:15:32] cjming: ^ [19:15:33] gtk - onward [19:15:42] when done I will upgrade the 
CI jenkins which should be done before dduvall runs the train [19:15:52] (03Merged) 10jenkins-bot: Add new vector skin key to RelatedArticlesFooterAllowedSkins. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753187 (https://phabricator.wikimedia.org/T298916) (owner: 10Clare Ming) [19:16:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:14] !log zookeeper-test1002 - CRITICAL - degraded: The following units failed: ifup@ens5.service - for this issue see T273026 (T268074) [19:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:19] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [19:17:19] T268074: Create kafka test cluster - https://phabricator.wikimedia.org/T268074 [19:18:04] ACKNOWLEDGEMENT - Check systemd state on zookeeper-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service daniel_zahn https://phabricator.wikimedia.org/T268074 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:55] !log pybal-test2002 - apt-get clean after icinga alert about disk space running out [19:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:06] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753187|Add new vector skin key to RelatedArticlesFooterAllowedSkins. 
(T298916)]] (duration: 01m 21s) [19:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:10] T298916: Any code that checks getSkinName for vector must now also check vector-2022 - https://phabricator.wikimedia.org/T298916 [19:19:44] ACKNOWLEDGEMENT - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service daniel_zahn why do we have prod crit alerts for test servers in the first place https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:42] alrighty - that seems to be it for patches in this window -- urbanecm: is it ok to log that i'm closing the window now or do you usually hang out for a while longer? [19:20:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:20:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:58] cjming: feel free to close it early if no customers are left :) [19:21:06] roger that [19:21:29] cjming: h.ashar appears to want to do something when deployments finish (see few lines above), so might be good time to ping him too :)) [19:21:32] !log [deneb:~] $ sudo systemctl start package_builder_Clean_up_build_directory.service [19:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:43] !log end of UTC evening backport & config window [19:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:31] hashar: B&C window closed - feel free to do what you need to do 
[19:22:33] !log deneb - for some reason the "package builder clean up build directory"-service fails T287222 [19:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:36] T287222: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 [19:22:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T297191)', diff saved to https://phabricator.wikimedia.org/P18696 and previous config saved to /var/cache/conftool/dbconfig/20220112-192244-marostegui.json [19:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:48] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [19:23:59] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service daniel_zahn https://phabricator.wikimedia.org/T287222#7617579 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:01] 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10Dzahn) ` [deneb:~] $ sudo systemctl status package_builder_Clean_up_build_directory.service ● package_builder_Clean_up_build_directory.service - Delete builds older the 2 weeks Loaded: loaded (/lib/syste... 
[19:25:05] !log begin eqiad opensearch upgrade T288621 [19:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:08] T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch - https://phabricator.wikimedia.org/T288621 [19:25:20] (03CR) 10Cwhite: [C: 03+2] site: reprovision eqiad logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/752756 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [19:25:51] RECOVERY - Disk space on pybal-test2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=pybal-test2002&var-datasource=codfw+prometheus/ops [19:25:54] (03PS2) 10Cwhite: site: reprovision eqiad logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/752756 (https://phabricator.wikimedia.org/T288621) [19:26:41] 10SRE, 10Infrastructure-Foundations: decom sodium - https://phabricator.wikimedia.org/T298727 (10Dzahn) was alerting in Icinga but ACKed with reference to this ticket. 
it will disappear from monitoring once it's removed from puppet DB which will be done by the cookbook for decom [19:26:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:28:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:35] !log wdqs2003 - one power supply failed so it's not redundant anymore, says Icinga [19:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:59] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:30:22] 10SRE, 10ops-codfw, 10Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644 (10Dzahn) I don't know if this still relevant but wdqs2003 has one power supply failed so they are not redundant anymore, says an Icinga check. Would you like that checked? 
[19:32:28] cjming: thx [19:32:43] I am going to upgrade the CI Jenkins for T298691 [19:32:44] T298691: 2022-01-12 Jenkins security advisory pre-announcement - https://phabricator.wikimedia.org/T298691 [19:34:32] (03PS1) 10Eigyan: [wmf-config] Deploy GDI survey to production cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) [19:34:49] !log Upgrading CI Jenkins and Gearman plugin T298691 [19:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:02] hmm [19:36:21] now I am confused cause the jobs are properly running even though I haven't upgraded the gearman plugin [19:37:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P18697 and previous config saved to /var/cache/conftool/dbconfig/20220112-193749-marostegui.json [19:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:49] 10SRE, 10SRE-swift-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10Dzahn) currently there is this alert in Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-fe1005&service=Check+systemd+state The following units failed: s... 
[19:39:49] ACKNOWLEDGEMENT - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The following units failed: swiftrepl-mw.service daniel_zahn https://phabricator.wikimedia.org/T162122 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:10] (03PS1) 10Cwhite: hiera: fix opensearch common_settings namespace [puppet] - 10https://gerrit.wikimedia.org/r/753547 (https://phabricator.wikimedia.org/T288621) [19:42:25] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:43:51] 10SRE-tools, 10Infrastructure-Foundations, 10netbox: Netbox Reports Ideas and Requests - https://phabricator.wikimedia.org/T222931 (10Dzahn) Any suggestions what we can about monitoring of the reports? Just spent some time cleaning out unhandled Icinga alerts but we always have the netbox alerts there.. see... [19:44:25] !log Clearing /srv partition on integration-castor03 [19:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:52] (03CR) 10Cwhite: [C: 03+2] hiera: fix opensearch common_settings namespace [puppet] - 10https://gerrit.wikimedia.org/r/753547 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [19:49:44] (03PS2) 10Eigyan: [wmf-config] Deploy the cawiki test safety survey to production. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) [19:52:14] !log Restarting CI Jenkins once more to apply the Gearman plugin update T298691 [19:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:17] T298691: 2022-01-12 Jenkins security advisory pre-announcement - https://phabricator.wikimedia.org/T298691 [19:52:48] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10Papaul) 05Open→03Resolved [19:52:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P18698 and previous config saved to /var/cache/conftool/dbconfig/20220112-195254-marostegui.json [19:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:02] (03CR) 10Dzahn: "this is kind of bad, inconsistency that I want to fix first now:" [puppet] - 10https://gerrit.wikimedia.org/r/753495 (owner: 10Dzahn) [19:55:18] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) (owner: 10Eigyan) [19:55:36] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10Papaul) The IDRAC on this server needs reset. Please coordinate a day and time that is best for this... 
[19:56:52] (03CR) 10Dzahn: Revert "Revert "phabricator: move vcs firewall rules to profile"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753495 (owner: 10Dzahn) [19:58:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, analytics-admins for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (10Peachey88) [20:00:05] dduvall and twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T2000). [20:05:49] I have finished the CI Jenkins upgrade [20:05:56] seems all fine [20:06:03] cool! [20:06:20] (03CR) 10Jdlrobson: [C: 04-1] "Need to update this for beta cluster..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) (owner: 10Jdlrobson) [20:07:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T297191)', diff saved to https://phabricator.wikimedia.org/P18699 and previous config saved to /var/cache/conftool/dbconfig/20220112-200759-marostegui.json [20:08:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:08:03] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [20:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T297191)', diff saved to https://phabricator.wikimedia.org/P18700 and previous config 
saved to /var/cache/conftool/dbconfig/20220112-200806-marostegui.json [20:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:02] (03PS2) 10Dzahn: Revert "Revert "phabricator: move vcs firewall rules to profile"" [puppet] - 10https://gerrit.wikimedia.org/r/753495 [20:10:22] (03PS1) 10Ssingh: Add test environment for Mozilla TRR tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/753553 [20:10:50] me starts building stuff on integration.wikimedia.org and it works [20:11:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T297191)', diff saved to https://phabricator.wikimedia.org/P18701 and previous config saved to /var/cache/conftool/dbconfig/20220112-201114-marostegui.json [20:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:40] hashar: just used it, thanks [20:11:41] nice, hashar [20:11:56] getting ready to roll to group1 over here [20:12:12] (03CR) 10Ssingh: [C: 03+2] Add test environment for Mozilla TRR tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/753553 (owner: 10Ssingh) [20:12:19] I am surprised it worked on the first try to be honst [20:12:50] (03CR) 10Dzahn: "this is more like it: https://puppet-compiler.wmflabs.org/pcc-worker1003/33211/" [puppet] - 10https://gerrit.wikimedia.org/r/753495 (owner: 10Dzahn) [20:14:06] (03PS3) 10Dzahn: Revert "Revert "phabricator: move vcs firewall rules to profile"" [puppet] - 10https://gerrit.wikimedia.org/r/753495 [20:14:15] heh:) [20:16:24] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "phabricator: move vcs firewall rules to profile"" [puppet] - 10https://gerrit.wikimedia.org/r/753495 (owner: 10Dzahn) [20:16:26] (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753554 [20:16:28] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/753554 (owner: 10Dduvall) [20:17:33] !log applying firewall change on phabricator (VCS, git-ssh), second attempt, first codfw-only [20:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:45] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753554 (owner: 10Dduvall) [20:19:49] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.17 refs T293958 [20:19:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:52] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:20:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:10] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.17 refs T293958 (duration: 01m 21s) [20:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:17] (03PS1) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P18702 and previous config 
saved to /var/cache/conftool/dbconfig/20220112-202619-marostegui.json [20:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:33] i'm seeing more "Database is read-only: The database is read-only until replication lag decreases." errors today as well as "Parser state cleared while parsing" [20:26:39] it seems to have subsided, however [20:26:57] twentyafterfour: do you see these ^ last week as well? [20:27:00] *did* [20:27:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:05] (03PS2) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:28:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:28:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:29:17] i take that back. 
they have not completely subsided [20:29:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:34] (03PS3) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:29:38] (03CR) 10jerkins-bot: [V: 04-1] phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 (owner: 10Dzahn) [20:30:12] ah yeah, and slow queries is showing a massive spike [20:30:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:30:20] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.8356 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:30:21] rolling back [20:30:24] are you doing the train dance? 
[20:30:26] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:30:41] dancy: haha, not yet [20:30:50] well, not the fun one [20:30:56] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:31:40] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06849 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:32:12] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:32:42] PROBLEM - MariaDB Replica Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:32:50] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:33:02] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:33:25] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.38.0-wmf.17 [20:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:34:32] (03PS4) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:36:55] !log 1.38.0-wmf.17 rolled back from group1 due to large spike in db read-only errors and slow queries (T293958) [20:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:58] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:37:06] RECOVERY - MariaDB Replica Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:38:38] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:39:06] somebody enabled puppet on phab1001 [20:39:14] but it was disabled on purpose to prevent 
that [20:39:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:40:00] sigh [20:40:40] * dancy pats mutante on the back. Sorry man [20:41:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P18703 and previous config saved to /var/cache/conftool/dbconfig/20220112-204124-marostegui.json [20:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:41] ACKNOWLEDGEMENT - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn . https://wikitech.wikimedia.org/wiki/PyBal [20:42:41] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn . 
https://wikitech.wikimedia.org/wiki/PyBal [20:45:45] (03PS5) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:46:28] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 164 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:46:38] !log joal@deploy1002 Started deploy [analytics/refinery@988b7d2]: Regular analytics weekly train [analytics/refinery@988b7d2] [20:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:24] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:51:08] (03PS6) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:52:34] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 24 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:52:40] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 48 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:54:07] (03PS7) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [20:56:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T297191)', diff saved to https://phabricator.wikimedia.org/P18704 and previous config saved to 
/var/cache/conftool/dbconfig/20220112-205629-marostegui.json [20:56:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:56:33] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [20:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T297191)', diff saved to https://phabricator.wikimedia.org/P18705 and previous config saved to /var/cache/conftool/dbconfig/20220112-205636-marostegui.json [20:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:40] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,POST,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:57:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T297191)', diff saved to https://phabricator.wikimedia.org/P18706 and previous config saved to /var/cache/conftool/dbconfig/20220112-205744-marostegui.json [20:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] dduvall and twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T2000). [21:00:04] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220112T2100). [21:04:22] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:05:30] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:05:36] (03PS2) 10Jdlrobson: Enable CirrusSearch on it/en Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752751 [21:06:36] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:06:45] (03PS8) 10Dzahn: phabricator: debug edit [puppet] - 10https://gerrit.wikimedia.org/r/753555 [21:08:58] train is being held at group0 fyi. no eta [21:09:23] (03PS4) 10Jdlrobson: Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) [21:09:45] (03PS5) 10Jdlrobson: Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) [21:10:02] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:10:58] !log joal@deploy1002 Finished deploy [analytics/refinery@988b7d2]: Regular analytics weekly train [analytics/refinery@988b7d2] (duration: 24m 20s) [21:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:02] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:11:18] !log joal@deploy1002 Started deploy [analytics/refinery@988b7d2] (thin): Regular analytics weekly train THIN [analytics/refinery@988b7d2] [21:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:25] !log joal@deploy1002 Finished deploy [analytics/refinery@988b7d2] (thin): Regular analytics weekly train THIN [analytics/refinery@988b7d2] (duration: 00m 07s) [21:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:35] !log joal@deploy1002 Started deploy [analytics/refinery@988b7d2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@988b7d2] [21:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P18707 and previous config saved to /var/cache/conftool/dbconfig/20220112-211248-marostegui.json [21:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:19] (03PS9) 10Dzahn: phabricator: fix ferm rules for VCS, git-ssh [puppet] - 10https://gerrit.wikimedia.org/r/753555 (https://phabricator.wikimedia.org/T114209) [21:15:27] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure for wdqs2003.codfw.wmnet - 
https://phabricator.wikimedia.org/T299098 (10RKemper) [21:15:40] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10RKemper) [21:16:08] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33219/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/753555 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [21:16:41] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10RKemper) [21:17:18] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10RKemper) [21:17:32] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10RKemper) [21:17:59] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure (PS2) for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10RKemper) [21:18:31] ACKNOWLEDGEMENT - IPMI Sensor Status on wdqs2003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] Ryan Kemper https://phabricator.wikimedia.org/T299098 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:18:32] !log joal@deploy1002 Finished deploy [analytics/refinery@988b7d2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@988b7d2] (duration: 06m 57s) [21:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:24] !log [WDQS] T299098 depooled `wdqs2003` so dc-ops can take a look at the PS2 failure [21:19:32] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:33] T299098: hw troubleshooting: IPMI Power Supply Failure (PS2) for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 [21:20:56] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:22:24] RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:24:44] ^ entirely not related to appserver or DB [21:25:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn) [21:26:14] (03CR) 10Clare Ming: [C: 03+1] Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) (owner: 10Jdlrobson) [21:26:28] (03CR) 10Clare Ming: [C: 03+1] Enable CirrusSearch on it/en Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752751 (owner: 10Jdlrobson) [21:26:59] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn) phabricator done anways [21:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P18708 and previous config saved to /var/cache/conftool/dbconfig/20220112-212753-marostegui.json [21:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:39] !log mbsantos@maps1009.eqiad.wmnet: start imposm-initial-import - full planet re-import (T299049) [21:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:42] T299049: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T299049 [21:36:42] (03PS1) 10Dzahn: phabricator: 
de-duplicate list of VCS IPs and usage in module [puppet] - 10https://gerrit.wikimedia.org/r/753561 [21:37:20] (03CR) 10jerkins-bot: [V: 04-1] phabricator: de-duplicate list of VCS IPs and usage in module [puppet] - 10https://gerrit.wikimedia.org/r/753561 (owner: 10Dzahn) [21:40:17] (03CR) 10Dzahn: [C: 04-2] "nevermind, not going to work, we would have to move the entire VCS class into profile::phabricator::main or its own profile::phabricator::" [puppet] - 10https://gerrit.wikimedia.org/r/753561 (owner: 10Dzahn) [21:41:56] (03CR) 10Dzahn: [C: 04-2] "yea, so we should turn this into "move phabricator::vcs to profile::phabricator::vcs"" [puppet] - 10https://gerrit.wikimedia.org/r/753561 (owner: 10Dzahn) [21:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T297191)', diff saved to https://phabricator.wikimedia.org/P18709 and previous config saved to /var/cache/conftool/dbconfig/20220112-214258-marostegui.json [21:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:02] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [21:43:21] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q3), 10User-ema: Investigate cp5006 crash - https://phabricator.wikimedia.org/T292506 (10lmata) [21:43:26] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10lmata) [21:44:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10Sustainability (Incident Followup): Alert that should have paged via VictorOps was delayed because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10lmata) [21:44:38] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - 
https://phabricator.wikimedia.org/T202061 (10lmata) [21:45:08] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10lmata) [21:49:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:50:14] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:50:33] ^^ known [21:51:13] i've commented on the train blocker, the issue seems mitigated but still ongoing, I may disconnect soon, so someone please keep an eye on it [21:51:24] https://phabricator.wikimedia.org/T299095#7618125 [21:52:11] it may be just a long tail because cache or job queue or something and it may eventually solve itself, haven't researched [21:54:08] ok, thanks cwhite, glad I got _the other_ pybal alert cleared in time, heh :) [21:56:43] :) [22:05:15] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10Platonides) If I understand this task correctly, currently the Ganeti cluster is running on stretch nodes. The VM themselves have no explicit kvm:machine_version se... 
[22:07:50] jynus: a train blocker is UBN by default so it'll stay that until it's solved and deployers are happy to reroll [22:08:16] in any case, what I mean is it is still hot [22:08:32] as in, the issue isn't fully gone after revert [22:08:56] even if it is probably not causing user-facing issues [22:09:29] which is also weird [22:11:39] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:12:17] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:12:45] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 229 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:13:39] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:15:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:21:17] (03PS2) 10Bartosz Dziewoński: DiscussionTools: Use bullet indentation on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753192 (https://phabricator.wikimedia.org/T259864) [22:37:08] (03PS1) 10Cwhite: logstash: ensure dlq directory exists [puppet] - 10https://gerrit.wikimedia.org/r/753571 [22:38:08] (03CR) 10jerkins-bot: [V: 04-1] logstash: ensure dlq directory exists [puppet] - 10https://gerrit.wikimedia.org/r/753571 (owner: 10Cwhite) [22:40:27] (03PS2) 10Cwhite: logstash: ensure dlq directory exists [puppet] - 10https://gerrit.wikimedia.org/r/753571 [22:48:04] !log end eqiad 
opensearch upgrade T288621 [22:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:08] T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch - https://phabricator.wikimedia.org/T288621 [22:50:35] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:52:12] (03CR) 10Scardenasmolinar: [C: 03+1] "Thank you for working on this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) (owner: 10Eigyan) [22:54:21] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:54:49] not sure if getting fixed or just the traffic is much lower [23:05:16] 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10RobH) [23:05:24] 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10RobH) [23:12:25] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:38] jynus: exceptions recovered! [23:13:47] nice! [23:13:56] that was a nasty tail [23:14:23] mmm, although not to pre-deployment levels [23:14:24] also for the record, i started a maintenance job manually earlier around the same time [23:14:32] maybe it is unrelated? 
[23:14:34] but that was s3 and something that normally runs as cron/timer [23:15:20] it was 'mediawiki_job_updatequerypages_mostlinked_s3@13.service' and is still running. I don't think it's related to anything [23:16:57] it is "RuntimeException: Could not acquire lock for page ID" [23:17:07] consistent with a high level of writes [23:17:32] I don't think it is that job- that would be mostly big reads [23:18:13] and because it says "s3" and you were talking about other clusters [23:18:15] plus it is mostly on api servers + job runner [23:18:35] maintenance would be quite different pattern [23:18:44] ack, ok [23:19:04] :-( [23:22:08] /srv/mediawiki/php-1.38.0-wmf.16/includes/deferred/LinksUpdate.php [23:22:34] mostly on wikidata [23:22:55] (PS1) Tim Starling: Database::factorConds(): fix insufficient parenthesization [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753498 (https://phabricator.wikimedia.org/T299095) [23:24:57] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 8.583e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [23:25:13] jynus: see TimStarling's update on the task. looks like errant DELETE is causing the lock contention [23:25:35] TimStarling: is it causing data corruption? should i rollback group0 for now? [23:25:35] either everything needs to be rolled back to .16 or my patch needs to be pushed out immediately [23:25:55] i can sync it now [23:25:55] yes it is causing data corruption [23:26:29] ok. a rollback is easy enough though. i will rollback and then we can merge/sync your patch and move to group0/verify [23:26:31] :-( I am assuming not of canonical data? [23:26:43] just links parsing, etc.?
[23:27:18] just links tables, I introduced factorConds to support my links table work and so LinksUpdate is the only caller [23:27:33] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:00] ok, I know that may have quite some impact, but at least it doesn't mean I have to pull a one nighter for backup recovery, looking at the bright side [23:29:08] we may have to dust off an old maintenance script for refreshing links tables [23:29:17] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert group0 wikis to 1.38.0-wmf.17 [23:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:34] db2101 has some weird soft errors [23:29:38] looking [23:29:41] (PS1) Samwilson: Enable Disambiguator notifications on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/753584 (https://phabricator.wikimedia.org/T293319) [23:30:27] (PS1) Dduvall: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - https://gerrit.wikimedia.org/r/753585 [23:30:29] (CR) Dduvall: [C: +2] Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - https://gerrit.wikimedia.org/r/753585 (owner: Dduvall) [23:30:31] (PS1) Dduvall: Revert "group0 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - https://gerrit.wikimedia.org/r/753586 [23:30:33] (CR) Dduvall: [C: +2] Revert "group0 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - https://gerrit.wikimedia.org/r/753586 (owner: Dduvall) [23:31:50] (Merged) jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - https://gerrit.wikimedia.org/r/753585 (owner: Dduvall) [23:31:54] (Merged) jenkins-bot: Revert "group0 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - https://gerrit.wikimedia.org/r/753586 (owner: Dduvall) [23:33:27] exceptions
looks super nice now [23:36:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:37:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:29] (CR) jerkins-bot: [V: -1] Database::factorConds(): fix insufficient parenthesization [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753498 (https://phabricator.wikimedia.org/T299095) (owner: Tim Starling) [23:43:29] we should check if any bad delete queries are still running [23:45:24] looks ok on db1138 -- I guess the worst queries were killed already somehow [23:45:41] according to https://tendril-legacy.wikimedia.org/activity?
no query longer than 10 seconds is running [23:46:34] I changed the DELETE query I found in the logs to SELECT COUNT(*) and it showed 40M rows [23:47:00] so I guess something killed it before it actually managed to delete 40M rows [23:47:44] there is a watchdog on the dbs, although not sure how well it works (kills all queries on overload) [23:48:56] https://tendril-legacy.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=wik&qmode=eq&query=DELETE&hours=4 [23:53:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:15] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook