[00:00:05] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T0000). [00:33:03] 10SRE, 10SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (10MBinder_WMF) [00:36:19] (03CR) 10RLazarus: [C: 03+1] puppet: reduce verbosity of Cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/720996 (owner: 10Volans) [00:42:52] (03CR) 10RLazarus: [C: 03+1] "LGTM with one comment, feel free to merge without another roundtrip." [software/spicerack] - 10https://gerrit.wikimedia.org/r/720995 (owner: 10Volans) [00:47:07] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:41] (03CR) 10RLazarus: [C: 03+1] pylint: fix newly reported issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720910 (owner: 10Volans) [00:49:05] (03CR) 10Krinkle: [C: 03+1] Message: Remove deprecated format property [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [00:58:29] (03CR) 10RLazarus: [C: 03+1] "Yikes, sorry to have let this sit so long! LGTM pending the comment below, feel free to merge without waiting for another roundtrip." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [02:12:51] (03CR) 10Krinkle: [C: 03+1] Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [02:13:31] (03PS2) 10Huji: Temporarily disable anonymous editing on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) [02:13:33] (03PS2) 10Krinkle: clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 [02:42:27] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:46:53] (03CR) 10Jforrester: "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [04:12:18] (03CR) 10Albertoleoncio: [C: 04-1] "Some thoughts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [05:02:41] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Marostegui) [05:02:57] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Marostegui) [05:11:19] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:11:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:56] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, 10serviceops: Serve production traffic via 
Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [05:35:51] !log Optimize dewiki.logging in codfw T287344 [05:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:57] T287344: Please optimize logging table in dewiki - https://phabricator.wikimedia.org/T287344 [05:43:29] (03PS1) 10Marostegui: mariadb.yaml: Change replication_type [puppet] - 10https://gerrit.wikimedia.org/r/721421 (https://phabricator.wikimedia.org/T291144) [05:44:45] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:24] (03CR) 10Marostegui: [C: 04-2] "This needs to wait for s6 to be cleaned up as it is still replicating codfw -> eqiad until wikitech is moved." [puppet] - 10https://gerrit.wikimedia.org/r/721421 (https://phabricator.wikimedia.org/T291144) (owner: 10Marostegui) [06:17:21] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Joe) >>! In T288848#7357474, @Legoktm wrote: > We could teach MediaWiki how to use a transparent proxy instead, I'll poke at that... 
[06:59:57] (03CR) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [07:02:24] (03CR) 10DCausse: "should be a noop meant to remove some repetitions from the hieradata host overrides as we gradually activate the streaming updater on a pe" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [07:03:01] (03CR) 10DCausse: [C: 04-1] "to be merged while the data-transfer to this machine is running" [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [07:09:51] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10ema) [07:12:07] (03CR) 10Legoktm: [C: 04-1] "Thanks for taking care of this! I noticed two things" [puppet] - 10https://gerrit.wikimedia.org/r/721244 (owner: 10Elukey) [07:13:06] (03CR) 10Muehlenhoff: [C: 03+2] Disable scope cleanup cron on Thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [07:20:37] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:20:39] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:22:23] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:25:57] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 125 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:27:00] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop IRC alert does not include DC name anymore - https://phabricator.wikimedia.org/T291149 (10ema) [07:29:59] (03PS2) 10Elukey: profile::configmaster::disc_desired_state.py: update after switchover [puppet] - 10https://gerrit.wikimedia.org/r/721244 [07:31:32] (03CR) 10Elukey: "Fixed Lego's comments, but I am still wondering about few things:" [puppet] - 10https://gerrit.wikimedia.org/r/721244 (owner: 10Elukey) [07:37:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/721383 (https://phabricator.wikimedia.org/T290991) (owner: 10Cathal Mooney) [07:37:33] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 49 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:40:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10MoritzMuehlenhoff) >>! In T290766#7356684, @MRaishWMF wrote: > Hi @cmooney , actually I just checked again (80 minutes later) and I actually do have... [07:47:13] (03PS1) 10Giuseppe Lavagetto: conftool::safe_service_restart: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/721468 [07:48:11] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [07:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:31] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:48:38] <_joe_> elukey: are we sure chartmuseum is active in both DCs normally? 
[07:48:44] <_joe_> (just asking, I have no idea [07:48:55] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:13] _joe_: I asked to Janis on #service-ops, this is why I turned it on, but I'll double check [07:50:37] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31097/console" [puppet] - 10https://gerrit.wikimedia.org/r/721468 (owner: 10Giuseppe Lavagetto) [07:52:08] _joe_: yes I see chartmuseum.service on chartmuseum2001, should be ok [07:52:41] now I have no idea if active/active leads to issues [07:53:27] yes, yes. We are sure [07:53:35] (03PS3) 10Elukey: profile::configmaster::disc_desired_state.py: update after switchover [puppet] - 10https://gerrit.wikimedia.org/r/721244 [07:53:48] * elukey trusts jayme [07:55:27] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (10MoritzMuehlenhoff) Since the Thanos hosts run Buster and a more recent kernel/glibc/systemd, I disabled the cleanup cron job on these hosts, so... [07:55:27] It even says so in https://wikitech.wikimedia.org/wiki/ChartMuseum#Operations :-p [08:01:23] * elukey trusts jayme even more [08:01:45] (03CR) 10Gehel: "Minor comment inline. This is probably worth a more synchronous discussion, ping me on IRC!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/720993 (owner: 10Volans) [08:04:48] !log upgrading scandium to PHP 7.2 backport of patch for enhanced DOM replaceChild/removeChild performance T291052 [08:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:54] T291052: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 [08:05:39] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10MoritzMuehlenhoff) scandium has been upgraded. If tests are fine, I'd upload to apt.wikimedia.org [08:36:50] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10Aklapper) 05Open→03Invalid Hi and welcome @ERayfield! None of Gerrit, Phabricator or IRC usage in itself require being a member of an LDAP group. Thus I'm closing this ticket. If there are... [08:52:26] (03CR) 10Hashar: [C: 03+2] "Also taked about it with Ariel and Daniel this morning. 
I will deploy it ;)" [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [08:54:02] (03CR) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [08:54:10] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] conftool::safe_service_restart: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/721468 (owner: 10Giuseppe Lavagetto) [08:59:00] (03CR) 10Gehel: tests: fix typo in test name (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/720926 (owner: 10Volans) [09:00:40] (03CR) 10Giuseppe Lavagetto: Add configuration for wmerrors to php-multiversion-base (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/721333 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [09:00:57] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) >>! In T290445#7355990, @akosiaris wrote: > And this doesn't add up. In grafana https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?viewPanel=15&orgId=1&f... 
[09:10:39] !log in-place re-installation of mx2002.wikimedia.org (test VM) to test the new installer key support in the sre.puppet.renew-cert cookbook [09:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:47] (03CR) 10jerkins-bot: [V: 04-1] Message: Remove deprecated format property [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [09:35:21] PROBLEM - Exim SMTP on mx2002 is CRITICAL: connect to address 208.80.153.72 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [09:36:01] ^ mx2002 is me, fixing down time [09:36:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2002.wikimedia.org with reason: reimage [09:36:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2002.wikimedia.org with reason: reimage [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:14] 10SRE, 10DNS, 10Traffic: One more DNS request for Wikilearn - https://phabricator.wikimedia.org/T291090 (10Vgutierrez) a:03Vgutierrez [09:48:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10cmooney) Ok @MRaishWMF thanks for confirming! And @MoritzMuehlenhoff that indeed makes sense, I'll delay a little before responding after processi... 
[09:48:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10cmooney) 05Open→03Resolved p:05Triage→03Medium [09:51:36] (03PS1) 10Vgutierrez: learn.wiki: Add apex A records [dns] - 10https://gerrit.wikimedia.org/r/721480 (https://phabricator.wikimedia.org/T291090) [09:52:58] (03CR) 10Vgutierrez: [C: 03+2] learn.wiki: Add apex A records [dns] - 10https://gerrit.wikimedia.org/r/721480 (https://phabricator.wikimedia.org/T291090) (owner: 10Vgutierrez) [09:56:14] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: One more DNS request for Wikilearn - https://phabricator.wikimedia.org/T291090 (10Vgutierrez) 05Open→03Resolved ` $ host -t A learn.wiki learn.wiki has address 76.223.57.52 learn.wiki has address 13.248.190.88 ` [10:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1000). 
[10:00:24] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:00:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:07] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:01:09] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [10:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:14] (03CR) 10Tobias Andersson: "Tried this out a bit locally, and seems to do what we want. However there is one issue we need to fix." [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [10:08:48] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) The set of patches above should allow us to get wmerrors working; we can work on moving php7-fatal-error.php to mediawiki-config separately. 
[10:11:36] (03PS1) 10Muehlenhoff: renew-cert: Don't disable Puppet in installer mode [cookbooks] - 10https://gerrit.wikimedia.org/r/721482 [10:11:55] (03CR) 10Ayounsi: [C: 03+1] Add durum hosts durum[12345]00[12] to BGP anycast [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [10:11:59] !log depool mw1422 for network testing [10:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:55] (03CR) 10jerkins-bot: [V: 04-1] renew-cert: Don't disable Puppet in installer mode [cookbooks] - 10https://gerrit.wikimedia.org/r/721482 (owner: 10Muehlenhoff) [10:14:09] !log depool mw1455 for network testing [10:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:45] (03PS2) 10Muehlenhoff: renew-cert: Don't disable Puppet in installer mode [cookbooks] - 10https://gerrit.wikimedia.org/r/721482 [10:21:25] !log Changing default gateway on mw1422 to use VRRP backup (cr2), to determine if tail drops from switches to cr1 is cause of TCP retransmissions. [10:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:36] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add configuration for wmerrors to php-multiversion-base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/721333 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [10:28:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is nice! Thanks for this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [10:29:57] (03CR) 10Hashar: [C: 03+2] "CI agent went out of disk space :-\" [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [10:36:07] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Addshore) We would consider using this in some up coming work if it gets merged [10:41:08] (03PS9) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [10:41:48] (03PS4) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) [10:45:19] <_joe_> jouncebot: next [10:45:20] In 0 hour(s) and 14 minute(s): EU Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1100) [10:47:32] (03PS5) 10Vgutierrez: envoyproxy: Allow configuring TLS handshake timeout [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) [10:47:46] (03Merged) 10jenkins-bot: Message: Remove deprecated format property [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [10:51:38] (03CR) 10Volans: [C: 03+1] "LGTM, good catch" [cookbooks] - 10https://gerrit.wikimedia.org/r/721482 (owner: 10Muehlenhoff) [10:54:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:59] I am deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721313 [10:56:03] for group0 wikis [10:57:00] (03CR) 10Muehlenhoff: [C: 03+2] renew-cert: Don't disable Puppet in installer mode [cookbooks] - 10https://gerrit.wikimedia.org/r/721482 (owner: 10Muehlenhoff) [10:59:26] (03CR) 10Volans: "Replies inline, no new PS at this stage." [software/spicerack] - 10https://gerrit.wikimedia.org/r/720993 (owner: 10Volans) [10:59:34] !log hashar@deploy1002 Synchronized php-1.37.0-wmf.21/includes/language/Message.php: Message: Remove deprecated format property - T146416 T291124 (duration: 01m 06s) [10:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:42] T291124: PHP Notice: Undefined index: format - https://phabricator.wikimedia.org/T291124 [10:59:42] T146416: Message -> string transformations should not affect each other - https://phabricator.wikimedia.org/T146416 [11:00:05] Amir1, Lucas_WMDE, and apergos: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for EU Backport and Config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1100). [11:00:05] Lucas_WMDE: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:23] here [11:00:29] no trainees have signed up for this session [11:00:30] any trainees? 
[11:00:32] ok [11:00:32] hi [11:00:34] (03CR) 10Volans: [C: 03+2] pylint: fix newly reported issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720910 (owner: 10Volans) [11:00:47] I did push a patch for core wmf.21 which was a train blocker: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721313 [11:00:47] Lucas_WMDE: yours is the oly patch, it looks fine to my not expert eyes, I assume you will self serve? [11:00:53] yup [11:01:04] it took longer than expected cause CI exploded for unrelated reason (jenkins agent got a full disk) [11:01:06] hashar: should I pause for a bit before deploying? [11:01:15] it seems fine [11:01:24] no new log entered ;] [11:01:34] so I think it is fine for the backport window [11:01:42] ok thanks [11:01:44] 👍 [11:02:01] if you're doing it during the window can you (or whoever is doing it) please add it to the calendar for the record? [11:02:11] otherwise yeah glad to see it go out [11:02:28] oh do you mean it already went? [11:02:42] yeah there was a logmsgbot message just before jouncebot [11:02:50] ah ok [11:02:58] yeah I already deployed it [11:03:01] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [11:03:04] \o/ [11:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:14] (03PS2) 10Lucas Werkmeister (WMDE): Add new WikimediaBadges config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721305 (https://phabricator.wikimedia.org/T232927) [11:03:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [11:03:16] I wanted to do it earlier in the morning but CI failed and I was in a video meeting that took longer than expected :D [11:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:35] well the important thing is the train is unblocked :-) [11:03:56] yup. 
Not sure whether I will push it this morning or rather wait for tonight [11:04:11] (03PS1) 10Urbanecm: UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721318 (https://phabricator.wikimedia.org/T291088) [11:04:17] may it all go smoothly whenever you do it! [11:04:22] (03Merged) 10jenkins-bot: pylint: fix newly reported issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720910 (owner: 10Volans) [11:04:29] (03PS1) 10Urbanecm: UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721319 (https://phabricator.wikimedia.org/T291088) [11:04:42] (03PS1) 10Muehlenhoff: renew-cert: One more step skipped in installer mode [cookbooks] - 10https://gerrit.wikimedia.org/r/721488 [11:05:05] Lucas_WMDE: assuming you're deploying, would it be ok to get the two backports I just uploaded out too? :) [11:05:16] I am taking a lunch break. Have good deployments [11:05:20] urbanecm: sure [11:05:23] enjoy your meal hashar [11:05:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add new WikimediaBadges config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721305 (https://phabricator.wikimedia.org/T232927) (owner: 10Lucas Werkmeister (WMDE)) [11:05:38] Lucas_WMDE: thanks. 
+2'ing them to give jenkins some time [11:05:41] (03CR) 10Urbanecm: [C: 03+2] UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721319 (https://phabricator.wikimedia.org/T291088) (owner: 10Urbanecm) [11:05:45] (03CR) 10Urbanecm: [C: 03+2] UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721318 (https://phabricator.wikimedia.org/T291088) (owner: 10Urbanecm) [11:05:58] ok [11:06:08] my config change doesn’t need testing so it should go pretty quickly [11:06:13] ack [11:06:34] this is a no-testing case too, the only place where that code is used is a maint script [11:06:35] excellent [11:06:40] cool [11:06:53] oh sneaking in two more patches, I seeeeee [11:07:18] (03CR) 10JMeybohm: [C: 04-1] hier::common::deployment_server add environment helmfile-defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:07:27] please urbanecm don't forget to add them to the calendar entry! [11:07:31] urbanecm working on mentor growthexperiments stuff, that rings a bell 🤔 [11:07:32] just did that apergos :) [11:07:42] ah my reload was too quick, I didn't see them there then [11:07:52] (03Merged) 10jenkins-bot: Add new WikimediaBadges config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721305 (https://phabricator.wikimedia.org/T232927) (owner: 10Lucas Werkmeister (WMDE)) [11:08:03] Lucas_WMDE: you mean T291100? or some other bell? 🙂 [11:08:04] T291100: "Say hi to your new mentor!" 
message partially truncated - https://phabricator.wikimedia.org/T291100 [11:08:16] (03CR) 10JMeybohm: [C: 04-1] hier::common::deployment_server add environment helmfile-defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:08:21] just that one :P [11:08:27] looking good, urbane cm [11:09:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:721305|Add new WikimediaBadges config (T232927)]] (1/2) (duration: 01m 05s) [11:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:51] T232927: Update the code behind the Wikimedia Commons link in the sidebar to use P910 and P1754 instead of P373 - https://phabricator.wikimedia.org/T232927 [11:10:09] Lucas_WMDE: okay 🙂. Let me know if you find some other bugs there, I'm officially maintaining mentorship features in GE :) [11:10:18] ok! [11:11:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:721305|Add new WikimediaBadges config (T232927)]] (2/2) (duration: 01m 05s) [11:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:44] I’m done, over to you for the backports [11:11:49] thanks [11:13:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:34] (03PS2) 10Jelto: hier::common::deployment_server add environment helmfile-defaults [puppet] - 10https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) [11:15:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:34] (03CR) 10Muehlenhoff: [C: 03+2] renew-cert: One more step skipped in installer mode [cookbooks] - 10https://gerrit.wikimedia.org/r/721488 (owner: 10Muehlenhoff) [11:21:12] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [11:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [11:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:49] (03PS17) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [11:22:38] (03CR) 10Muehlenhoff: "Tested with an in-place reinstall of mx2002, works like a charm, thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/721263 (owner: 10Volans) [11:23:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31098/console" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [11:25:11] PROBLEM - spamassassin on mx2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [11:28:01] (03PS18) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [11:28:33] (03Merged) 10jenkins-bot: UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees [extensions/GrowthExperiments] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721319 (https://phabricator.wikimedia.org/T291088) (owner: 10Urbanecm) [11:28:38] finally [11:28:53] (03CR) 10Elukey: kubernetes: add revscoring-editquality in the services configs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [11:29:37] :-) [11:29:47] (03CR) 10Cathal Mooney: [C: 03+2] Add rhuang-ctr user to puppet data.yaml file. 
[puppet] - 10https://gerrit.wikimedia.org/r/721383 (https://phabricator.wikimedia.org/T290991) (owner: 10Cathal Mooney) [11:30:23] but i want the wmf.23 too, as i currently only can reproduce on testwiki :D [11:30:28] (03Merged) 10jenkins-bot: UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721318 (https://phabricator.wikimedia.org/T291088) (owner: 10Urbanecm) [11:30:38] i just need to complain, good :) [11:31:06] (03CR) 10JMeybohm: [C: 04-1] services: deploy services with helm3 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:31:27] :-D [11:32:24] (03CR) 10JMeybohm: [C: 04-1] services: deploy services with helm3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:32:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:51] (03CR) 10Jelto: hier::common::deployment_server add environment helmfile-defaults (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:34:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:47] !log installing 4.9.272 kernels on stretch hosts (no reboots yet) [11:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:07] PROBLEM - Check systemd state on mx2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:07] PROBLEM - Check the NTP synchronisation status of timesyncd on mx2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [11:35:30] * urbanecm investigating weird status of wmf.23 at deployment host [11:35:45] RECOVERY - Exim SMTP on mx2002 is OK: OK - Certificate mx1001.wikimedia.org will expire on Sun 14 Nov 2021 01:37:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [11:36:01] RECOVERY - spamassassin on mx2002 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [11:36:09] (wmf.23's AbuseFilter appears to be dirty, with no secpatches applied) [11:36:54] apparently someone did `git reset --hard wmf/1.37.0-wmf.23` in that submodule? 
[11:36:55] RECOVERY - Check systemd state on mx2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:02] (03PS1) 10Arturo Borrero Gonzalez: wmcs: nfs: nfs-manage: don't harcode /24 netmasks [puppet] - 10https://gerrit.wikimedia.org/r/721503 [11:41:09] !log [urbanecm@deploy1002 /srv/mediawiki-staging/php-1.37.0-wmf.23/extensions/AbuseFilter (wmf/1.37.0-wmf.23 u=)]$ git co 0d2bc7ca17b9f767ae5753db7e4e41fd9e7d3531 # reset repo to expected state, fixing incorrect deploy of a backport in T291123 [11:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:15] T291123: TypeError: Argument 5 passed to MediaWiki\Extension\AbuseFilter\Parser\ParserStatus::__construct() must be of the type integer, null given, called in /srv/mediawiki/php-1.37.0-wmf.23/extensions/AbuseFilter/includes/Parser/ParserStatus.php on line 107 - https://phabricator.wikimedia.org/T291123 [11:41:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:55] !log [urbanecm@deploy1002 /srv/mediawiki-staging/php-1.37.0-wmf.23 (wmf/1.37.0-wmf.23 * u+2-2)]$ git rebase && git submodule update extensions/AbuseFilter/ # fixing an incorrect deployment that happened in T291123 [11:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:48] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/AbuseFilter/: Fixing incorrect deployment of 01e4450 for T291123. This is supposed to be a no-op. 
(duration: 01m 05s) [11:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:26] summarized what i did at https://phabricator.wikimedia.org/T291123#7358635 [11:48:00] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/721318/ fixed what it is supposed to do, syncing [11:49:23] back [11:49:34] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/includes/MentorDashboard/MenteeOverview/UncachedMenteeOverviewDataProvider.php: 9e0f6f84240bf621e97806a94a0e786817001668: UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees (T291088) (duration: 01m 04s) [11:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:39] T291088: updateMenteeData.php fatals for testwiki - https://phabricator.wikimedia.org/T291088 [11:49:42] * urbanecm waves to hashar [11:49:49] if you intend to deploy, i'Ll be finished in a couple of minutes [11:50:27] take your time and feel free to extend [11:50:52] and train window is one hour from now so plenty of time [11:51:07] I think I will promote group1 [11:51:10] just finishing the scaps now, https://phabricator.wikimedia.org/T291123#7358635 was what took most time today :/ [11:51:12] so we can do rest of wikis later tonight [11:51:35] oh [11:51:46] I guess some do a git pull inside extensions/Foobar [11:51:47] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/includes/MentorDashboard/MenteeOverview/UncachedMenteeOverviewDataProvider.php: 529f86c5a998820c32e7d7f2d952317080383e05: UncachedMenteeOverviewDataProvider: Do not fatal with zero mentees (T291088) (duration: 01m 04s) [11:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:01] which is how we used to deploy extensions/skins [11:52:11] that's...not what they're supposed to do, those days, at least :) [11:52:12] when nowadays they are submodules bumps so one has to submodule update 
extensions/Foobar [11:52:25] might just be an old habit yeah [11:52:29] maybe [11:52:32] anyway, I'm done hashar [11:52:36] feel free to take over [11:52:45] moare bug fixed thank you! [11:53:01] I will check the status of blocked tasks and prepare for the group1 promote in one hour from now [11:53:26] great! [11:59:38] (is there a phab admin around who could block a spammer? https://phabricator.wikimedia.org/p/International1993/ ) [11:59:56] kostajh: I can do it [12:00:01] kostajh: sure, done [12:00:04] thx [12:00:04] !log start OSM re-import script in maps2009 (depooled) [12:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1200) [12:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:10] hnowlan: ^ [12:00:35] mbsantos: ack, thanks [12:01:30] kostajh: you can do it yourself ;). https://phab-ban.toolforge.org/ allows most people to ban Phab accounts [12:01:56] thx [12:03:41] s/most people/people in acl*userdisable project [12:04:06] Right. But we add a lot of people there :) [12:05:57] RECOVERY - Check the NTP synchronisation status of timesyncd on mx2002 is OK: OK: synced at Thu 2021-09-16 12:05:55 UTC. https://wikitech.wikimedia.org/wiki/NTP [12:08:35] !log Deploy schema change on s2 codfw (lag will show up) T290057 [12:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:40] T290057: Optimize flaggedtemplates tables in production - https://phabricator.wikimedia.org/T290057 [12:17:39] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10ERayfield) 05Invalid→03In progress Thank you for the kind welcome. My manager, Maggie Epps, has it on my list of onboarding items to do. Since my knowledge of the system are limited at thi... 
[12:21:42] 10SRE, 10LDAP-Access-Requests: Grant Access to WMF for Rui Huang - https://phabricator.wikimedia.org/T290991 (10cmooney) Hi Rui, the additional access should now be set up. Can you test and let us know if everything is ok? thanks! [12:22:02] (03CR) 10Thiemo Kreuz (WMDE): "Can someone please tell this bot to not add me to patches I have nothing to do with?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720976 (owner: 10PipelineBot) [12:22:18] (03CR) 10Thiemo Kreuz (WMDE): "Can someone please tell this bot to not add me to patches I have nothing to do with?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720980 (owner: 10PipelineBot) [12:22:30] (03CR) 10Thiemo Kreuz (WMDE): "Can someone please tell this bot to not add me to patches I have nothing to do with?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720981 (owner: 10PipelineBot) [12:22:50] (03PS1) 10BBlack: VarnishTrafficDrop: fix site label in summary [alerts] - 10https://gerrit.wikimedia.org/r/721507 (https://phabricator.wikimedia.org/T291149) [12:25:08] (03CR) 10jerkins-bot: [V: 04-1] VarnishTrafficDrop: fix site label in summary [alerts] - 10https://gerrit.wikimedia.org/r/721507 (https://phabricator.wikimedia.org/T291149) (owner: 10BBlack) [12:28:05] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10BBlack) The solution to this in the icinga version of this check was to include an additional term in the prometheus query that would cause a null result if the a... [12:33:17] (03CR) 10Muehlenhoff: apt::package_from_component: add update condition for multiple packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721275 (owner: 10Hnowlan) [12:35:40] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10cmooney) Hi @ERayfield, Welcome to the foundation! 
I am relatively new myself, and first time taking care of these tasks, but I believe the process is best documented here: https://wikitech.... [12:41:40] (03PS2) 10Arturo Borrero Gonzalez: wmcs: nfs: nfs-manage: don't hardcode /24 netmasks [puppet] - 10https://gerrit.wikimedia.org/r/721503 [12:45:48] (03PS3) 10Arturo Borrero Gonzalez: wmcs: nfs: nfs-manage: don't hardcode /24 netmasks [puppet] - 10https://gerrit.wikimedia.org/r/721503 [12:48:50] (03CR) 10Arturo Borrero Gonzalez: "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/31100/" [puppet] - 10https://gerrit.wikimedia.org/r/721503 (owner: 10Arturo Borrero Gonzalez) [12:49:00] (03PS3) 10Vgutierrez: envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) [12:51:51] (03PS4) 10Vgutierrez: envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) [12:52:20] (03PS19) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [12:54:18] (03PS3) 10Vgutierrez: envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) [12:59:06] (03PS4) 10Vgutierrez: envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) [13:00:05] hashar and twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train - European+American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1300). 
[13:04:26] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10ema) p:05Triage→03Medium [13:07:10] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10ema) [13:07:48] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10ema) >>! In T291148#7358739, @BBlack wrote: > The solution to this in the icinga version of this check was to include an additional term in the prometheus query t... [13:08:19] tchou tchou train again [13:09:26] (03PS1) 10Hashar: group1 wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721519 [13:09:28] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721519 (owner: 10Hashar) [13:10:25] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721519 (owner: 10Hashar) [13:11:47] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.23 [13:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:59] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10Aklapper) 05In progress→03Open Thanks for the clarification! > Request access to the LDAP-wmf group through Phabricator Hmm, this makes me wonder if the "Purpose" field in the SRE task temp... 
[13:12:51] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.23 (duration: 01m 04s) [13:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:15] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:16:20] no new errors! [13:16:37] (03PS4) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) [13:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:56] (03CR) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:17:37] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:17:49] (03CR) 10Muehlenhoff: [C: 03+2] Prefer mx2001 over mx1001 for internal smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/721289 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [13:17:53] uh [13:18:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:53] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:22:08] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10ssastry) Thanks! I've started tests now. Will have results in about 10 hours. [13:23:06] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10Urbanecm) ERayfield appears to be a new software engineer, so they likely need the WMF group to be able to +2 in mediawiki/* repositories. [13:24:56] !log poiol mw1422 mw1455 [13:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:02] !log pool mw1422 mw1455 [13:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:52] 10SRE, 10SRE Observability, 10Traffic, 10Patch-For-Review: VarnishTrafficDrop IRC alert does not include DC name anymore - https://phabricator.wikimedia.org/T291149 (10ema) p:05Triage→03Medium [13:29:35] (03PS5) 10Vgutierrez: envoyproxy: Allow setting http2 protocol options [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) [13:30:44] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) Last set of benchmarks of Round 1, we added a run with 6 pods x 8 workers: https://people.wikimedia.org/~jiji/benchmarks-bare... 
[13:32:19] (03PS2) 10Ema: VarnishTrafficDrop: fix site label in summary [alerts] - 10https://gerrit.wikimedia.org/r/721507 (https://phabricator.wikimedia.org/T291149) (owner: 10BBlack) [13:34:12] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Marostegui) Talked to @MoritzMuehlenhoff on IRC; we are going to wait 2 weeks to have a meeting to sync up about this once @Kormat and @LSobanski are back [13:39:53] (03CR) 10Ssingh: [C: 03+2] Add durum hosts durum[12345]00[12] to BGP anycast [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [13:41:08] (03Merged) 10jenkins-bot: Add durum hosts durum[12345]00[12] to BGP anycast [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [13:44:42] !log homer: running for Gerrit: 721018: set up BGP peering to durum hosts in {eqiad,codfw,esams,ulsfo,eqsin} [13:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:05] uh.. ^^ sukhe is that you? 
[13:50:37] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:51:20] I am certainly running homer but not sure if it's me [13:51:27] checking [13:52:49] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:00] hm [13:53:10] In around 1h we'll have wikitech in read only mode for a few minutes for a migration (https://phabricator.wikimedia.org/T167973) [13:53:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:23] XioNoX, topranks ^ can this be due to the fact that hosts are not yet up? [13:54:04] (03PS3) 10Ssingh: site: update role for durum[12345]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/721021 (https://phabricator.wikimedia.org/T289536) [13:54:09] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:05] (03CR) 10Ssingh: [C: 03+2] site: update role for durum[12345]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/721021 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [13:55:17] merging the role, let's see [13:55:29] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:59] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:03] oh yeah [13:57:16] that was it. that's interesting. 
the order does matter, which makes sense [13:57:44] (03PS1) 10Dzahn: switch mwmaint.discovery.wmnet from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/721546 (https://phabricator.wikimedia.org/T267607) [13:58:17] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:58:23] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:58:28] ^ resolving these [13:58:57] (03CR) 10Dzahn: DHCP: switch mwmaint2002 from stretch to buster installer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:03:15] (03CR) 10Muehlenhoff: [C: 03+1] switch mwmaint.discovery.wmnet from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/721546 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:03:29] (03CR) 10Muehlenhoff: [C: 03+1] DHCP: switch mwmaint2002 from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:03:36] (03CR) 10Dzahn: [V: 03+1] "We have tests for this:)" [dns] - 10https://gerrit.wikimedia.org/r/721546 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:04:36] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10Krinkle) [14:07:53] (03PS1) 10Muehlenhoff: mediawiki: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/721549 [14:08:42] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [14:09:03] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:09:11] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) @WDoranWMF I'd ask you to please wait for the deployment until T290731 is resolved and a n... [14:09:17] (03CR) 10Dzahn: [V: 03+1 C: 03+2] switch mwmaint.discovery.wmnet from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/721546 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:09:27] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 74, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:59] sukhe: just seeing this. [14:12:11] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:33] Yes the message makes sense if you add a BGP adjacency to a local server/host and it is not created / ready yet. 
[14:12:43] !log switching https://noc.wikimedia.org from codfw to eqiad (T287539, T267607) [14:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:49] T267607: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 [14:12:50] T287539: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 [14:13:33] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:50] topranks: thanks, yes! noted for future :) [14:13:55] sorry for the noise [14:14:02] (03CR) 10Dzahn: "mwmaint.discovery.wmnet switched, tested noc.wikimedia.org with httpbb tests, watching logs on both backends" [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:15:24] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "PASS: 3 requests sent to mwmaint1002.eqiad.wmnet. All assertions passed." [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:16:01] sukhe: No probs at all [14:16:08] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) Coming to logstash: right now on bare metal we rely the logs to rsyslogd talking to it via TCP on localhost. This is not possible on kubern... 
[14:18:53] (03PS20) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [14:20:12] (03CR) 10Dzahn: "logs in codfw became silent" [dns] - 10https://gerrit.wikimedia.org/r/721546 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [14:21:03] (03PS1) 10Elukey: role::deployment_server: add the admin_ng_services private helmfile conf [labs/private] - 10https://gerrit.wikimedia.org/r/721551 [14:21:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::deployment_server: add the admin_ng_services private helmfile conf [labs/private] - 10https://gerrit.wikimedia.org/r/721551 (owner: 10Elukey) [14:21:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2002.codfw.wmnet with reason: reimage [14:21:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwmaint2002.codfw.wmnet with reason: reimage [14:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:30] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31101/console" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [14:25:59] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) [14:26:13] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) [14:26:28] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) [14:26:31] (03PS1) 10Elukey: 
role::deployment_server: fix admin_ng helfile fake config hiera key [labs/private] - 10https://gerrit.wikimedia.org/r/721553 [14:26:35] 10SRE, 10Wikimedia-Mailing-lists: Please subscribe Majavah to ops mailing list - https://phabricator.wikimedia.org/T291191 (10Majavah) [14:26:40] (03PS1) 10Muehlenhoff: Switch a few domains over to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721554 (https://phabricator.wikimedia.org/T286911) [14:26:42] (03PS1) 10Muehlenhoff: Switch remaining MX records to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721555 (https://phabricator.wikimedia.org/T286911) [14:26:46] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) [14:26:52] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::deployment_server: fix admin_ng helfile fake config hiera key [labs/private] - 10https://gerrit.wikimedia.org/r/721553 (owner: 10Elukey) [14:27:04] (03CR) 10Effie Mouzeli: [C: 03+1] thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:28:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31102/console" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [14:31:42] (03PS2) 10Effie Mouzeli: tegola-vector-tiles: use v0.3 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 [14:32:44] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10dancy) [14:32:50] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Please subscribe Majavah to ops mailing list - https://phabricator.wikimedia.org/T291191 (10Ladsgroup) a:03Ladsgroup Let me double check with people and let you know. 
[14:32:54] (03PS21) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [14:33:40] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:48] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 323, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:05] (03CR) 10Alexandros Kosiaris: profile::configmaster::disc_desired_state.py: update after switchover (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/721244 (owner: 10Elukey) [14:35:07] (03PS2) 10Muehlenhoff: Switch a few domains over to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721554 (https://phabricator.wikimedia.org/T286911) [14:35:07] !log reimaging mwmaint2002 to buster (T267607, T245757) [14:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:14] T267607: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 [14:35:14] T245757: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 [14:35:23] (03PS2) 10Muehlenhoff: Switch remaining MX records to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721555 (https://phabricator.wikimedia.org/T286911) [14:38:44] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) 05Stalled→03In progress [14:39:00] (03CR) 10Elukey: profile::configmaster::disc_desired_state.py: update after switchover (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721244 (owner: 10Elukey) [14:39:10] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [14:39:14] 10SRE, 10serviceops, 10Patch-For-Review: 
upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) 05Stalled→03In progress [14:39:22] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [14:40:13] 10SRE, 10Commons, 10Datasets-Archiving, 10Datasets-General-or-Unknown, and 2 others: Back up of Commons files - https://phabricator.wikimedia.org/T160229 (10jcrespo) 05Open→03In progress [14:40:20] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [14:40:32] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) 05Open→03In progress [14:41:03] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Please subscribe Majavah to ops mailing list - https://phabricator.wikimedia.org/T291191 (10Ladsgroup) 05Open→03Resolved Got approval and done now. [14:42:08] (03CR) 10Herron: Switch a few domains over to mx2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/721554 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [14:44:39] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [14:44:43] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ssingh) 05Open→03Resolved a:03ssingh durum has been deployed and is now running on all our PoPs. Marking this as closed. Thanks to @Dzahn for helping create all the VMs! 
[14:45:49] (03CR) 10Muehlenhoff: Switch a few domains over to mx2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/721554 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [14:46:20] In around 15 minutes we'll have wikitech in read only mode for a few minutes for a migration (https://phabricator.wikimedia.org/T167973) [14:51:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2002.codfw.wmnet with reason: REIMAGE [14:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:56] (03CR) 10Herron: [C: 03+1] Switch a few domains over to mx2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/721554 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [14:53:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:53:31] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mwmaint2002.codfw.wmnet with reason: REIMAGE [14:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mwmaint2002.codfw.wmnet with reason: reimage [14:55:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mwmaint2002.codfw.wmnet with reason: reimage [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:16] (03CR) 10Ladsgroup: miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [15:00:05] marostegui, andrewbogott, bd808, Amir1, Reedy, and kormat: It is that lovely time of the day again! You are hereby commanded to deploy Wikitech migration. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1500). [15:00:11] \o/ [15:00:12] o/ [15:00:18] o/ [15:00:47] (03CR) 10Muehlenhoff: [C: 03+2] Switch a few domains over to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721554 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [15:01:19] So from the DB side this is the workflow: set wikitech on read-only, check it, change mediawiki to point it to s6, rename tables on m5 master, remove read-only,see what breaks [15:01:25] something that needs to be done from wmcs? 
[15:02:09] (03PS4) 10Elukey: profile::configmaster::disc_desired_state.py: update after switchover [puppet] - 10https://gerrit.wikimedia.org/r/721244 [15:02:11] (03PS6) 10Marostegui: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) [15:02:55] marostegui: I don't think there is anything special to do from the WMCS side. The mw config change should do all the needful things [15:03:03] excellent, so I am going to start [15:03:24] !log Set wikitech on read-only (from now on all SAL changes will fail) T167973 [15:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:30] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [15:03:45] (03CR) 10Elukey: [C: 03+2] "Going to merge and see if there are discrepancies :)" [puppet] - 10https://gerrit.wikimedia.org/r/721244 (owner: 10Elukey) [15:03:47] sadly, sal may fail to log to wikitech :-D [15:04:10] marostegui: do you want me to make a patch for the read-only on mediawiki side? [15:04:25] Coordinating here or in a different channel? 
[15:04:26] Amir1: it is done via dbctl [15:04:29] andrewbogott: here [15:04:31] * andrewbogott hoping there's nothing to coordinate [15:04:45] okay cool [15:04:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set wikitech on read-only for maintenance T287454', diff saved to https://phabricator.wikimedia.org/P17283 and previous config saved to /var/cache/conftool/dbconfig/20210916-150444-marostegui.json [15:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:51] ok, let's see if wikitech is RO [15:04:51] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 [15:05:08] (03CR) 10Herron: [C: 03+1] alertmanager: set search-platform team [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [15:05:09] actually, that log went through :-) [15:05:28] "Warning: The database has been locked for maintenance, so you will not be able to publish your edits right now. " looks right [15:05:29] "Warning: The database has been locked for maintenance" [15:05:31] It is RO for me [15:05:34] yep [15:05:39] Excellent, going to push MW change now [15:05:50] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Krinkle) There is also monolog [ErrorLogHandler](https://github.com/Seldaek/monolog/blob/2.3.4/src/Monolog/Handler/ErrorLogHandler.php) which mi... [15:06:01] (03CR) 10Marostegui: [C: 03+2] wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [15:06:09] This is the change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708716 [15:06:17] Should I scap-file for all those files individually? 
[15:08:22] Amir1: ^ is there a better way today to send out a pile of wmf-config files than individually (/me is way out of mw deploy practice) [15:08:43] bd808: I don't think so [15:08:53] you have to keep in mind which order is important [15:09:00] :-/ [15:09:06] so which order should be the one? [15:09:10] this helps a bit https://deploy-commands.toolforge.org/bacc/708716 [15:09:15] (03Merged) 10jenkins-bot: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [15:09:16] but not on the order [15:09:21] PROBLEM - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:31] let me think [15:09:48] s6.dblist before labswiki.yaml for sure [15:09:48] Going to stop replication meanwhile on m5->s6 [15:10:11] the rest maybe don't matter much on ordering? [15:10:18] db-codfw.php can go first [15:10:24] the test one doesn't matter [15:10:57] and the list probably can be mostly asynchronous if they are for jobs [15:10:57] ACKNOWLEDGEMENT - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:14] I am unsure about labswiki.yaml [15:11:54] okay, I looked at the add a wiki page [15:11:57] * Reedy waves [15:11:59] which has the similar problem [15:12:08] (03CR) 10Bstorm: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/721503 (owner: 10Arturo Borrero Gonzalez) [15:12:09] Reedy: welcome to the party! [15:12:14] dblists/s6 should got first [15:12:17] traffic was a bit meh [15:12:39] air traffic? 
[15:12:43] This is what I have for now in terms of order from the comments here: https://phabricator.wikimedia.org/P17284 [15:12:45] no no [15:12:45] unfortuantely not :D [15:12:49] https://phabricator.wikimedia.org/T276246 [15:12:56] this says db-eqiad [15:13:04] so first, db-codfw, then db-eqiad [15:13:11] then dblists s6 [15:13:18] then dblists 10 [15:13:46] the labswiki yaml doesn't matter but preferably last [15:13:46] dblists10 is a deletion [15:14:08] haha, I don't know how scap handles deletion [15:14:15] I think it doesn't [15:14:17] we can maybe pause and check for hard 5xx with custom codfw load after codfw? [15:14:19] but it doesn't matter [15:14:21] (03PS1) 10Elukey: profile::configmaster:disc_desired_state: set more service statuses [puppet] - 10https://gerrit.wikimedia.org/r/721562 [15:14:48] So how about this order: https://phabricator.wikimedia.org/P17284 ? [15:14:50] marostegui: you can create an empty file and sync it [15:14:54] (03CR) 10Elukey: [C: 03+2] profile::configmaster:disc_desired_state: set more service statuses [puppet] - 10https://gerrit.wikimedia.org/r/721562 (owner: 10Elukey) [15:14:57] Amir1: if you sync-dir the parent dir, it should do the delete [15:15:11] cool [15:15:12] Reedy: is there a way to sync everything at once? [15:15:19] Reedy: or do we have to go file by file? [15:15:20] marostegui: it would break thigns [15:15:24] ah ok [15:15:29] been there, done that [15:15:56] So the above order looks good? Apart from having to handle s10.dblist in a special way? [15:15:59] a full scap would sync it all at once, but the config changes might not work as hoped. the mw config caching system is a bit fragile [15:16:07] db-codfw.php now, and then see? [15:16:10] because there is no guarantee on order of arrival so half of things would start to 500 if they get the wrong file first [15:16:20] I see [15:16:27] yup [15:16:31] sounds good to me [15:16:33] jynus: but I guess it will fail cause s6.dblist isn't synced? 
[15:16:53] I think dblist is only used for cron and jobs and other stuff, but I could be wrong [15:17:00] Ah ok [15:17:10] Ok, so let's proceed with db-codfw.php? [15:17:15] yeah, the section dblists aren't used with a great deal of things [15:17:25] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (10MoritzMuehlenhoff) [15:17:33] great [15:17:53] going for codfw then [15:18:17] we can open mwdebug2XXX, and if everything is 5XX we rethink [15:18:27] sounds good [15:18:34] not sure it is the right decision, but I think it is the safest [15:18:34] it will be RO [15:18:47] (03PS1) 10Arturo Borrero Gonzalez: spicerack: add DRBD controller [software/spicerack] - 10https://gerrit.wikimedia.org/r/721563 [15:19:25] !log marostegui@deploy1002 Synchronized wmf-config/db-codfw.php: Wikitech move from s10 to s6 T167973 (duration: 01m 05s) [15:19:26] marostegui@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:19:27] let's test! [15:19:28] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [15:19:31] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [15:19:34] We are on it stashbot! [15:19:35] (03CR) 10Ladsgroup: [C: 04-1] Switched from cron to systemd timer for elasticsearch modules (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [15:20:05] I can navigate fine from mwdebug2002 [15:20:09] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) https://noc.wikimedia.org (mwmaint.discovery.wmnet) has been switched from codfw to eqiad. mwmaint2002 has been upgraded to buster. monitoring all green. 
[15:20:29] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) 05In progress→03Resolved [15:20:36] Amir1: afaik yaml files aren't read in prod [15:20:42] how to check we are on the right db set? [15:20:51] They're used at build time only for the dblist [15:20:53] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [15:21:10] jynus: sql on mwmaint1002; this will show you an IP, then you can go from there [15:21:16] (that's what i do when creating new wikis) [15:21:34] Krinkle: I thought so too (specially given T223602) but I wasn't sure [15:21:34] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [15:22:02] urbanecm: but 1002 will use db-eqiad I guess? [15:22:05] You want the db php first which changes query routing to the new db [15:22:14] "Error: unable to get reader index" [15:22:18] marostegui: right -- you can use mwmaint2002 instead [15:22:23] but mayby that happens to use the dblists :-) [15:22:30] Then the db list later as otherwise will be looking for a db list file that doesn't exist [15:22:36] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) 05In progress→03Resolved https://noc.wikimedia.org (mwmaint.discovery.wmnet) has been switched from codfw to eqiad.... 
[15:23:02] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [15:23:35] jynus: I have live hacked mwmaint2002 and I still get that error [15:23:53] DBConnection [15:23:53] error: [15:23:53] :real_connect(): (HY000/1049): Unknown database 'labswiki' [15:23:58] ah right [15:23:58] on logstash [15:24:00] I know why [15:24:04] ? [15:24:09] codfw doesn't have labswiki yet :) [15:24:13] ah! [15:24:19] that'd explain it :) [15:24:27] then it really makes it complicated to test it [15:24:39] db2117 [15:24:44] as long as that is s6 [15:24:46] we should be good [15:24:52] yes [15:24:53] that's s6 [15:24:57] (on eqiad) [15:24:59] so it worked [15:25:07] db2116 is codfw s6 [15:25:08] * andrewbogott waves excitedly at T237773 [15:25:08] marostegui: you can do scap pull on mwmaint [15:25:08] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10akosiaris) > via TCP on localhost. UDP not TCP, (I am just being pedantic, I know). [15:25:14] db2117 sorry [15:25:22] (03CR) 10jerkins-bot: [V: 04-1] spicerack: add DRBD controller [software/spicerack] - 10https://gerrit.wikimedia.org/r/721563 (owner: 10Arturo Borrero Gonzalez) [15:25:26] Amir1: Ah I see [15:25:36] so what's next then? We've only sync-ed db-codfw.php [15:25:38] wfGetDB(DB_REPLICA) in shell.php with labswiki says db2087:3316, that's s6 in codfw it looks [15:25:38] it also cleans your hacks [15:25:39] What should be next? [15:25:47] urbanecm: correct [15:25:48] sync the rest [15:25:49] :D [15:25:55] pull on mwmaint1002 db-eqiad? [15:25:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:25:57] mwdebug-deploy@deploy1002: Failed to log message to wiki. Somebody should check the error logs. 
[15:26:46] with that, if something breaks, we would see it before applying to all the cluster? what do you think? [15:26:48] ok, let's do that [15:27:06] (I am not super sure of what I saying, BTW) [15:27:14] that is why I am asking [15:27:34] we can also pull on mwdebug1002 and navigate there [15:27:38] thoughts? [15:27:52] oh, sorry [15:27:54] that is what I meant [15:27:58] mwdebug [15:28:00] not mwmaint [15:28:07] I thought you meant mwmaint to test sql labswiki [15:28:11] Let's go for mwdebug [15:28:17] we can do both? :-) [15:28:27] yeah :-) [15:28:30] I think we should have an evaluation some time soon where we think this through a bit more for future reference and document it. Eg consider read only mode, and maybe replicating to the new section in both DCs first etc then switch masters, or document why we don't/can't/shouldn't [15:28:31] 10SRE, 10SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (10cmooney) @MBinder, I think all we need is your new SSH public key when you have generated a new keypair. Please attach here and I can have it updated on production systems. I hop... [15:29:04] This conversion should not take place in the future at this low level of uncertainty [15:29:40] tested after scap pull on mwmaint1002, I get sql labswiki working [15:29:49] it's on 10.64.16.187 [15:30:03] dig -x says db1165.eqiad.wmnet. [15:30:04] marostegui, ping when debug finishes [15:30:36] yup, it's s6 [15:30:43] Krinkle: I appreciate your comments, but let's please move that to a separate conversation. This move is very complicated and it is the first time we've moved this in production, given it is wikitech we thought we could "debug" a bit more. We did s8 move with just 0 downtime, but this is way different as it is moved from a misc section to a core section. So please let's stay on topic [15:31:02] Amir1: Can you do the pull on mwdebug too? 
[15:31:09] so we can browse the site [15:31:23] sure sure [15:32:07] mwdebug1002 has it [15:32:12] checking! [15:32:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:33:00] mwdebug-deploy@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:33:10] I can see everything fine [15:33:27] let's sync then [15:33:28] * urbanecm too [15:33:39] ok, let's go with this order then? https://phabricator.wikimedia.org/P17284 [15:33:44] we actually can handle some async, as long as we are on read only [15:33:46] m.arostegui: ack, that's precisely what I meant. Schedule an evaluation soon (not now) so that next time it's not like this. [15:33:49] as m5 won't just disappear [15:34:04] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=delete https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:34:04] jynus: I will rename tables on m5 before enabling RW [15:34:08] (Replication is also stopped) [15:34:11] the problem will be if something is still cached AND we are in read write [15:34:12] yup [15:34:16] that will help [15:34:36] the sync order looks good to me [15:34:39] so I think we aer good so far [15:34:56] GOing to start syncing but we still need to think what to do with s10.dblists [15:34:57] jynus: it would cause some 500 but given that wikitech is not read much, It's inside the error budget if you ask me [15:35:12] Here it comes db-eqiad.php [15:35:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: nfs: nfs-manage: don't hardcode /24 netmasks [puppet] - 10https://gerrit.wikimedia.org/r/721503 (owner: 10Arturo Borrero Gonzalez) [15:35:18] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:35:26] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:35:30] db lists are used to expand settings and can fatal all wikis if absent [15:35:52] Krinkle: yep, but how to sync a file we are deleting? [15:35:59] Sync the directory [15:36:03] I did say this earlier :P [15:36:10] a bit counterintuitive but sync-files takes directory arguments [15:36:17] since y'know it's mostly just rsync 🙃 [15:36:19] Yeah, I have never done that before [15:36:22] !log marostegui@deploy1002 Synchronized wmf-config/db-eqiad.php: Wikitech move from s10 to s6 T167973 (duration: 01m 05s) [15:36:23] marostegui@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:36:24] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [15:36:28] We only sync db-eqiad.php and db-codfw.php :( [15:36:32] or calll it as sync-dit which is an alias [15:36:32] wikitech fatals [15:36:39] sync-sir [15:36:43] sync-dir [15:36:44] [5fb42fe3-654a-4a7b-82a1-aff052dc38c2] 2021-09-16 15:36:37: Fatal exception of type "Wikimedia\Rdbms\DBTransactionStateError" [15:36:44] yep, wikitech down for me [15:36:49] I suggest syncing s6 first then directory [15:36:50] syncing the dir [15:36:56] we need that list + config I am afraid [15:37:17] doing it [15:37:18] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [15:37:39] I am syncing s6 first [15:37:48] marostegui: don't worry, if it causes a full outage, scap stops [15:37:58] actuallyn no db errors [15:38:00] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:38:02] the error logs says `Caused by: [Exception Wikimedia\Rdbms\DBQueryError] (/srv/mediawiki/php-1.37.0-wmf.23/includes/libs/rdbms/database/Database.php:1809) Error 1142: SELECT command denied to user 'wikiuser'@'208.80.155.109' for table 'heartbeat' (db1168)` [15:38:04] so it must be config errors [15:38:04] it waits for canaries (10%) [15:38:18] !log marostegui@deploy1002 Synchronized dblists/s6.dblist: Wikitech move from s10 to s6 T167973 (duration: 01m 05s) [15:38:19] urbanecm: ah right, good one, jynus mind checking that? [15:38:20] marostegui@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:38:41] mmm [15:38:46] grants may be missing from that range [15:38:50] I am sycning now dblist dir [15:39:00] I checked only conenction errors, not grants ones [15:39:19] Might be missing those specific IPs yeah [15:39:29] classic [15:39:39] dblists dir synced, going to proceed with the rest of files on the order we agreed on [15:39:49] !log marostegui@deploy1002 Synchronized dblists/: Wikitech move from s10 to s6 T167973 (duration: 01m 05s) [15:39:50] 10SRE, 10SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (10MBinder_WMF) Thanks, @cmooney ! After much consternation and gratuitous error messages, the Clam has been reborn as simply Maxs-MacBook-Air. Apparently I've gotten less fun with a... [15:39:50] marostegui@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:40:04] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:19] andrewbogott: bd808 ^ expected I guess :) [15:40:50] yeah, I think just pybal noticing the fatals [15:40:52] probably? 
I'd ignore until the dust has settled from the db migration [15:41:02] !log marostegui@deploy1002 Synchronized tests/dblistTest.php: Wikitech move from s10 to s6 T167973 (duration: 01m 05s) [15:41:03] marostegui@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:41:23] I am now syncying the last file [15:41:46] jynus: I am almost done, want me to check the grants? [15:41:55] I certainly will need hel [15:42:00] ok, let me check [15:42:01] I just found what is missing [15:42:16] !log marostegui@deploy1002 Synchronized wmf-config/config/labswiki.yaml: Wikitech move from s10 to s6 T167973 (duration: 01m 04s) [15:42:17] marostegui@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [15:42:18] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [15:42:20] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:42:28] GRANT SELECT ON `heartbeat`.`heartbeat` TO `wikiuser`@`208.80.155.109` [15:42:33] ^marostegui [15:43:00] yeah, that looks good [15:43:09] that grant isn't on m5, so that's why it wasn't on the new hosts [15:43:12] and the target, everthing with an error on logstash: [15:43:14] do you apply or I do? [15:43:21] pleaes do, I can show you the page [15:43:35] done [15:43:40] wikitech up again [15:43:40] wikitech is back [15:43:44] ok, time to rename the tables [15:43:56] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:44:06] marostegui: before doing so, can you make sure _both_ wikitech hosts have the grants? 
IIRC we have two (labsweb1001 and labsweb1002) [15:44:09] *labweb [15:44:24] there was another 2XX ip [15:44:25] indeed [15:44:35] 208.80.154.160 [15:44:41] urbanecm: done thanks, good one [15:44:46] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:44:50] the account was there but not heartbeat [15:44:50] 208.80.154.160 and 208.80.155.109 [15:44:55] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10dancy) [15:45:02] * bd808 sees the wikitech main page [15:45:07] tables renamed on m5 [15:45:12] I will check more errors [15:45:18] in case there are more, I mean [15:45:21] wikitech still up with tables renamed on m5, so that's good [15:45:31] thanks marostegui :) [15:45:34] I was checking connection errors and was seeing none [15:45:36] At this point we are ready to enable writes [15:45:40] since the dust has settled, the underlying problem is that mediawiki-config sucks. Like it's really bad. IS.php is +20K lines and the bug I mentioned would have helped but since it's half-done, mediawiki-config is in a worse state atm. [15:45:42] because they were query errors :-) [15:46:09] A !log went through! https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL#2021-09-16 [15:46:15] uh?? [15:46:18] let me see [15:46:23] <_joe_> Amir1: hear, hear [15:46:38] as in, there was an account, but it required extra privs, which were the ones for the load balancer, so super-important :-( [15:46:46] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10dancy) [15:46:47] I edited my user page, seemed to work [15:46:52] We didn't set RO on MW level, did we? 
So that's why it's RW now [15:46:57] (unless all of s6 is RO) [15:47:08] urbanecm: we did on s10 [15:47:11] _joe_: I'm not saying it necessarily has to be that solution but it clearly needs resources and love [15:47:20] urbanecm: unless that...now s10 doesn't exist anymore [15:47:22] so maybe dbctl let it? [15:47:27] yes, but since you synced db-eqiad and db-codfw, wikitech now uses s6 [15:47:37] indeed [15:47:38] I can write [15:47:43] but it is writing to s6 [15:47:49] that's fair, but s6 shouldn't go read only [15:47:50] so RO is gone cause s10.dblist is gone? [15:47:58] I am glad we renamed the tables :) [15:48:00] so it came back automatically [15:48:01] cause you synced the db-* files :) [15:48:03] yeah [15:48:13] so we are effectively on RW [15:48:17] if we want RO, we can do it on MW level, like we did for the muswiki move. [15:48:18] which is ok, it was expected, just probably earlier thaan expected [15:48:34] urbanecm, don't worry, etc is a better method for that [15:48:52] although I think doesn't have wiki granularity [15:48:54] only seciont [15:49:00] yep, only sections [15:49:06] so while I was checking the grants [15:49:13] I missed the current status, all synced? [15:49:15] but we can do it in mw config if really nedded [15:49:16] so we are done [15:49:19] *needed [15:49:30] andrewbogott: bd808 the DB migration is done [15:49:36] thank you marostegui ! [15:49:40] 🎉 [15:49:40] If something tries to write to the old host, it will fail as the tables are renamed [15:49:45] marostegui: now it's done, what is s11? [15:49:46] can you double check from your side? [15:49:55] Amir1: I think it is labtestwiki or something like that [15:50:01] jeez [15:50:03] Amir1: a train line in Berlin [15:50:06] congrats [15:50:08] yes, thanks marostegui :) [15:50:10] can we just kill it? 
[15:50:17] I believe it's still used [15:50:19] Amir1: it is owned by WMCS I believe [15:50:22] Amir1: no, we use it [15:50:30] Amir1, ask andrewbogott for that, he is just a DBA :-D [15:51:04] we should start naming our non-mw sections u then, u1, u2, Reedy [15:51:08] the test/staging WMCS stuff in codfw needs a wiki as part of it's testable surface [15:51:10] labtestwiki is https://labtestwikitech.wikimedia.org/wiki/Main_Page [15:51:12] the DBA that fixed wikitech, but just a DBA like the ancient romans said during parades [15:51:12] okay then :D [15:51:21] So for the clean up, I will wait till next week, I am not going to delete old tables or anything [15:51:30] But, I believe that its DB is now locally hosted by me rather than on the official dbservers [15:51:31] +1 [15:51:32] (03CR) 10Krinkle: "There is another mention of s10 in modules/profile/types/mariadb/valid_section.pp [1]" [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [15:51:40] so, the config is needed but not the DB (if you're seeing stray tables someplace) [15:52:01] In case you see errors, the old tables are named this way: T167973_OLDTABLENAME, ie: T167973_text [15:52:07] !log marostegui is awesome and made wikitech better today. :) [15:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:16] I would like to doublecheck that before anything gets deleted (labtestwikitechwise) :) [15:52:23] ^ imagine that fails bd808 XDDD [15:52:29] andrewbogott: yes, no rush at all [15:52:37] and I'm also able to do xwiki rights changes on wikitech! Sounds good 🙂 [15:53:01] next step, CentralAuth? :D [15:53:02] cdanis: could you check https://gerrit.wikimedia.org/r/c/operations/puppet/+/708631 ? 
[15:53:07] urbanecm: sweet [15:53:16] Reedy: *moving to mw* hosts ;) [15:53:17] Reedy: CA is in s7 [15:53:18] Reedy: let me know when so I can be on holidays [15:53:22] Reedy: not quite, but T237773 :) [15:53:23] T237773: Move Wikitech onto the production MW cluster - https://phabricator.wikimedia.org/T237773 [15:53:46] heh [15:53:50] Reedy: T237773 is next, /then/ central auth [15:53:54] and then we can do the last SUL migration :) [15:54:04] heh [15:54:07] Party time [15:54:10] How come stashbot captions Bryan's bug refs but not mine? unfair! [15:54:27] rate limiting :-) [15:54:30] (03CR) 10CDanis: [C: 03+1] "LGTM assuming it is staying in codfw for now" [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [15:54:32] * urbanecm will try to document as doing the foundation.wikimedia.org SUL migration :D [15:54:40] I wonder what happens to previous users in a SUL-enabled wiki? [15:54:40] lol [15:54:40] it's got a ~5 minute counter to avoid spamming for the same task [15:54:57] arturo: TLDR: It's a mess [15:55:01] arturo: You can claim your accounts [15:55:06] (03CR) 10Marostegui: [C: 03+1] "Thanks Chris, yes, it will be migrated to codfw in a few days" [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [15:55:10] arturo: it _should_ be possible to reclaim them...in theory [15:55:11] If they're the same name, and you know both passwords etc [15:55:17] hmmm it didn't caption it when I mentioned it earlier either but maybe it was in the backscroll already [15:55:26] I remember it was being migrated during the original SUL work in 2014 [15:55:30] I see [15:55:59] that's why there are lots of users like Foo~barwiki [15:56:18] all renamed automatically [15:56:44] those are name collisions from the various pre-sul wikis. 
Which may happen a bit for wikitech too, but not as much [15:57:12] <_joe_> arturo: buy a beer to legoktm next time we meet in person and ask him about the SUL migration :P [15:57:26] <_joe_> nevermind, probably something stronger [15:57:33] heh [15:57:57] legoktm was 100% the hero of that project. [15:59:12] * legoktm blushes [15:59:18] also good morning :) [15:59:53] * urbanecm waves to legoktm [16:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:19] * legoktm waves back [16:00:38] * urbanecm hides into a meeting [16:00:55] urbanecm: thanks for your help [16:01:02] any time :) [16:01:11] I'm glad it was done :) [16:03:07] This is what's next: https://phabricator.wikimedia.org/T167973#7359504 [16:04:05] !log Disconnect s6 master from m5 master (noting the replication position) [16:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:13] (03PS1) 10MewOphaswongse: Use growthexperiments-structuredtask-no-suggestions-found-dialog-button in outdated suggestions dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721570 [16:04:14] !log Disconnect s6 master from m5 master (noting the replication position) T167973 [16:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:20] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [16:06:15] (03CR) 10Marostegui: mariadb.yaml: Change replication_type [puppet] - 10https://gerrit.wikimedia.org/r/721421 (https://phabricator.wikimedia.org/T291144) (owner: 10Marostegui) [16:06:19] (03CR) 10Marostegui: [C: 03+2] mariadb.yaml: Change replication_type [puppet] - 10https://gerrit.wikimedia.org/r/721421 
(https://phabricator.wikimedia.org/T291144) (owner: 10Marostegui) [16:10:08] RECOVERY - SSH on gerrit2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:03] (03CR) 10Ottomata: [C: 03+1] Remove obsolete java::security [puppet] - 10https://gerrit.wikimedia.org/r/719261 (https://phabricator.wikimedia.org/T282454) (owner: 10Muehlenhoff) [16:12:27] (03PS1) 10Ebernhardson: [DNM] Have pcc build a prod catalog to look at [puppet] - 10https://gerrit.wikimedia.org/r/721572 [16:13:58] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Marostegui) [16:17:40] !log btullis@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-test-coord1002.eqiad.wmnet [16:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:57] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:35] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10WDoranWMF) @Daimona can you let @hnowlan once everything is ready to be deployed and please update... [16:32:54] (03PS1) 10Ebernhardson: query_service: Disable query logging [puppet] - 10https://gerrit.wikimedia.org/r/721575 [16:33:52] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) >>! In T285857#7359608, @WDoranWMF wrote: > @Daimona can you let @hnowlan once everything... 
[16:37:16] (03Abandoned) 10Ebernhardson: [DNM] Have pcc build a prod catalog to look at [puppet] - 10https://gerrit.wikimedia.org/r/721572 (owner: 10Ebernhardson) [16:38:12] (03CR) 10Ebernhardson: "Verified by manually editing /etc/default/wcqs-blazegraph on wcqs1001, this allows blazegraph to finish booting up and respond to basic re" [puppet] - 10https://gerrit.wikimedia.org/r/721575 (owner: 10Ebernhardson) [16:39:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-test-coord1002.eqiad.wmnet [16:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:28] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10Bstorm) So far so good: `lang=shell-session [bstorm@labstore1005]:~ $ sudo journalctl -S "2021-09-15" | grep "Controller encountered a fatal error a... [16:58:23] 10SRE, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10jcrespo) @ArielGlenn I just saw your question today- hopefully you saw the gradual updates at T262668 already :-D. Sadly, we... [16:59:10] 10SRE, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10ArielGlenn) That's great to hear and I look forward to a future discussion! [17:00:05] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1700). 
[17:00:35] (03PS1) 10Michael DiPietro: create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) [17:02:15] (03CR) 10jerkins-bot: [V: 04-1] create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [17:03:52] I think some jobs are failing on labswiki [17:03:56] but I think I an fix it [17:04:27] I think we need to add the grants manuel added to the job db user, too [17:04:31] that should be an easy fix [17:04:51] hey [17:04:55] let me checl [17:04:58] marostegui, don't worry, I get it [17:05:04] you sure? [17:05:09] go back to your family time [17:05:21] I am staying longer because I will leave eary tomorrow :-) [17:05:22] thanks <3 [17:06:56] GRANT SELECT ON `heartbeat`.`heartbeat` TO `(admin user)`@`208.80.154.160`; [17:07:10] that is what I will add from the master, for that ip and the other [17:07:18] sounds good [17:07:24] marostegui, go! 
[17:07:26] :-D [17:09:21] !log deployed extra grants for admin user on s6 primary [17:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:41] and not I will check replication did not break [17:09:47] and that logstash is happy [17:11:23] looking good: https://logstash.wikimedia.org/goto/82c8acc37f43571b6f8d6272b4c30520 [17:13:37] and leaving this here: https://i.imgflip.com/5n7f69.jpg :-D [17:14:22] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (10odimitrijevic) [17:17:21] (03PS2) 10Ryan Kemper: query_service: Disable query logging [puppet] - 10https://gerrit.wikimedia.org/r/721575 (owner: 10Ebernhardson) [17:17:29] 10SRE, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Data-Engineering: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (10odimitrijevic) [17:18:42] hahaha [17:21:22] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721575 (owner: 10Ebernhardson) [17:29:31] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) 05In progress→03Stalled >>! In T290318#7359652, @Bstorm wrote: > So far so good: > > `lang=shell-session > [bstorm@labstore1005]:~ $ sudo... 
[17:30:24] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31104/console" [puppet] - 10https://gerrit.wikimedia.org/r/721575 (owner: 10Ebernhardson) [17:31:02] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) [17:31:08] !log turn of lldp agent on NIC (both ports) on ms-be2051 - T290984 [17:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:15] T290984: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 [17:32:04] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Disable query logging [puppet] - 10https://gerrit.wikimedia.org/r/721575 (owner: 10Ebernhardson) [17:36:53] (03CR) 10Dzahn: [C: 03+1] mediawiki: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/721549 (owner: 10Muehlenhoff) [17:38:41] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10Volans) All hosts have the same identifiers: ` $ sudo cumin 'ms-be105[1-9]*,ms-be205[2-6]*' 'ls -1 /sys/kernel/debug/i4... [17:44:10] (03PS1) 10Dzahn: thumbor: remove absented cron code for generate-thumbor-age-metrics [puppet] - 10https://gerrit.wikimedia.org/r/721589 (https://phabricator.wikimedia.org/T273673) [17:48:33] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) I sent a short recap to wikitech-l: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/6UZCCACCBCZ... 
[17:50:53] (03CR) 10Tobias Andersson: miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [17:51:54] (03CR) 10Tobias Andersson: miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [17:54:12] !log turn of lldp agent on NIC (both ports) on ms-be105[1-9],ms-be205[2-6] - T290984 [17:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:17] T290984: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1800). [18:00:05] mewoph: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] I can deploy today! [18:00:26] 10SRE, 10LDAP-Access-Requests: Grant Access to WMF for Rui Huang - https://phabricator.wikimedia.org/T290991 (10rhuang) Hi Cathal, I have the access now. Thank you so much for the help! [18:00:26] hi mewoph :) [18:00:40] 👋 [18:00:48] (03CR) 10Urbanecm: [C: 03+2] Use growthexperiments-structuredtask-no-suggestions-found-dialog-button in outdated suggestions dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721570 (owner: 10MewOphaswongse) [18:00:56] I'll ping you when it can be tested :) [18:01:10] mewoph: is there any reason to do only wmf.23? 
[18:01:37] (i guess since it's Thursday, Wikipedias will get promoted...soon) [18:02:25] The initial change to rename the message keys are in wmf.23 and I missed that one key. Here is the task for the rename https://phabricator.wikimedia.org/T290040 [18:02:41] ah, makes sense [18:02:48] (03PS1) 10Ryan Kemper: query_service: use log_dir param for query log [puppet] - 10https://gerrit.wikimedia.org/r/721594 [18:04:47] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31105/console" [puppet] - 10https://gerrit.wikimedia.org/r/721594 (owner: 10Ryan Kemper) [18:06:44] (03CR) 10Ryan Kemper: [V: 03+1] "PCC looks as expected: https://puppet-compiler.wmflabs.org/compiler1002/31105/" [puppet] - 10https://gerrit.wikimedia.org/r/721594 (owner: 10Ryan Kemper) [18:14:20] (03CR) 10RLazarus: [C: 03+1] "🚀" [puppet] - 10https://gerrit.wikimedia.org/r/721549 (owner: 10Muehlenhoff) [18:18:51] (03PS2) 10Ryan Kemper: query_service: tear out on-disk query event logs [puppet] - 10https://gerrit.wikimedia.org/r/721594 [18:19:23] (03Merged) 10jenkins-bot: Use growthexperiments-structuredtask-no-suggestions-found-dialog-button in outdated suggestions dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721570 (owner: 10MewOphaswongse) [18:20:58] mewoph: can be tested at mwdebug1001, if you want to have a look [18:21:23] checking now [18:21:41] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721594 (owner: 10Ryan Kemper) [18:23:04] (03PS1) 10Dzahn: geoip: replace maxmind update cron with system timer and config [puppet] - 10https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) [18:23:47] (03CR) 10jerkins-bot: [V: 04-1] geoip: replace maxmind update cron with system timer and config [puppet] - 10https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) 
[18:25:44] urbanecm: lgtm [18:25:49] mewoph: thanks, syncing [18:27:53] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/extension.json: bb8cba102fe417e8e41b7c4e9179d119c7d25a43: Use growthexperiments-structuredtask-no-suggestions-found-dialog-button in outdated suggestions dialog (1/2) (duration: 01m 07s) [18:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:50] (03PS4) 10Ebernhardson: Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 [18:29:00] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/addlink/AddLinkArticleTarget.js: bb8cba102fe417e8e41b7c4e9179d119c7d25a43: Use growthexperiments-structuredtask-no-suggestions-found-dialog-button in outdated suggestions dialog (2/2) (duration: 01m 06s) [18:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:05] mewoph: should be live [18:29:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:29:10] anything else i can do for you today? [18:30:22] (03PS3) 10Ryan Kemper: query_service: tear out on-disk query event logs [puppet] - 10https://gerrit.wikimedia.org/r/721594 [18:32:12] (03PS2) 10Dzahn: geoip: replace maxmind update cron with system timer and config [puppet] - 10https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) [18:33:16] (03CR) 10jerkins-bot: [V: 04-1] geoip: replace maxmind update cron with system timer and config [puppet] - 10https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:33:19] urbanecm: that's it for me, thanks a lot! [18:33:25] any time! 
[18:40:24] (03PS3) 10Dzahn: geoip: replace maxmind update cron with system timer and config [puppet] - 10https://gerrit.wikimedia.org/r/721595 (https://phabricator.wikimedia.org/T273673) [18:43:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:44:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:46:09] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [18:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:12] 10SRE: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367 (10cmooney) This BCM NIC/driver doesn't seem to support priv-flags via ethtool (where I believe that should show up): ` cmooney@lvs2010:~$ sudo /sbin/ethtool -i ens3f1np1 driver: bnxt_en version: 1.9.2 firmware-version: 214.0... [18:49:44] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:25] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:25] (03PS4) 10Ryan Kemper: query_service: tear out on-disk query event logs [puppet] - 10https://gerrit.wikimedia.org/r/721594 [18:52:32] (03CR) 10Ebernhardson: [C: 03+1] query_service: tear out on-disk query event logs [puppet] - 10https://gerrit.wikimedia.org/r/721594 (owner: 10Ryan Kemper) [18:53:31] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) [18:53:51] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) 05Open→03In progress [18:54:35] (03CR) 10Ryan Kemper: [C: 03+2] query_service: tear out on-disk query event logs [puppet] - 10https://gerrit.wikimedia.org/r/721594 (owner: 10Ryan Kemper) [18:55:56] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:31] 10SRE, 10LDAP-Access-Requests: Grant Access to WMF for Rui Huang - https://phabricator.wikimedia.org/T290991 (10cmooney) No problem :) [18:56:39] 10SRE, 10LDAP-Access-Requests: Grant Access to WMF for Rui Huang - https://phabricator.wikimedia.org/T290991 (10cmooney) 05Open→03Resolved [18:59:06] (03PS5) 10Ebernhardson: Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 [18:59:08] (03PS1) 10Ebernhardson: Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 [18:59:25] (03CR) 10jerkins-bot: [V: 04-1] Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 (owner: 10Ebernhardson) [18:59:44] (03CR) 10jerkins-bot: [V: 04-1] Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 (owner: 10Ebernhardson) [19:00:05] hashar and twentyafterfour: (Dis)respected human, time to deploy 
MediaWiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1900). Please do the needful. [19:00:23] wrapping up a meeting [19:00:47] (03PS6) 10Ebernhardson: Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 [19:01:01] (03PS2) 10Ebernhardson: Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 [19:02:05] ok hmm train [19:02:37] (03PS2) 10Michael DiPietro: create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) [19:03:06] (03CR) 10jerkins-bot: [V: 04-1] create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [19:03:25] (03PS1) 10Ottomata: Add analytics-research and analytics-platform-eng to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/721601 (https://phabricator.wikimedia.org/T284225) [19:03:38] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:03:51] (03PS2) 10Ottomata: Add analytics-research and analytics-platform-eng to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/721601 (https://phabricator.wikimedia.org/T284225) [19:04:32] (03PS3) 10Ottomata: Add analytics-research and analytics-platform-eng to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/721601 (https://phabricator.wikimedia.org/T284225) [19:05:34] train going [19:06:40] (03CR) 10Ottomata: [C: 03+2] Add analytics-research and analytics-platform-eng to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/721601 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [19:06:46] (03PS1) 10Hashar: all wikis to 
1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721602 [19:06:48] (03CR) 10Hashar: [C: 03+2] all wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721602 (owner: 10Hashar) [19:07:36] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721602 (owner: 10Hashar) [19:08:49] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.23 [19:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:15] (03PS3) 10Ryan Kemper: wdqs: remove codfw hourly restarts [puppet] - 10https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) [19:13:16] kibana interface is still loading :\ [19:13:38] hashar: I'll take a look at kibana [19:13:51] I guess it is bandwith heavy [19:13:58] or I should not use firefox for it [19:14:19] twentyafterfour: thx! :] [19:14:23] no it works fine in firefox [19:14:38] someone set the time period to 15 minutes and there are no logs for the last 15 minutes [19:14:45] set the range to 4 hours and it works [19:15:00] ahh [19:15:10] very few errors recently, that's good [19:15:14] looks quiet beside "Unresolved redirect from Q100917584 to Q95304986" [19:15:20] which sounds like a data error rather than a code one [19:15:27] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: 10Ryan Kemper) [19:15:52] yes but why is a data error getting into this log dashboard? that's a bug in itself [19:16:03] looks like everything got caught on group1 yesterday [19:16:33] I am filing it [19:16:58] yeah everything looks pretty clear other than that [19:21:17] so might be a {Success} [19:21:29] I am touring grafana dashboards [19:21:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:14] (03PS2) 10Ottomata: Point analytics-web CNAME at an-web1001 [dns] - 10https://gerrit.wikimedia.org/r/721327 (https://phabricator.wikimedia.org/T285355) [19:24:22] (03CR) 10Ottomata: [C: 03+2] "Proceeding" [dns] - 10https://gerrit.wikimedia.org/r/721327 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [19:25:20] jouncebot: nowandnext [19:25:21] For the next 1 hour(s) and 34 minute(s): MediaWiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T1900) [19:25:21] In 3 hour(s) and 34 minute(s): US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T2300) [19:26:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:44] (03PS3) 10Michael DiPietro: create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) [19:26:51] hashar: that happens usually when a double redirect is not caught in the code [19:28:17] (03CR) 10jerkins-bot: [V: 04-1] create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [19:29:50] (03PS1) 10Ladsgroup: Set jQuery migrate to false for wikibooks and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721610 (https://phabricator.wikimedia.org/T280944) [19:34:07] Amir1 possibly. 
I filed it against wikidata-campsite but haven't marked it as a train blocker [19:34:12] seems it was a single Qxx item [19:34:51] thanks [19:35:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:36:54] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:41:14] backend save timing have regressed, but that is from Monday 09/13th so poredate the train [19:41:26] might be parsoid related. I notified #wikimedia-perf about it [19:41:43] twentyafterfour: looks like we can claim 1.37.0-wmf.23 to be fully deployed? [19:42:49] hashar: I think so [19:43:28] twentyafterfour: done! thank you for your assistance yesterday night ;-] [19:43:53] (and all the people that fixed or adjusted things here and there) [19:48:42] no prob! :) [19:55:28] Now we can deploy all the clean-up config patches. 
[19:55:54] (03PS3) 10Jforrester: Set wgProhibitedFileExtensions not wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721011 (https://phabricator.wikimedia.org/T290640) [19:55:59] (03PS2) 10Jforrester: Alter wgMimeTypeExclusions not wgMimeTypeBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721030 [19:56:17] (03PS2) 10Jforrester: Rename wmfFileBlacklist to wmgProhibitedFileExtensions part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721012 [19:56:22] (03PS2) 10Jforrester: Rename wmfFileBlacklist to wmgProhibitedFileExtensions part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721013 [19:56:27] (03PS2) 10Jforrester: Rename wmfFileBlacklist to wmgProhibitedFileExtensions part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721014 [19:56:43] (03PS4) 10Jforrester: Add new config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932) [19:57:01] Just a few minor patches. ;-) [20:11:16] (03PS1) 10Ahmon Dancy: role::releases: Include ::profile::kubernetes::deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) [20:15:00] (03Abandoned) 10Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) (owner: 10Umherirrender) [20:18:12] (03PS3) 10Jforrester: Disable LocalisationUpdate, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) [20:20:30] (03CR) 10Jforrester: "Nothing's blown up in nearly six months; let's finish this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [20:24:14] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 394 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:27:47] James_F: let me know once you're done, I have some one patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/721610/ :D [20:30:02] s/some // [20:30:28] (03PS2) 10Ahmon Dancy: role::releases: Include ::profile::kubernetes::deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) [20:30:34] (03PS2) 10SDineshKumar: Switch from cron to systemd timer for elasticsearch modules Test Results: https://phabricator.wikimedia.org/P17286 [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) [20:31:09] (03CR) 10Ebernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [20:31:54] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:32:11] (03CR) 10jerkins-bot: [V: 04-1] Switch from cron to systemd timer for elasticsearch modules Test Results: https://phabricator.wikimedia.org/P17286 [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [20:33:08] (03CR) 10Ladsgroup: [C: 04-1] Switch from cron to systemd timer for elasticsearch modules Test Results: https://phabricator.wikimedia.org/P17286 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 
(https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [20:41:02] Amir1: Oh, sorry, I wasn't deploying, just thinking about it. [20:41:18] oh okay :D [20:42:42] (03CR) 10Ladsgroup: [C: 03+2] Set jQuery migrate to false for wikibooks and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721610 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [20:43:33] (03PS2) 10Ladsgroup: Set jQuery migrate to false for wikibooks and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721610 (https://phabricator.wikimedia.org/T280944) [20:43:44] (03CR) 10Ladsgroup: [C: 03+2] Set jQuery migrate to false for wikibooks and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721610 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [20:46:40] (03PS3) 10SDineshKumar: Switch from cron to systemd timer for elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) [20:46:59] (03Merged) 10jenkins-bot: Set jQuery migrate to false for wikibooks and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721610 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [20:47:43] (03PS3) 10Ebernhardson: query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) [20:47:51] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:48:02] (03CR) 10jerkins-bot: [V: 04-1] query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:48:07] (03PS1) 10Legoktm: mediawiki: Remove lilypond [puppet] - 10https://gerrit.wikimedia.org/r/721618 [20:48:19] (03PS1) 10RobH: ganeti4004 setup [puppet] - 10https://gerrit.wikimedia.org/r/721619 
(https://phabricator.wikimedia.org/T289715) [20:48:28] (03PS2) 10Legoktm: mediawiki: Remove lilypond [puppet] - 10https://gerrit.wikimedia.org/r/721618 [20:49:08] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:721610|Set jQuery migrate to false for wikibooks and Commons (T280944)]] (duration: 00m 56s) [20:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:14] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [20:49:34] (03CR) 10SDineshKumar: Switch from cron to systemd timer for elasticsearch module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [20:50:20] (03PS3) 10Legoktm: mediawiki: Remove lilypond [puppet] - 10https://gerrit.wikimedia.org/r/721618 [20:51:02] (03CR) 10RobH: [C: 03+2] ganeti4004 setup [puppet] - 10https://gerrit.wikimedia.org/r/721619 (https://phabricator.wikimedia.org/T289715) (owner: 10RobH) [20:52:14] (03PS4) 10Ebernhardson: query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) [20:52:45] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) [20:53:36] (03CR) 10Ladsgroup: miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [20:55:21] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31109/console" [puppet] - 10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [20:58:32] (03CR) 10Ladsgroup: [C: 03+1] "To trigger jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [20:59:02] (03CR) 
10jerkins-bot: [V: 04-1] Switch from cron to systemd timer for elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [20:59:23] (03CR) 10Ladsgroup: [C: 03+1] Switch from cron to systemd timer for elasticsearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [21:00:20] (03PS1) 10Legoktm: Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721540 (https://phabricator.wikimedia.org/T291202) [21:00:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:35] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ganeti4004.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-rei... [21:04:40] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti4004.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['ganeti4004.ulsfo.wmnet'] ` [21:04:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ganeti4004.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-rei... 
[21:07:41] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [21:19:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4004.ulsfo.wmnet with reason: REIMAGE [21:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:22] (03CR) 10jerkins-bot: [V: 04-1] Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721540 (https://phabricator.wikimedia.org/T291202) (owner: 10Legoktm) [21:22:45] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti4004.ulsfo.wmnet with reason: REIMAGE [21:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:07] (03PS2) 10Legoktm: Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721540 (https://phabricator.wikimedia.org/T291202) [21:26:09] (03PS1) 10Legoktm: Revert "Drop i18n messages for removed token API" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721620 [21:31:43] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti4004.ulsfo.wmnet'] ` and were **ALL** successful. 
[21:34:22] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) [21:34:42] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) 05In progress→03Resolved [21:42:44] (03PS3) 10Ahmon Dancy: role::releases: Include ::profile::kubernetes::deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) [21:42:46] (03PS1) 10Ebernhardson: query_service: Repair non-scap deployment methods [puppet] - 10https://gerrit.wikimedia.org/r/721623 [21:43:31] (03CR) 10jerkins-bot: [V: 04-1] query_service: Repair non-scap deployment methods [puppet] - 10https://gerrit.wikimedia.org/r/721623 (owner: 10Ebernhardson) [21:43:39] (03PS2) 10Ebernhardson: query_service: Repair non-scap deployment methods [puppet] - 10https://gerrit.wikimedia.org/r/721623 [21:44:24] (03CR) 10jerkins-bot: [V: 04-1] query_service: Repair non-scap deployment methods [puppet] - 10https://gerrit.wikimedia.org/r/721623 (owner: 10Ebernhardson) [21:47:38] (03CR) 10Legoktm: [C: 03+2] Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721540 (https://phabricator.wikimedia.org/T291202) (owner: 10Legoktm) [21:47:41] (03CR) 10Legoktm: [C: 03+2] Revert "Drop i18n messages for removed token API" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721620 (owner: 10Legoktm) [21:47:56] (03PS4) 10Ahmon Dancy: DNM: role::releases: Add profiles needed for image testing [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) [21:48:23] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10WDoranWMF) @ldelench_wmf @NRodriguez 
awesome, we'll our best to get it done as soon as it is unbloc... [21:50:25] (03PS5) 10Ahmon Dancy: DNM: role::releases: Add profiles needed for image testing [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) [21:53:34] (03PS4) 10Ryan Kemper: wdqs: remove codfw hourly restarts [puppet] - 10https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) [21:54:20] (03PS3) 10Ebernhardson: query_service: Repair non-scap deployment methods [puppet] - 10https://gerrit.wikimedia.org/r/721623 [21:56:30] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721623 (owner: 10Ebernhardson) [22:02:07] (03CR) 10SDineshKumar: Switch from cron to systemd timer for elasticsearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:07:05] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/721624 [22:07:13] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Repair non-scap deployment methods [puppet] - 10https://gerrit.wikimedia.org/r/721623 (owner: 10Ebernhardson) [22:09:38] (03Merged) 10jenkins-bot: Revert "Drop i18n messages for removed token API" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721620 (owner: 10Legoktm) [22:10:37] (03CR) 10Ebernhardson: [C: 04-1] "This likely doesn't handle wcqs-beta properly yet. Needs more work." 
[puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [22:11:03] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/721625 [22:11:32] (03Merged) 10jenkins-bot: Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721540 (https://phabricator.wikimedia.org/T291202) (owner: 10Legoktm) [22:12:58] (03CR) 10SDineshKumar: Switch from cron to systemd timer for elasticsearch module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:14:43] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/721631 [22:16:56] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.23/includes/api/ApiTokens.php: Restore deprecated token APIs (1/3) (duration: 00m 56s) [22:16:59] (03CR) 10Ryan Kemper: Switch from cron to systemd timer for elasticsearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:01] (03CR) 10Ryan Kemper: Switch from cron to systemd timer for elasticsearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:19:06] (03PS6) 10Ahmon Dancy: role::releases: Add profiles needed for image testing [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) [22:19:48] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.23/autoload.php: Restore deprecated token APIs (2/3) (duration: 00m 56s) [22:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:57] (03CR) 10Ahmon Dancy: [C: 03+1] role::releases: Add profiles 
needed for image testing [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy) [22:21:16] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.23/includes/api/: Restore deprecated token APIs (3/3) (duration: 00m 56s) [22:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:39] it looks like there's no backport window today, just deployment training...but no one has signed up so I'm going to scap now [22:22:58] scap away [22:23:02] !log legoktm@deploy1002 Started scap: i18n for restoring deprecated token APIs [22:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:18] (03PS4) 10Dzahn: Switch from cron to systemd timer for elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:25:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:19] (03PS6) 10Dave Pifke: statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [22:28:27] (03CR) 10Dzahn: "uploaded PS4 to test where that syntax error mentioned above comes from" [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:29:10] (03CR) 10jerkins-bot: [V: 04-1] statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [22:29:21] (03CR) 10Dzahn: "@Dinesh I made the change to use $script for testing and for some reason it works for me, I can compile it and see no syntax error: https" [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:30:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:48] (03PS7) 10Dave Pifke: webperf: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [22:31:52] (03CR) 10Dzahn: [C: 03+1] Switch from cron to systemd timer for elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:32:14] (03CR) 10Huji: Temporarily disable anonymous editing on fawiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [22:32:55] (03PS3) 10Huji: Temporarily disable anonymous editing on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) [22:33:37] (03CR) 10Legoktm: "Do you want to ensure => absent it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/720974 (https://phabricator.wikimedia.org/T290759) (owner: 10Giuseppe Lavagetto) [22:34:26] (03Abandoned) 10Legoktm: mediawiki: Install firejail from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/616955 (https://phabricator.wikimedia.org/T179022) (owner: 10Legoktm) [22:36:00] (03CR) 10Ryan Kemper: Switch from cron to systemd timer for elasticsearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:37:54] (03CR) 10SDineshKumar: Switch from cron to systemd timer for elasticsearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:38:32] !log legoktm@deploy1002 Finished scap: i18n for restoring deprecated token APIs (duration: 15m 30s) [22:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:35] (03PS5) 10Ryan Kemper: elasticsearch: Switch from cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:42:44] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:42:54] 10SRE, 10Patch-For-Review: Backport firejail 0.9.52 for use on Wikimedia appservers - https://phabricator.wikimedia.org/T179022 (10Legoktm) 05Open→03Invalid Everything is on buster now. Plus we're moving away from firejail to Shellbox. 
[22:45:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:47:11] (03CR) 10Albertoleoncio: [C: 03+1] Temporarily disable anonymous editing on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [22:49:29] (03PS1) 10Legoktm: Add k8s users/tokens for shellbox-media [labs/private] - 10https://gerrit.wikimedia.org/r/721633 (https://phabricator.wikimedia.org/T289228) [22:53:38] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:55:00] (03PS8) 10Dave Pifke: webperf: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [22:55:26] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:55:39] (03CR) 10jerkins-bot: [V: 04-1] webperf: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [22:56:33] (03PS9) 10Dave Pifke: webperf: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [22:56:52] (03PS6) 10Ryan Kemper: elasticsearch: Switch from cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [22:57:35] (03PS10) 10Dave Pifke: webperf: connect to Kafka using TLS [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [22:58:16] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [23:00:05] brennen: My dear minions, it's time we take the moon! Just kidding. Time for US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210916T2300). 
[23:01:07] * thcipriani waves as fake brenne.n [23:02:54] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31119/console" [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [23:03:24] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s users/tokens for shellbox-media [labs/private] - 10https://gerrit.wikimedia.org/r/721633 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [23:05:39] (03PS1) 10Legoktm: Add tokens and users for shellbox-media service [puppet] - 10https://gerrit.wikimedia.org/r/721634 (https://phabricator.wikimedia.org/T289228) [23:06:53] (03PS1) 10Legoktm: Add namespace for shellbox-media service [deployment-charts] - 10https://gerrit.wikimedia.org/r/721635 (https://phabricator.wikimedia.org/T289228) [23:08:34] (03CR) 10Legoktm: [C: 03+2] Add tokens and users for shellbox-media service [puppet] - 10https://gerrit.wikimedia.org/r/721634 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [23:09:04] (03PS2) 10Legoktm: Add namespace for shellbox-media service [deployment-charts] - 10https://gerrit.wikimedia.org/r/721635 (https://phabricator.wikimedia.org/T289228) [23:09:12] (03CR) 10Legoktm: [C: 03+2] Add namespace for shellbox-media service [deployment-charts] - 10https://gerrit.wikimedia.org/r/721635 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [23:13:29] (03Merged) 10jenkins-bot: Add namespace for shellbox-media service [deployment-charts] - 10https://gerrit.wikimedia.org/r/721635 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [23:16:08] !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[23:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:27] 10SRE, 10NavigationTiming, 10Performance-Team, 10Patch-For-Review: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (10dpifke) Sigh. TLS isn't enabled for jumbo Kafka in the deployment-prep cluster (unlike jumbo Kafka in production). It's really frustrating... [23:17:22] !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [23:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:47] !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [23:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:20] !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [23:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:47] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [23:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:55] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [23:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:10] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [23:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:42] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [23:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:59] (03PS1) 10Legoktm: helmfile.d: Add shellbox-media [deployment-charts] - 10https://gerrit.wikimedia.org/r/721637 (https://phabricator.wikimedia.org/T289228) [23:28:45] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10ssastry) Good to go! 
[23:34:52] (03PS7) 10Ryan Kemper: elasticsearch: Switch from cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [23:36:49] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: Switch from cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: 10SDineshKumar) [23:37:56] !log T273673 Disabling puppet on elasticsearch hosts `sudo cumin 'R:Class = elasticsearch::log::hot_threads' 'sudo disable-puppet "https://gerrit.wikimedia.org/r/c/operations/puppet/+/721413 - T273673"'` [23:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:03] T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 [23:39:09] !log T273673 Testing elasticsearch cron->systemd timer-job changes on canary instance `ryankemper@elastic1064:~$ sudo run-puppet-agent --force` [23:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:19] !log T273673 The associated crons are gone and I see the new systemd timers for both gc-cleanup and the hot threads logger [23:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:26] T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 [23:51:21] !log T273673 All looks good, re-enabling puppet and running on rest of fleet: `sudo cumin 'R:Class = elasticsearch::log::hot_threads' 'sudo run-puppet-agent --force'` [23:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:27] T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 [23:52:01] (03CR) 10Legoktm: [C: 03+2] helmfile.d: Add shellbox-media [deployment-charts] - 10https://gerrit.wikimedia.org/r/721637 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [23:56:45] (03Merged) 10jenkins-bot: 
helmfile.d: Add shellbox-media [deployment-charts] - 10https://gerrit.wikimedia.org/r/721637 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [23:58:13] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [23:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log