[00:09:31] urbanecm: Is something left to do?
[00:09:47] i don't think so
[00:11:05] urbanecm: I can't believe how many broken pages we have.
[00:11:15] But okay, I'm near the end with deleting.
[00:11:23] it's just conflicts
[00:11:41] it was two pages, one that was in NS_PROJECT and second one in main NS
[00:11:48] with the alias, those two pages became one
[00:14:58] Okay. :)
[00:19:03] Everything should have been deleted, thanks for the help, I got really tired, because I couldn't do this as a bot, because he didn't want to work.
[00:19:09] Good night! Sleep well.
[00:39:15] (CR) Zoranzoki21: [C: -1] Use Wikimania's logo in a new vector (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[00:40:49] SRE, serviceops, Developer Productivity, Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle)
[00:41:46] SRE, serviceops, Developer Productivity, Performance-Team (Radar), and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (Krinkle)
[00:57:12] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:27:09] (PS5) Juan90264: Use Wikimania's logo in a new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405)
[01:28:26] (CR) Juan90264: "@Zoranzoki21 Thank you, now resolved!" [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[01:29:36] (PS7) Juan90264: Adding square logo and wordmark for Wikimania [mediawiki-config] - https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405)
[01:29:57] (PS6) Juan90264: Use Wikimania's logo in a new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405)
[01:31:14] (CR) Zoranzoki21: "> Patch Set 5:" [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[01:31:18] (CR) Zoranzoki21: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[01:32:26] (CR) jerkins-bot: [V: -1] Use Wikimania's logo in a new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[01:32:40] (PS7) Zoranzoki21: Use Wikimania's logo in a new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[01:33:04] (CR) Zoranzoki21: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[02:00:05] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T0200)
[02:06:52] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.14 [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704204
[02:06:54] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.14 [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704204 (owner: TrainBranchBot)
[02:25:26] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.14 [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704204 (owner: TrainBranchBot)
[02:58:18] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:09:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:53:10] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.76`. Pre-deploy tests passing on canary `wdqs1003`
[03:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:17] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@36f74b3]: 0.3.76
[03:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:13] !log [WDQS Deploy] Tests passing following deploy of `0.3.76` on canary `wdqs1003`; proceeding to rest of fleet
[03:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:39] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@36f74b3]: 0.3.76 (duration: 02m 22s)
[03:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:46] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@36f74b3]: 0.3.76
[03:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:00:02] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:05] SRE, Wikimedia-Mailing-lists: Make auditing members of mailing lists bound
to a user right easier - https://phabricator.wikimedia.org/T286122 (Risker) I think this is a solution in search of a problem. As a listadmin for checkuser-l, I can tell you that we periodically audit the list and haven't ever ha...
[04:05:15] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@36f74b3]: 0.3.76 (duration: 08m 28s)
[04:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[04:09:36] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[04:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:09:44] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[04:09:47] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[04:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:30] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: Juan90264)
[04:28:26] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: Juan90264)
[04:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:32:10] (PS5) Juan90264: Adding e use square wordmark for trwikiquote in a new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133)
[05:34:53] (CR) Effie Mouzeli: "> Patch Set 22:" [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[05:38:13] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: Juan90264)
[05:41:47] (PS6) Juan90264: Adding e use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133)
[06:01:21] (CR) Juan90264: "> Patch Set 6:" [mediawiki-config] - https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[06:06:39] !log pool mw2383 - T286463
[06:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:47] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463
[06:29:11] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Skierpage) I tried out the long-awaited return of Score on testwiki and it seems good. It's... //surprising/...
[06:48:22] (CR) Muehlenhoff: "Patch has been updated to reflect the latest set of servers." [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[06:53:05] !log systemctl reset-failed ifup@ens5 on gitlab2001 - T273026
[06:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:12] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[07:04:29] (CR) Elukey: "Hello folks, this is causing puppet failures on thanos-fe hosts due to this bit of code in profile::thanos::query:" [puppet] - https://gerrit.wikimedia.org/r/703740 (owner: Jbond)
[07:06:51] !log installing apache security updates on codfw mw* hosts
[07:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:08] elukey: thank you, I'll take a look
[07:25:01] jbond: thoughts on ^ https://gerrit.wikimedia.org/r/703740 ?
[07:25:17] (CR) Hashar: [C: +2] [WMF] fork gitiles to prevent loading fonts from 3rd party [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/700932 (https://phabricator.wikimedia.org/T240264) (owner: Hashar)
[07:29:32] (CR) Effie Mouzeli: [C: +1] Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - https://gerrit.wikimedia.org/r/693411 (owner: DCausse)
[07:30:05] godog: <3
[07:32:02] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (fgiunchedi)
[07:32:04] (PS3) Jelto: role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309)
[07:33:50] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (fgiunchedi)
[07:34:23] (Merged) jenkins-bot: [WMF] fork gitiles to prevent loading fonts from 3rd party
[software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/700932 (https://phabricator.wikimedia.org/T240264) (owner: Hashar)
[07:35:29] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (fgiunchedi)
[07:37:50] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[07:39:40] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (fgiunchedi)
[07:40:11] (PS4) Volans: logoutd: add support for Python 3.5 [puppet] - https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242)
[07:41:34] (CR) Elukey: [C: +1] "LGTM Thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/702883 (owner: Volans)
[07:41:42] (Abandoned) Volans: Revert "Revert "Depool eqsin"" [dns] - https://gerrit.wikimedia.org/r/700027 (owner: Volans)
[07:42:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: cloudmetrics1001, cloudmetrics1002, dragonfly-supernode1001, labstore1006, registry1004, thanos-be1003, thanos-fe1001, thanos-fe1002, thanos-fe1003, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:42:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: cloudmetrics1001, cloudmetrics1002, dragonfly-supernode1001, labstore1006, registry1004, thanos-be1003, thanos-fe1001, thanos-fe1002, thanos-fe1003, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:42:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: cloudmetrics1001, cloudmetrics1002, dragonfly-supernode1001, labstore1006, registry1004, thanos-be1003, thanos-fe1001, thanos-fe1002, thanos-fe1003, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:46:08] (CR) Volans: [C: +2] logoutd: add support for Python 3.5 [puppet] - https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: Volans)
[07:48:05] (PS1) Majavah: toolforge::prometheus: Update PAWS ingress target [puppet] - https://gerrit.wikimedia.org/r/704277 (https://phabricator.wikimedia.org/T264221)
[07:49:27] (PS2) Hashar: Upgrade Gerrit to v3.2.11 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/699038 (https://phabricator.wikimedia.org/T278990) (owner: Ahmon Dancy)
[07:50:48] (CR) Martaannaj: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: Martaannaj)
[07:51:37] (CR) Volans: [C: +2] Use IcingaHosts instead of Icinga (analytics) [cookbooks] - https://gerrit.wikimedia.org/r/702883 (owner: Volans)
[07:55:39] (Merged) jenkins-bot: Use IcingaHosts instead of Icinga (analytics) [cookbooks] - https://gerrit.wikimedia.org/r/702883 (owner: Volans)
[07:57:30] (PS1) Majavah: Renew PAWS Prometheus certificates [puppet] - https://gerrit.wikimedia.org/r/704279
[07:58:23] (CR) Majavah: "Private keys already updated on the affected puppet masters." [puppet] - https://gerrit.wikimedia.org/r/704279 (owner: Majavah)
[08:01:16] (PS3) Volans: Use IcingaHosts instead of Icinga (various) [cookbooks] - https://gerrit.wikimedia.org/r/702885
[08:02:46] !log upgrade bullseye pilot installs to latest state of bullseye
[08:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:59] (CR) Volans: [C: +2] Use IcingaHosts instead of Icinga (various) [cookbooks] - https://gerrit.wikimedia.org/r/702885 (owner: Volans)
[08:13:37] (Merged) jenkins-bot: Use IcingaHosts instead of Icinga (various) [cookbooks] - https://gerrit.wikimedia.org/r/702885 (owner: Volans)
[08:18:24] (PS5) Volans: Use IcingaHosts instead of Icinga (generic) [cookbooks] - https://gerrit.wikimedia.org/r/702886
[08:20:38] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:40] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:27:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[08:28:18] (PS1) Hashar: Merge tag 'v3.2.11' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/704282 (https://phabricator.wikimedia.org/T278990)
[08:28:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:10] (CR) Hashar: [C: +2] "That updates our wmf/stable-3.2 fork branch to 3.2.11, it is only used to build the plugins.
Gerrit core and its plugin is updated via h" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/704282 (https://phabricator.wikimedia.org/T278990) (owner: Hashar)
[08:31:34] (CR) Hashar: [V: +2 C: +2] Upgrade Gerrit to v3.2.11 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/699038 (https://phabricator.wikimedia.org/T278990) (owner: Ahmon Dancy)
[08:32:14] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:32:46] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org
[08:38:52] (Merged) jenkins-bot: Merge tag 'v3.2.11' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/704282 (https://phabricator.wikimedia.org/T278990) (owner: Hashar)
[08:39:54] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:39:58] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:40:19] (CR) Volans: [C: +2] Use IcingaHosts instead of Icinga (generic) [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[08:42:46] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org
[08:43:04] (Merged) jenkins-bot: Use IcingaHosts instead of Icinga (generic) [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[08:45:09] !log depool mw2383 - T286463
[08:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:15] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463
[08:46:31] (PS3) Volans: Use IcingaHosts instead of Icinga (search) [cookbooks] - https://gerrit.wikimedia.org/r/702884
[08:47:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[08:48:30] (PS1) Dzahn: DHCP: remove mw1269 through mw1301 [puppet] - https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203)
[08:49:20] ^^^ above alerts relate to Telia E-LINE service from cr1-codfw to cr3-eqsin
[08:49:29] 20 minute outage in all
[08:49:45] was that a planned maintenance?
[08:49:55] Jul 13 08:16:31 - Jul 13 08:37:20 UTC
[08:50:49] Just checked there yeah, we got an emergency maintenance notice from them.
[08:51:13] Their window ends at 10:00 UTC so potentially we are still at risk.
[08:52:22] ack, so nothing worrying so far :)
[08:54:17] (CR) JMeybohm: [C: +1] Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - https://gerrit.wikimedia.org/r/693411 (owner: DCausse)
[08:56:52] (PS1) Dzahn: site/conftool: remove mw1281 through mw1283 [puppet] - https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203)
[08:59:29] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on sretest1001.eqiad.wmnet with reason: testing the cookbook
[08:59:30] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on sretest1001.eqiad.wmnet with reason: testing the cookbook
[08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:10] (PS1) Dzahn: logstash/tests: replace mw1281 with mw1391 [puppet] - https://gerrit.wikimedia.org/r/704286 (https://phabricator.wikimedia.org/T280203)
[09:00:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T277116
[09:00:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T277116
[09:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:16] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116
[09:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:31] (CR) David Caro: [C: +2] wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:03:42] (CR) David Caro: [C: +2] wmcs.start_instance_with_prefix: allow passing the affinity [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:03:46] (CR) David Caro: [C: +2] wmcs.OpenstackApi: allow soft affinities to be specified [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702087 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:03:50] (CR) David Caro: [C: +2] wmcs: quote some parameters to openstack [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702086 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:03:52] (CR) David Caro: [C: +2] wmcs: namespace exceptions [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702085 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:03:57] (CR) David Caro: [C: +2] wmcs: add default control node to openstack api [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702084 (owner: David Caro)
[09:03:59] (PS1) Dzahn: site/conftool: remove mw1276 through mw1279 [puppet] - https://gerrit.wikimedia.org/r/704287 (https://phabricator.wikimedia.org/T280203)
[09:04:01] (CR) David Caro: [C: +2] wmcs: ran black and isort [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702083 (owner: David Caro)
[09:04:03] (CR) David Caro: [C: +2] wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap [cookbooks] (wmcs) -
https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:04:28] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:36] (Merged) jenkins-bot: wmcs.vps.refresh_puppet_certs: better handle puppetmaster swap [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:07:38] (Merged) jenkins-bot: wmcs: ran black and isort [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702083 (owner: David Caro)
[09:07:40] (Merged) jenkins-bot: wmcs: add default control node to openstack api [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702084 (owner: David Caro)
[09:07:42] (Merged) jenkins-bot: wmcs: namespace exceptions [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702085 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:07:55] (Merged) jenkins-bot: wmcs: quote some parameters to openstack [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702086 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:07:57] (Merged) jenkins-bot: wmcs.OpenstackApi: allow soft affinities to be specified [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702087 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:08:03] (Merged) jenkins-bot: wmcs.start_instance_with_prefix: allow passing the affinity [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/702088 (https://phabricator.wikimedia.org/T274498) (owner: David Caro)
[09:14:11] (PS1) Dzahn: conftool: remove mw1300, mw1301 [puppet] - https://gerrit.wikimedia.org/r/704289 (https://phabricator.wikimedia.org/T280203)
[09:14:48] SRE, ops-eqiad, DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (jcrespo) >>! In T277327#7207061, @RobH wrote: > These failed for not liking the specified partition recipie, which was set by someone else, so I need to investigate whats up....
[09:15:12] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[09:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:44] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[09:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:57] (PS1) Dzahn: DHCP: let mwmaint1002 use buster installer [puppet] - https://gerrit.wikimedia.org/r/704290 (https://phabricator.wikimedia.org/T267607)
[09:21:42] (CR) Dzahn: [C: +2] DHCP: let mwmaint1002 use buster installer [puppet] - https://gerrit.wikimedia.org/r/704290 (https://phabricator.wikimedia.org/T267607) (owner: Dzahn)
[09:25:23] (CR) Effie Mouzeli: [C: +1] role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: Jelto)
[09:25:34] (CR) Effie Mouzeli: [C: +2] rdf-streaming-updater: switch to H/A session-cluster [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[09:25:57] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:26:57] (PS4) Muehlenhoff: Deploy systemd-login logout.d script fleet-wide [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242)
[09:28:02] (Merged) jenkins-bot: rdf-streaming-updater: switch to H/A session-cluster [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[09:31:13] (CR) Effie Mouzeli: [C: +2] Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - https://gerrit.wikimedia.org/r/693411 (owner: DCausse)
[09:31:47] !log depool mw2383 T286463
[09:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:54] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463
[09:33:52] (Merged) jenkins-bot: Rename chart rdf-streaming-updater as flink-session-cluster [deployment-charts] - https://gerrit.wikimedia.org/r/693411 (owner: DCausse)
[09:33:54] (PS3) Hashar: Update plugins for Gerrit 3.2.11 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: Ahmon Dancy)
[09:34:33] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) @Dwisehaupt I have to apologize for a newbie error I made here. The FR-Tech server's interfaces don't show what they're connected to in Netb...
[09:37:31] (PS1) Dzahn: switch mwmaint.discovery (noc.wm.org backend) from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/704293 (https://phabricator.wikimedia.org/T267607)
[09:38:21] (PS1) Btullis: Update AQS Roll Restart cookbook to use new style [cookbooks] - https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925)
[09:38:27] !log installing apache security updates on parsoid hosts
[09:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:09] SRE, ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (jijiki) I depooled mw2383 as it was behaving as before. @RobH is it possible to run hardware tests on the host? I suspect it is might be a hardware issue. Thank you!
[09:39:34] SRE, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (fgiunchedi) I think I was able to mitigate the problem by using a `part_size: 32mb` setting for multi-part uploads from the compactor. This will create more...
[09:40:04] !log installing apache security updates on thanos-fe hosts
[09:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:43] (PS1) Hnowlan: maps: disable OSM sync and tilerator in codfw [puppet] - https://gerrit.wikimedia.org/r/704296 (https://phabricator.wikimedia.org/T269582)
[09:41:08] (CR) jerkins-bot: [V: -1] Update AQS Roll Restart cookbook to use new style [cookbooks] - https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[09:43:46] (CR) Volans: "Thanks for migrating this to the new API! Couple of additional nits inline in addition to what CI is complaining about." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[09:46:03] Thanks Volans. Will update.
[09:46:14] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Volans)
[09:46:26] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:46:54] btullis: anytime, feel free to ping me directly if you have any question. There is also an onboarding chat about our automation tools ;)
[09:47:59] (PS1) Dzahn: httpbb: add tests for noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607)
[09:50:19] volans: Great, thanks. Have found the onboarding chats folder now. Will check it out :-)
[09:50:53] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Volans)
[09:52:25] btullis: should be listed in your onboarding checklist somewhere
[09:52:38] if not we might have missed some step
[09:53:17] (CR) Elukey: Update AQS Roll Restart cookbook to use new style (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[09:53:46] (PS2) Dzahn: httpbb: add tests for noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607)
[09:54:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:56:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:57:07] volans: Not as far as I can see. The nearest thing I can see on my checklist is a link to here: https://wikitech.wikimedia.org/wiki/SRE_tooling but no mention of the shared drive full of chats.
[09:58:08] !log upgrading PHP/Apache on matomo1002 (piwik.wikimedia.org) [09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:01] (03CR) 10Jgiannelos: [C: 03+1] maps: disable OSM sync and tilerator in codfw [puppet] - 10https://gerrit.wikimedia.org/r/704296 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [10:00:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:54] (03PS4) 10Labdajiwa: Change category name of Babel extension on Javanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702961 (https://phabricator.wikimedia.org/T286165) [10:01:27] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Volans) [10:03:59] (03CR) 10Hashar: "I have locally followed the instructions https://wikitech.wikimedia.org/wiki/Gerrit/Upgrade#Deploying" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [10:05:22] (03PS1) 10Dzahn: mediawiki::maintenance: open ferm hole for deployment servers to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/704300 (https://phabricator.wikimedia.org/T267607) [10:05:52] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::maintenance: open ferm hole for deployment servers to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/704300 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:06:21] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10LSobanski) @wiki_willy we're good to move ahead with this one. The host needs to be downtimed and shut down in advance, what (EU friendly) time would work for you? [10:06:39] Ha! Love the 'Better call volans!' 
graphic https://jynus.com/better-call-volans.jpg [10:06:46] +1 [10:07:55] * volans hides [10:08:14] (03PS2) 10Dzahn: mediawiki::maintenance: open ferm hole for deployment servers to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/704300 (https://phabricator.wikimedia.org/T267607) [10:08:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30184/console" [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [10:09:44] (03CR) 10Jbond: [V: 03+1] "showing as a noop on codwfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [10:10:55] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/30185/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/704300 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:11:04] (03PS8) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [10:11:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:11:38] that's me ^ [10:12:12] bad godog [10:12:18] no biscuit [10:12:37] (03CR) 10DCausse: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/693416 (owner: 10DCausse) [10:12:44] who will be a good boy then, though ? [10:13:10] I've only recently catched up on #doggos on slack heh [10:13:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:14:58] godoggoinc.com [10:15:35] (03PS1) 10Jbond: P:thanos::query: drop site parameter from prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/704301 [10:15:40] haha! [10:16:40] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:17:08] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [10:17:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30187/console" [puppet] - 10https://gerrit.wikimedia.org/r/704301 (owner: 10Jbond) [10:17:32] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [10:18:10] !log installing apache security updates on Logstash hosts [10:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:13] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [10:19:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] "will go ahead and merge this straight away as it's currently causing issues" [puppet] - 10https://gerrit.wikimedia.org/r/704301 (owner: 10Jbond) [10:20:16] jbond: err, no that's going to cause more issues, we need the site [10:20:21] please revert [10:20:27] oh sorry reverting [10:20:44] (03PS1) 10Jbond: Revert "P:thanos::query: drop site parameter from prometheus::resource_config" [puppet] - 10https://gerrit.wikimedia.org/r/704175 [10:20:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:thanos::query: drop site parameter from 
prometheus::resource_config" [puppet] - 10https://gerrit.wikimedia.org/r/704175 (owner: 10Jbond) [10:21:00] TYVM jbond [10:21:35] (03CR) 10David Caro: [C: 03+1] "Verified by doing the call:" [puppet] - 10https://gerrit.wikimedia.org/r/704279 (owner: 10Majavah) [10:21:35] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Deploying schema change to s1 T277116 [10:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:42] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [10:21:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Deploying schema change to s1 T277116 [10:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:25] (03CR) 10David Caro: [C: 03+1] "Verified by doing the call:" [puppet] - 10https://gerrit.wikimedia.org/r/704277 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [10:22:58] jbond: yeah site is needed in resource_config for sure, perhaps it could default to ::site ? 
I'm open to other solutions too but thanos definitely needs to reach out to all sites [10:23:21] godog: yes will send a follow up patch now, just manually rolling back the changes puppet did [10:23:53] jbond: ok SGTM, thanos-fe2001 has puppet disabled on purpose and can be left alone [10:23:56] (03PS1) 10Ammarpad: Remove obsolote $wgShowDBErrorBacktrace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704304 [10:24:01] ack [10:25:02] (03PS1) 10Hashar: Update our plugins for 3.2.11 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/704305 (https://phabricator.wikimedia.org/T278990) [10:26:44] PROBLEM - tilerator on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [10:26:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:26:56] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:05] !log installing apache security updates on alert1001 (icinga.wikimedia.org) [10:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:49] (03PS2) 10Ammarpad: Remove obsolete $wgShowDBErrorBacktrace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704304 [10:30:18] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:49] (03PS1) 10Jbond: R:prometheus::resource_config: reintroduce site parameter [puppet] - 10https://gerrit.wikimedia.org/r/704307 [10:31:20] (03CR) 10jerkins-bot: [V: 04-1] R:prometheus::resource_config: 
reintroduce site parameter [puppet] - 10https://gerrit.wikimedia.org/r/704307 (owner: 10Jbond) [10:32:14] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:42] (03PS2) 10Jbond: R:prometheus::resource_config: reintroduce site parameter [puppet] - 10https://gerrit.wikimedia.org/r/704307 [10:34:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:35:10] (03CR) 10Hashar: "For gitiles, it is forked on our Gerrit and I did not update our fork from upstream. I just git push --tags to get v3.2.11 and thus have t" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [10:36:50] (03CR) 10Hashar: [C: 03+2] "I have confirmed gitiles.jar has a META-INF/MANIFEST.MF with:" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/704305 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [10:39:10] !log running `nodetool decommission` on maps2008 [10:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:27] (03PS3) 10Jbond: R:prometheus::resource_config: reintroduce site parameter [puppet] - 10https://gerrit.wikimedia.org/r/704307 [10:39:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2008.codfw.wmnet with reason: reimaging as buster replica [10:39:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2008.codfw.wmnet with reason: reimaging as buster replica [10:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] mediawiki::maintenance: open ferm hole for deployment servers to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/704300 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:42:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30190/console" [puppet] - 10https://gerrit.wikimedia.org/r/704307 (owner: 10Jbond) [10:43:02] (03CR) 10Dzahn: "after merge, httpb tests work:" [puppet] - 10https://gerrit.wikimedia.org/r/704300 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:43:18] (03CR) 10Dzahn: "after opening ferm hole:" [puppet] - 10https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:44:02] (03Merged) 10jenkins-bot: Update our plugins for 3.2.11 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/704305 (https://phabricator.wikimedia.org/T278990) (owner: 10Hashar) [10:44:29] !log upgrading apache on phab1001 (phabricator.wikimedia.org) [10:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:57] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:45:04] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:47:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you !" 
[puppet] - 10https://gerrit.wikimedia.org/r/704307 (owner: 10Jbond) [10:48:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] "np" [puppet] - 10https://gerrit.wikimedia.org/r/704307 (owner: 10Jbond) [10:51:58] (03CR) 10Dzahn: "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/noc/* --hosts mwmaint1002.eqiad.wmnet,mwmaint2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:52:53] (03CR) 10Dzahn: [V: 03+1] "freshly added tests for httpbb (I7ce678cc513bdc23) and they pass for both hosts:" [dns] - 10https://gerrit.wikimedia.org/r/704293 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [10:54:37] (03CR) 10Muehlenhoff: "Looks good, few nits inline." (035 comments) [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [10:54:52] !log switching https://noc.wikimedia.org backend from eqiad to codfw for mwmaint1002 OS upgrade, not affecting config-master/pybal, tests passed (T267607) [10:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:58] T267607: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 [10:55:25] (03CR) 10Dzahn: [V: 03+1 C: 03+2] switch mwmaint.discovery (noc.wm.org backend) from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/704293 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T1100). [11:00:04] Ammar: A patch you scheduled for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:28] o/ [11:01:00] looking at the patch… [11:01:50] (03CR) 10Hnowlan: [C: 03+2] maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [11:02:19] Ammar: are you around? [11:02:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove obsolete $wgShowDBErrorBacktrace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704304 (owner: 10Ammarpad) [11:02:59] (03PS4) 10Hashar: Update plugins for Gerrit 3.2.11 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [11:03:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) @jijiki thanks yes good suggestions both. I will send a mail to ops@ later today as a reminder for people to review. In terms of main... [11:05:20] (03CR) 10Hashar: "On Archiva I have manually deleted the 3.2.11 artifacts for gitiles, go-import, healthcheck and metrics-reporter-jmx and reuploaded them:" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [11:12:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [11:12:59] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) 05Stalled→03Open Yes, that's correct. We are reimaging eqiad first. Just switched noc.wikimedia.org backend to codfw to avoid any downtime of that. mwmaint2002 will be done o... 
[11:13:28] Ammar: if you’re not around then the patch won’t be deployed… [11:13:52] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mwmaint1002.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202107... [11:13:55] !log mwmaint1002 - reimaging with buster (T267607) [11:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:01] T267607: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 [11:15:20] Lucas_WMDE I am around [11:15:24] ok! [11:15:32] (03PS13) 10Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 [11:16:24] * Lucas_WMDE looks up what the current mwmaint server is [11:16:33] Lucas_WMDE: mwmaint2002 [11:16:58] ah, right [11:17:11] I had already updated my local `backport` function anyways, I was just running it from a shell that had the old version :facepalm: [11:17:12] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff) [11:17:13] I am upgrading the eqiad one right now, but it's not used for anything [11:17:23] fortunately there was a large banner on 1002 yelling at me :) [11:17:33] so it's down for a few minutes, but besides that it will show a MOTD that says to not use it [11:17:36] yea, that [11:17:49] I must’ve SSHed in just before it went down [11:17:55] also noc.wikimedia.org is served from codfw right now [11:18:00] that webserver is also on mwmaint [11:18:18] yea, good timing, the cookbook waited 3 min and powercycled it just now :) [11:18:40] ^^ [11:19:14] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 
(10MoritzMuehlenhoff) [11:19:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove obsolete $wgShowDBErrorBacktrace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704304 (owner: 10Ammarpad) [11:20:37] (03Merged) 10jenkins-bot: Remove obsolete $wgShowDBErrorBacktrace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704304 (owner: 10Ammarpad) [11:20:48] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10MoritzMuehlenhoff) [11:21:00] (03PS3) 10Jbond: statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) [11:21:02] (03PS1) 10Jbond: setup.py: create an entry point so we have an executable script [software/statograph] - 10https://gerrit.wikimedia.org/r/704314 (https://phabricator.wikimedia.org/T285569) [11:21:12] Ammar: the change is on mwdebug2001, feel free to test it there if you want [11:21:15] (though I imagine there’s not much to test ^^) [11:21:20] I’m also checking that nothing explodes [11:21:42] (03PS2) 10Btullis: Update AQS Roll Restart cookbook to use new style [cookbooks] - 10https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) [11:21:52] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [11:23:03] Lucas_WMDE Indeed there's nothing to test, it's apparently not used in code though [11:23:59] ok, syncing [11:24:24] (03PS3) 10Cathal Mooney: Adding 'quality-of-service' template for use on QFX/EX series switches. 
[homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) [11:25:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:704304|Remove obsolete $wgShowDBErrorBacktrace config]] (duration: 01m 25s) [11:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:00] mutante: scap printed a “remote host identification has changed” for mwmaint1002, I assume that’s expected [11:27:16] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff) [11:27:20] probably `scap pull` on that host when it’s finished reimaging? [11:27:32] (03PS3) 10Btullis: Update AQS Roll Restart cookbook to use new style [cookbooks] - 10https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) [11:28:22] !log EU backport+config window done [11:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:21] Lucas_WMDE: yes, that is expected, hoping it does not affect you besides the warning [11:30:29] nope, I’m fine [11:30:38] cool, yes, will scap pull once it's done [11:30:42] ok thanks! 
[11:31:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint1002.eqiad.wmnet with reason: REIMAGE [11:31:04] currently "first puppet run" and that takes longer than the OS install itself [11:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwmaint1002.eqiad.wmnet with reason: REIMAGE [11:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:51] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) [11:42:23] (03CR) 10Jbond: "Thanks updated unfortunately this is targeted for buster not bullseye (i had the wrong entry in changelog)" (035 comments) [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [11:45:30] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Legoktm) [11:48:48] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:50:17] (03CR) 10Jbond: sre.idm.logout: create cookbook to logout users (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701498 (owner: 10Jbond) [11:51:38] (03PS1) 10Jbond: sre.idm.cookbooks: fix typo uid -> cn [cookbooks] - 10https://gerrit.wikimedia.org/r/704317 [11:52:18] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Kormat) [11:52:39] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - 
https://phabricator.wikimedia.org/T279309 (10Dzahn) [11:52:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] sre.idm.cookbooks: fix typo uid -> cn [cookbooks] - 10https://gerrit.wikimedia.org/r/704317 (owner: 10Jbond) [11:52:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) [11:52:52] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) 05Stalled→03Open [11:52:53] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for migrating it to the new API." [cookbooks] - 10https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [11:54:40] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Legoktm) [11:54:51] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Legoktm) [11:55:48] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Internal server error (with ugly html tags) when changing Autoresponse postings text - https://phabricator.wikimedia.org/T286269 (10Legoktm) a:03Legoktm [11:57:50] (03CR) 10Jelto: "I have some concerns if removing 30 machines at once may be a little bit too much. 
If we decommission all machines at once we may need som" [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [11:58:04] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Make profile and module ensureable [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) [11:58:37] (03PS4) 10Btullis: Update AQS Roll Restart cookbook to use new style [cookbooks] - 10https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) [11:59:24] (03CR) 10jerkins-bot: [V: 04-1] dragonfly::dfdaemon: Make profile and module ensureable [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:01:05] (03PS1) 10Dzahn: site/conftool: turn mw1422 into an mw appserver [puppet] - 10https://gerrit.wikimedia.org/r/704319 (https://phabricator.wikimedia.org/T279309) [12:01:12] (03CR) 10Jgiannelos: [C: 03+1] maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 (owner: 10Hnowlan) [12:02:12] (03PS2) 10JMeybohm: dragonfly::dfdaemon: Make profile and module ensureable [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) [12:02:17] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mwmaint1002.eqiad.wmnet'] ` and were **ALL** successful. 
[12:02:27] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:02:52] (03CR) 10Btullis: [C: 03+2] Update AQS Roll Restart cookbook to use new style (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [12:03:51] (03CR) 10JMeybohm: dragonfly::dfdaemon: Make profile and module ensureable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:06:24] (03PS2) 10Dzahn: DHCP: remove mw1269 through mw1284 [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) [12:07:24] (03Merged) 10jenkins-bot: Update AQS Roll Restart cookbook to use new style [cookbooks] - 10https://gerrit.wikimedia.org/r/704294 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [12:07:55] (03CR) 10Hashar: [C: 03+2] Update plugins for Gerrit 3.2.11 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [12:08:04] (03Merged) 10jenkins-bot: Update plugins for Gerrit 3.2.11 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [12:08:17] (03CR) 10Dzahn: "just 15 hosts now" [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:09:50] (03CR) 10Jelto: DHCP: remove mw1269 through mw1284 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:10:23] (03CR) 10Dzahn: "it doesn't mean they have to be decom'ed immediately. 
it just means they can't be reimaged from now until they are gone anyways" [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:12:57] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) ` [mwmaint1002:~] $ lsb_release -c Codename: buster ` mwmaint1002 is on buster now. puppet runs without errors or warnings. https://noc.wikimedia.org is hosted by mwmaint2002 in... [12:13:03] 10SRE, 10SRE Observability, 10Traffic, 10Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (10ema) [12:17:09] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [12:20:20] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10LSobanski) p:05Triage→03Medium Let us know when full tested and ready to go. Preferably after the DC switch back as our schedule is p... [12:20:26] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) @Legoktm done ^ the noc site is now hosted in codfw (leaving it like that until we switch back, right?). and mwmaint1002 is now on buster and puppet did not show any issues. it ha... [12:20:48] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) Also we have this now which shows the noc site works on both hosts also after reimage: ` [deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/noc/* --hosts mwmaint1002.eqiad.wmne... 
[12:20:53] !log mwmaint1002 - scap pull after reimaging [12:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:07] (03CR) 10Dzahn: [C: 03+1] role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [12:23:22] (03PS3) 10JMeybohm: dragonfly::dfdaemon: Make profile and module ensureable [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) [12:24:46] (03CR) 10Dzahn: DHCP: remove mw1269 through mw1284 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:25:25] (03PS3) 10Dzahn: DHCP: remove mw1269 through mw1284 [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) [12:26:22] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:27:10] (03CR) 10Dzahn: [C: 03+2] "thanks for review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/704283 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:27:46] (03PS1) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [12:29:16] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:31:01] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash/tests: replace mw1281 with mw1391 [puppet] - 10https://gerrit.wikimedia.org/r/704286 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:32:03] (03CR) 10Dzahn: [C: 03+2] logstash/tests: replace mw1281 with mw1391 [puppet] - 10https://gerrit.wikimedia.org/r/704286 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:32:08] (03PS2) 10Dzahn: logstash/tests: replace mw1281 with mw1391 [puppet] - 10https://gerrit.wikimedia.org/r/704286 (https://phabricator.wikimedia.org/T280203) [12:34:58] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704286 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:35:05] (03CR) 10Jbond: dragonfly::dfdaemon: Make profile and module ensureable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:35:32] (03CR) 10Dzahn: "before I merge this I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/704286 which was a special case I did not expect but " [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:36:47] (03PS2) 10Dzahn: site/conftool: remove mw1281 through mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) [12:37:13] (03PS1) 10Filippo Giunchedi: logstash: use one replica across the board [puppet] - 10https://gerrit.wikimedia.org/r/704324 (https://phabricator.wikimedia.org/T250133) 
[12:37:52] (03CR) 10Dzahn: [C: 03+2] site/conftool: remove mw1281 through mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:38:19] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) [12:38:20] (03CR) 10Dzahn: [C: 04-1] "running decom cookbook before merging" [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:38:42] (03CR) 10Dzahn: [C: 04-1] "and of course first depooling with confctl" [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:38:55] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ema) [12:39:30] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704287 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:40:35] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw128[1-3].eqiad.wmnet [12:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:01] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10ema) [12:41:11] !log depooling and decom'ing eqiad API servers mw1281, mw1282, mw1283 - T280203 [12:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:19] T280203: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 [12:41:26] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10ema) [12:43:38] !log jmm@cumin2002 START 
- Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1732 hosts [12:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:10] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ema) [12:46:22] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10ema) [12:46:31] (03PS4) 10JMeybohm: dragonfly::dfdaemon: Make profile and module ensureable [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) [12:46:33] (03PS2) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [12:51:39] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ema) [12:52:17] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10ema) [12:53:25] (03PS1) 10Ladsgroup: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704176 (https://phabricator.wikimedia.org/T286260) [12:53:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1117.eqiad.wmnet with reason: Copy m5 from db1117 to db1183 T284622 [12:53:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1117.eqiad.wmnet with reason: Copy m5 from db1117 to db1183 T284622 [12:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:43] T284622: Rename dbstore1004 to db1183 and place it on m5 - https://phabricator.wikimedia.org/T284622 [12:53:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:53] !log stopping replication on db1117:3325 T284622 [12:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:55:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 1732 hosts [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:18] (03PS1) 10Muehlenhoff: Don't expect all hosts to be reachable for the logout cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704325 (https://phabricator.wikimedia.org/T283242) [12:57:26] (03PS3) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [12:58:15] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/704325 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:58:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [13:00:21] (03PS1) 10JMeybohm: admin_ng: Add a new tiller ClusterRole for flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/704326 [13:00:48] (03CR) 10Jelto: [C: 03+1] "lgtm" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/704143 (owner: 10Brennen Bearnes) [13:00:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/704325 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:01:07] (03CR) 10Ladsgroup: [C: 03+2] Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704176 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:01:18] 
(03PS1) 10Ladsgroup: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.13) - 10https://gerrit.wikimedia.org/r/704177 (https://phabricator.wikimedia.org/T286260) [13:01:44] (03CR) 10Ladsgroup: [C: 03+2] Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.13) - 10https://gerrit.wikimedia.org/r/704177 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:03:18] (03CR) 10Muehlenhoff: [C: 03+2] Don't expect all hosts to be reachable for the logout cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704325 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:03:41] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ssingh) [13:04:44] Amir1: wmf.13 will never get deployed, according to T281154 :). [13:04:45] T281154: 1.37.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T281154 [13:04:58] urbanecm: oh of course, I'm an idiot [13:05:04] Sorry, let me fix it [13:05:20] (03CR) 10Ladsgroup: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.13) - 10https://gerrit.wikimedia.org/r/704177 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:05:34] (03Abandoned) 10Ladsgroup: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.13) - 10https://gerrit.wikimedia.org/r/704177 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:05:41] No problem 🙂 [13:05:45] (03PS1) 10Ladsgroup: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704178 (https://phabricator.wikimedia.org/T286260) [13:05:58] (03CR) 10Ladsgroup: [C: 03+2] Send TTL instead of expiry in unix timestamp in calling BagOStuff 
[extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704178 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:06:20] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10ssingh) [13:07:50] (03CR) 10Jelto: [C: 03+1] "lgtm and promoting mw1422 to a scap proxy (or api server) is a good idea to not only have mw appserver and canaries in A3." [puppet] - 10https://gerrit.wikimedia.org/r/704319 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [13:08:09] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:08:57] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1732 hosts [13:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 1732 hosts [13:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:01] !log Upgraded Apache on gerrit1001 and gerrit2001 [13:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1732 hosts [13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Muehlenhoff out of all services on: 1732 hosts [13:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:16] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Vgutierrez) [13:12:18] (03PS2) 10Hashar: mwdeploy 
user is provided by LDAP on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/699427 (https://phabricator.wikimedia.org/T73480) [13:12:42] (03PS2) 10Hashar: beta: add warning motd and link to term of uses [puppet] - 10https://gerrit.wikimedia.org/r/699207 (https://phabricator.wikimedia.org/T100837) [13:14:06] !log restarted replication on db1117:3325 T284622 [13:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:12] T284622: Rename dbstore1004 to db1183 and place it on m5 - https://phabricator.wikimedia.org/T284622 [13:20:09] (03PS1) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [13:20:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:44] that's me ^ [13:22:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:23:53] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10jbond) [13:24:49] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Vgutierrez) [13:26:22] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10jbond) [13:28:32] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jbond) [13:28:34] (03Merged) 10jenkins-bot: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704176 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:29:40] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 2 hosts [13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:55] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jbond) [13:30:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 2 hosts [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:31] effie: tested on mwdebug, nothing showing up, I'm going to deploy to the world now [13:33:47] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/Wikibase/lib/includes/SimpleCacheWithBagOStuff.php: Backport: [[gerrit:704176|Send TTL instead of expiry in unix timestamp in calling 
BagOStuff (T286260)]] (duration: 00m 58s) [13:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:54] T286260: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 [13:34:16] effie: how we can safely do a php-fpm restart so apcu gets cleaned [13:34:40] is it necessary? [13:35:00] I mean it is [13:35:01] (03CR) 10jerkins-bot: [V: 04-1] Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704178 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:35:41] (03CR) 10Ladsgroup: [C: 03+2] "again" [extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704178 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:36:49] Amir1: ok ok, I was thinking that eventually all php-fpm instances will be restarted, or we will eventually make it to 2073 [13:37:03] :D [13:37:23] !log rolling restart php-fpm across clusters - T286260 [13:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:29] Thanks [13:38:02] the thing is that if it causes issues, I want to see it now, rather than later exploding :D [13:42:29] (03PS2) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [13:43:14] Amir1: which is why we should have waited for 2073, it would be somebody else's problem (TM) [13:43:59] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jbond) [13:53:26] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704324 (https://phabricator.wikimedia.org/T250133) (owner: 10Filippo Giunchedi) [13:53:35] !log otto@deploy1002 Started deploy [analytics/refinery@a3bc8bc]: Add 
eventlogging_legacy gobblin job - T271232 [13:53:36] (03PS1) 10Jbond: P:logoutd: fix typo scripts -> script [puppet] - 10https://gerrit.wikimedia.org/r/704337 [13:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:43] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [13:54:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704337 (owner: 10Jbond) [13:54:22] (03CR) 10Jbond: [C: 03+2] P:logoutd: fix typo scripts -> script [puppet] - 10https://gerrit.wikimedia.org/r/704337 (owner: 10Jbond) [13:55:42] effie: the change has obviously had an effect but it's negligible https://grafana.wikimedia.org/d/000000548/wikibase-sql-term-storage?orgId=1&refresh=30s [13:57:04] !log otto@deploy1002 Finished deploy [analytics/refinery@a3bc8bc]: Add eventlogging_legacy gobblin job - T271232 (duration: 03m 28s) [13:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:21] the hit ratio for example went from 95% to 90% [13:58:35] (03CR) 10Ottomata: [C: 03+2] Add gobbln job eventlogging_legacy [puppet] - 10https://gerrit.wikimedia.org/r/704159 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:59:14] jbond: FYI just meregd your tyop fix [13:59:17] puppet-merged ^ [13:59:19] (03Merged) 10jenkins-bot: Send TTL instead of expiry in unix timestamp in calling BagOStuff [extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704178 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [13:59:24] Amir1: overall it is fine so far [14:01:03] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30195/console" [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:04:21] ottomata: thanks :) [14:07:57] (03PS1) 10Kormat: db1183: Enable notifications.
[puppet] - 10https://gerrit.wikimedia.org/r/704341 (https://phabricator.wikimedia.org/T284622) [14:08:01] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Wikidata, and 3 others: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Ladsgroup) so this is fixed from wikibase point of view but we need to come to a decision to eit... [14:11:02] (03CR) 10Kormat: [C: 03+2] db1183: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/704341 (https://phabricator.wikimedia.org/T284622) (owner: 10Kormat) [14:11:10] (03PS3) 10Ottomata: Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) [14:11:13] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [14:11:28] (03CR) 10jerkins-bot: [V: 04-1] Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:12:57] (03PS4) 10Ottomata: Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) [14:13:18] (03CR) 10Joal: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:15:46] (03CR) 10Lucas Werkmeister (WMDE): "I think that would make sense, yeah." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [14:19:54] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Kormat) [14:21:45] (03PS1) 10Jbond: logout: Correctly parse additional arguments [puppet] - 10https://gerrit.wikimedia.org/r/704342 [14:21:52] (03PS1) 10Volans: decorators: improve the retry decorator [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 [14:25:57] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01034 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [14:27:19] (03CR) 10Jbond: [C: 03+2] logout: Correctly parse additional arguments [puppet] - 10https://gerrit.wikimedia.org/r/704342 (owner: 10Jbond) [14:28:12] (03PS1) 10Volans: tox: remove flake8-import-order [software/spicerack] - 10https://gerrit.wikimedia.org/r/704344 [14:28:14] (03PS1) 10Volans: decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) [14:33:02] Pchelolo: how do you do that so fast?! :) [14:33:28] ottomata: was reading my emails, got a new one, clicked a button :) [14:33:43] (03CR) 10jerkins-bot: [V: 04-1] decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:33:48] i just submitted the PR like 20 seconds before you merged it ! :) [14:35:19] (03CR) 10Volans: "The CI failures are expected because of the Depends-On (see the commit message) that needs to be merged and released first." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [14:35:31] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [14:35:31] !log nskaggs@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [14:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:02] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, all the cronjobs look to be properly matched by the new timers, didn't spot any typos. Also TIL: "minutely" :)" [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:39:16] (03CR) 10Jbond: "LGTM some questions inline, comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:43:57] (03CR) 10Jbond: "LGTM consider the other comments completely optional" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:44:04] (03PS1) 10Btullis: Fix the sre.aqs.roll-restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) [14:44:06] (03CR) 10Jbond: [C: 03+1] librenms: Migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:46:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "Agreed! 
I'll set up a reminder in my notes" [puppet] - 10https://gerrit.wikimedia.org/r/704324 (https://phabricator.wikimedia.org/T250133) (owner: 10Filippo Giunchedi) [14:46:53] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: use one replica across the board [puppet] - 10https://gerrit.wikimedia.org/r/704324 (https://phabricator.wikimedia.org/T250133) (owner: 10Filippo Giunchedi) [14:48:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/704344 (owner: 10Volans) [14:49:57] (03CR) 10Ladsgroup: librenms: Migrate crons to systemd timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:51:31] (03CR) 10Cathal Mooney: [C: 03+2] librenms: Migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:51:44] is it possible to peek at the job queue in production? e.g. get a job of a certain type (ideally without removing it) and look at its parameters in shell.php or similar [14:51:50] https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue#Debugging only has instructions for pushing a null job [14:52:00] (03CR) 10Jbond: [C: 03+1] librenms: Migrate crons to systemd timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703909 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:52:14] !log volker-e@deploy1002 Started deploy [design/style-guide@5c07233]: Deploy design/style-guide: 5c07233 “Components”: Add WikimediaUI theme Figma links to various components (#483) [14:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:21] !log volker-e@deploy1002 Finished deploy [design/style-guide@5c07233]: Deploy design/style-guide: 5c07233 “Components”: Add WikimediaUI theme Figma links to various components (#483) (duration: 00m 06s) [14:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:53:17] it looks like JobQueueGroup::pop(), without calling run() or ack(), might be more or less like a peek(), but I don’t know if I want to try that out ^^ [14:53:53] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) a:05jijiki→03RobH >>! In T286463#7207986, @jijiki wrote: > I depooled mw2383 as it was behaving as before. @RobH is it possible to run hardware tests on the host? I suspect it might be a hardware issue. Thank... [14:54:45] (03PS1) 10Nskaggs: Remove legacy wiki replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/704348 (https://phabricator.wikimedia.org/T260389) [14:56:38] (03CR) 10Volans: [C: 03+1] "LGTM, all comments are optional depending on future plans for the canaries." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [14:57:07] PROBLEM - Host mw2383 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:55] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) This is now running Dell's hardware test suite.
[15:00:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:00:19] (03PS1) 10Ssingh: test_dns: improve the EDNS client subnet tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/704349 [15:00:39] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:39] (03CR) 10Ssingh: [C: 03+2] test_dns: improve the EDNS client subnet tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/704349 (owner: 10Ssingh) [15:05:35] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [15:06:36] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Ladsgroup) Oh it is: https://wikitech.wikimedia.org/wiki/Mailman#Disable_or_re-enable_a_mailing_list Do you think it's missing something? [15:09:23] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [15:09:45] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:10:17] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:38] (03PS1) 10Muehlenhoff: Remove account expiry for kgordon [puppet] - 10https://gerrit.wikimedia.org/r/704352 [15:11:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:14:35] 10SRE, 10SRE Observability, 10Traffic, 10Patch-For-Review: Implement SLI measurement for Varnish Frontend - https://phabricator.wikimedia.org/T284576 (10ema) 05Open→03Resolved a:03ema This is now done, including a basic dashboard based on the metrics introduced as part of this task: https://grafana.w... 
[15:14:55] (03PS1) 10Ottomata: eventgate-analytics - bump to 2021-07-13-151027-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/704353 (https://phabricator.wikimedia.org/T272714) [15:17:32] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - bump to 2021-07-13-151027-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/704353 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:17:44] (03CR) 10Btullis: "> Patch Set 1: Code-Review+1" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:19:30] (03CR) 10Ottomata: Fix the sre.aqs.roll-restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:19:56] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [15:19:56] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[15:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] (03CR) 10Volans: [C: 03+1] Fix the sre.aqs.roll-restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:21:36] (03CR) 10Elukey: Fix the sre.aqs.roll-restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:32:17] (03PS4) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [15:33:09] (03PS5) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [15:33:32] (03CR) 10RLazarus: [C: 03+1] switch mwmaint.discovery (noc.wm.org backend) from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/704293 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [15:34:06] !log Adding IX peering to AS393950 (Xiber LLC) on cr2-eqiad. 
[15:34:09] (03PS11) 10Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:53] (03PS6) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [15:38:47] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30197/console" [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:39:04] (03PS1) 10Brennen Bearnes: Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704181 (https://phabricator.wikimedia.org/T286521) [15:39:28] (03CR) 10Ppchelko: [C: 03+1] Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704181 (https://phabricator.wikimedia.org/T286521) (owner: 10Brennen Bearnes) [15:39:36] (03PS1) 10Brennen Bearnes: Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704182 (https://phabricator.wikimedia.org/T286521) [15:39:59] (03CR) 10Ppchelko: [C: 03+1] Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704182 (https://phabricator.wikimedia.org/T286521) (owner: 10Brennen Bearnes) [15:40:12] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:42:57] (03PS1) 10Jcrespo: Revert "mediabackup: Install minio on the storage hosts and open port 9000" [puppet] - 10https://gerrit.wikimedia.org/r/704184 [15:44:35] (03CR) 10jerkins-bot: [V: 04-1] Revert 
"mediabackup: Install minio on the storage hosts and open port 9000" [puppet] - 10https://gerrit.wikimedia.org/r/704184 (owner: 10Jcrespo) [15:45:22] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "mediabackup: Install minio on the storage hosts and open port 9000" [puppet] - 10https://gerrit.wikimedia.org/r/704184 (owner: 10Jcrespo) [15:45:52] (03CR) 10Brennen Bearnes: [C: 03+2] "Merging this one - wmf.14 is not presently checked out on deployment server, so no further action should be required on the wmf.14 backpor" [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704182 (https://phabricator.wikimedia.org/T286521) (owner: 10Brennen Bearnes) [15:45:59] PROBLEM - Check systemd state on backup2004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,minio.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:41] PROBLEM - Check systemd state on backup2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,minio.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:35] (03PS1) 10Jcrespo: Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - 10https://gerrit.wikimedia.org/r/704185 [15:47:55] RECOVERY - Check systemd state on backup2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:35] RECOVERY - Check systemd state on backup2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:26] (03PS7) 10JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) [15:51:28] (03PS1) 10JMeybohm: dragonfly: Trim newlines in config files [puppet] - 10https://gerrit.wikimedia.org/r/704360 (https://phabricator.wikimedia.org/T286054) [15:51:58] (03CR) 10Jcrespo: "The syntax:" [puppet] - 
10https://gerrit.wikimedia.org/r/704185 (owner: 10Jcrespo) [15:53:56] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30198/console" [puppet] - 10https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:57:42] (03PS2) 10Ladsgroup: objectcache: Normalize exptime to ttl in APCu and WinCache [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703892 (https://phabricator.wikimedia.org/T286260) [15:57:47] (03Abandoned) 10Ladsgroup: objectcache: Normalize exptime to ttl in APCu and WinCache [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703892 (https://phabricator.wikimedia.org/T286260) (owner: 10Ladsgroup) [16:00:05] jbond42 and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T1600). [16:00:05] majavah: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:46] 👋 looking [16:01:59] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [16:02:08] (03CR) 10Btullis: Fix the sre.aqs.roll-restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:02:15] (03CR) 10Btullis: [C: 03+2] Fix the sre.aqs.roll-restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:03:57] majavah: around? 
I'm able to merge that patch as part of the puppet request window if it's really necessary, but ordinarily it would be better to get a WMCS SRE to review instead [16:04:58] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Dwisehaupt) [16:05:16] (Merged) jenkins-bot: Fix the sre.aqs.roll-restart cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704347 (https://phabricator.wikimedia.org/T269925) (owner: Btullis) [16:05:39] (PS3) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) [16:06:49] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Dwisehaupt) @cmooney Not a problem. I have updated the task to remove the pre/post tasks we were looking at. Always good for us to think about failure...
[16:07:28] (03Merged) 10jenkins-bot: Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704182 (https://phabricator.wikimedia.org/T286521) (owner: 10Brennen Bearnes) [16:09:37] (03PS1) 10Vgutierrez: admin: Add iflorez to analytics-group-users [puppet] - 10https://gerrit.wikimedia.org/r/704364 (https://phabricator.wikimedia.org/T286509) [16:10:08] (03PS2) 10Vgutierrez: admin: Add iflorez to analytics-product-users [puppet] - 10https://gerrit.wikimedia.org/r/704364 (https://phabricator.wikimedia.org/T286509) [16:16:19] (03PS1) 10Volans: admin: update my home files (volans) [puppet] - 10https://gerrit.wikimedia.org/r/704387 [16:17:16] (03CR) 10Volans: [C: 03+2] admin: update my home files (volans) [puppet] - 10https://gerrit.wikimedia.org/r/704387 (owner: 10Volans) [16:17:27] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.002188 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [16:25:00] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw128[1-3].eqiad.wmnet [16:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw1281.eqiad.wmnet with reason: decom T28203 [16:25:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw1281.eqiad.wmnet with reason: decom T28203 [16:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:41] T28203: update search index and cached data on de.ws - https://phabricator.wikimedia.org/T28203 [16:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[1282-1283].eqiad.wmnet with reason: decom T28203 [16:26:02] !log dzahn@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[1282-1283].eqiad.wmnet with reason: decom T28203 [16:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1281.eqiad.wmnet [16:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:33] (03CR) 10Dave Pifke: "Should be fine; I'll cherry-pick in deployment-prep to test." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:29:57] (03CR) 10Ladsgroup: arclamp: Migrate crons to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:32:04] (03CR) 10Ladsgroup: arclamp: Migrate crons to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:34:05] rzl: sorry I'm late, thought the window was an hour later [16:35:21] no-one really owns deployment-prep so reviews on ops/puppet related to that are really hard to get, and I've already cherry-picked that already to its local puppet master and tested it, working fine [16:35:22] (03CR) 10Dave Pifke: [C: 04-1] "dpifke@deployment-webperf12:~$ sudo puppet agent -t" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:36:16] (03CR) 10Muehlenhoff: [C: 03+2] Remove account expiry for kgordon [puppet] - 10https://gerrit.wikimedia.org/r/704352 (owner: 10Muehlenhoff) [16:37:07] (03CR) 10Dave Pifke: [C: 04-1] "I think we can probably leave out the environment for ensure => absent on the old cron jobs, and/or remove them manually." 
[puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:37:16] jouncebot now [16:37:16] For the next 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T1600) [16:37:36] (03CR) 10Muehlenhoff: "Looks good, but needs approval by Otto for analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/704364 (https://phabricator.wikimedia.org/T286509) (owner: 10Vgutierrez) [16:37:48] majavah: nod, understood -- merging for now, not sure it's a good long-term fit for the puppet window but we can try and figure something out :) [16:38:17] (03CR) 10Brennen Bearnes: [C: 03+2] Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704181 (https://phabricator.wikimedia.org/T286521) (owner: 10Brennen Bearnes) [16:38:39] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10RobH) [16:38:53] rzl: i'm going to go ahead and sling out the above for an UBN, assuming nothing in the puppet window conflicts [16:38:58] having one or more SREs active in deployment-prep would be the more ideal solution. Yuvi may have been the last person to try and fill that role. [16:39:01] (can hold off a few otherwise) [16:39:41] brennen: my Puppet patch is practically a no-op at this point, go ahead [16:39:44] brennen: should be fine, go ahead and thanks for checking [16:39:49] ack, thanks. 
[16:39:58] (03CR) 10Juan90264: [C: 03+1] Adding e use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [16:40:31] rzl: agree that a better workflow would be nice, but unfortunately this is one of the only ways I can get some (well-needed) maintenance done there [16:40:56] (03CR) 10RLazarus: [C: 03+2] beta: remove deployment-deploy02 [puppet] - 10https://gerrit.wikimedia.org/r/700426 (https://phabricator.wikimedia.org/T278689) (owner: 10Majavah) [16:41:46] majavah: yeah, for sure -- I'm trying not to leave you hanging, just also trying not to sign my colleagues up for any specific long-term fix without their agreement :) [16:41:48] (03PS4) 10Jbond: statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) [16:42:58] (03CR) 10Juan90264: [C: 03+1] Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [16:43:17] (03CR) 10Juan90264: [C: 03+1] Adding square logo and wordmark for Wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [16:44:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:44:35] got it, thank you for your help [16:44:38] (03CR) 10Jbond: [C: 03+2] setup.py: create an entry point so we have an executable script [software/statograph] - 10https://gerrit.wikimedia.org/r/704314 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:44:43] (03CR) 10Jbond: [C: 03+2] statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) 
(owner: 10Jbond) [16:44:51] (03CR) 10Juan90264: [C: 03+1] Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [16:45:11] (03CR) 10Juan90264: [C: 03+1] Adding square wordmark for ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [16:45:51] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:46:08] (03Merged) 10jenkins-bot: setup.py: create an entry point so we have an executable script [software/statograph] - 10https://gerrit.wikimedia.org/r/704314 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:46:10] (03Merged) 10jenkins-bot: statograph: add debian folder allowing us to package [software/statograph] - 10https://gerrit.wikimedia.org/r/704133 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:46:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] setup.py: create an entry point so we have an executable script [software/statograph] - 10https://gerrit.wikimedia.org/r/704314 (https://phabricator.wikimedia.org/T285569) (owner: 10Jbond) [16:46:37] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:46:38] bd808: oops, meant to say -- yeah, that sounds right to me in principle, just not sure offhand the best way to make it happen [16:47:24] (03PS2) 10Ladsgroup: arclamp: Migrate crons to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) [16:47:57] (03CR) 10jerkins-bot: [V: 04-1] arclamp: Migrate crons to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [16:48:07] rzl: *nod* I hope I 
didn't imply that you should be on the hook either :) [16:48:48] ahaha if you did, I hope I sidestepped it with equal grace and subtlety [16:51:05] * bd808 waves to the commons and shouts that others should fix it [16:51:27] Someone™ [16:51:34] (PS3) Ladsgroup: arclamp: Migrate crons to systemd timer [puppet] - https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) [16:51:53] bd808: I mean that's what practically everyone does today [16:53:54] SRE, ops-eqiad, DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (RobH) I can handle this, and a firmware upgrade takes anywhere from 5 to 30 minutes (depending on if its only bios, etc..) This is just bios, so I can handle this anytime this week. @LSobanski: Please have s... [16:55:11] !log upload statograph to buster wikimedia [16:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:37] (Merged) jenkins-bot: Do not lock user_preferences before updating [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704181 (https://phabricator.wikimedia.org/T286521) (owner: Brennen Bearnes) [16:57:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1104.eqiad.wmnet with reason: Firmware upgrade T286226 [16:57:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1104.eqiad.wmnet with reason: Firmware upgrade T286226 [16:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:06] T286226: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 [16:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Firmware upgrade T286226 [16:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:27] !log kormat@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Firmware upgrade T286226 [16:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] chrisalbon and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T1700). [17:00:54] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Kormat) a:05Kormat→03RobH Machine is depooled and ready for firmware upgrade. [17:00:56] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:01:56] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1099 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:56] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:56] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1111 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat 
Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:56] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1114 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:56] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1116 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:56] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1171 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:57] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1177 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:01:57] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db1178 is 
CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1104.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1104.eqiad.wmnet (111 Connection refused) Kormat Forgot to downtime the section _first_ https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:02:05] sorry for the spam folk, please ignore [17:02:12] if you're unable to ignore, please blame elukey [17:02:29] lol, thanks [17:02:45] mutante: i'm very considerate [17:03:02] yes, you are :) [17:03:04] :-D [17:03:25] what's wrong with s8 then? :-P [17:03:38] (03PS1) 10DCausse: [cirrus] switch more_like traffic to codfw 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704389 [17:03:41] (03PS1) 10DCausse: [cirrus] switch more_like traffic to codfw 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704390 [17:03:46] apergos: nothing is wrong with s8 and it's elukey's fault, that's what I've understood [17:04:10] hahahahahaha [17:05:01] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.12/includes/user/UserOptionsManager.php: Backport: [[gerrit:704181|Do not lock user_preferences before updating (T286521)]] (duration: 01m 58s) [17:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:07] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521 [17:06:42] hmm - sync to mw2383.codfw.wmnet timed out during that deploy. 
[17:06:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1281.eqiad.wmnet [17:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:10] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1281.eqiad.wmnet` - m... [17:07:10] brennen: that host has a problem.. [17:07:26] https://phabricator.wikimedia.org/T286463 [17:07:35] T286463 [17:07:36] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [17:08:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1282.eqiad.wmnet [17:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:42] !log mw1282 - decom, powered off [17:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:48] mutante: thx. 
[17:10:11] brennen: yw, hope it doesn't block deployment [17:10:40] it is depooled but not taken out of scap groups [17:11:40] brennen: last deadlock happened 10 minutes ago, seems to be working [17:15:30] kind of missing gerrit bot [17:15:49] (03PS1) 10Dzahn: logstash/tests: also replace IP of mw1281 with IP of mw1391 [puppet] - 10https://gerrit.wikimedia.org/r/704393 (https://phabricator.wikimedia.org/T280203) [17:16:10] (03CR) 10Dzahn: "there was more: https://gerrit.wikimedia.org/r/c/operations/puppet/+/704393" [puppet] - 10https://gerrit.wikimedia.org/r/704286 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [17:16:28] rzl: nobody that took my part, all supporting kormat, good to know [17:16:59] (03CR) 10Dzahn: [C: 03+2] "host 10.64.0.195" [puppet] - 10https://gerrit.wikimedia.org/r/704393 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [17:17:03] (03CR) 10Dzahn: [V: 03+2 C: 03+2] logstash/tests: also replace IP of mw1281 with IP of mw1391 [puppet] - 10https://gerrit.wikimedia.org/r/704393 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [17:17:52] elukey: 📓 [17:19:35] Pchelolo: rad, thanks for the patch. [17:20:19] (03CR) 10Bstorm: [C: 03+2] Renew PAWS Prometheus certificates [puppet] - 10https://gerrit.wikimedia.org/r/704279 (owner: 10Majavah) [17:20:25] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw1282.eqiad.wmnet [17:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:39] WARNING:homer:Too many invalid answers, commit aborted on asw2-a-eqiad.mgmt.eqiad.wmnet [17:20:40] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1282.eqiad.wmnet` - m... 
[17:20:43] arrrr :( [17:21:29] volans: is there a proper way to repeat just the homer step of the decom cookbook? [17:22:03] should I just try and run the whole cookbook another time? [17:22:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1282.eqiad.wmnet [17:22:55] (03PS1) 10Effie Mouzeli: profile::osm_master: add tilenator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) [17:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:02] * mutante does that [17:23:22] (03CR) 10jerkins-bot: [V: 04-1] profile::osm_master: add tilenator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [17:25:15] (03CR) 10Dzahn: [C: 04-1] "duplicate-ish of https://gerrit.wikimedia.org/r/c/operations/puppet/+/704283" [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [17:27:08] mutante: if it's just the homer step you can run homer manually for the hosts's switch [17:28:01] (03PS2) 10Effie Mouzeli: profile::osm_master: add tilenator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) [17:28:21] (03PS3) 10Dzahn: DHCP: remove mw1285 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) [17:29:16] (03CR) 10Andrew Bogott: [C: 03+2] Add grafana-cloud.w.o as alias of grafana-labs [puppet] - 10https://gerrit.wikimedia.org/r/684100 (owner: 10Majavah) [17:29:25] volans: thank you, I ran the cookbook again and seeing if that also skips everything else [17:30:14] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw1282.eqiad.wmnet [17:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:22] 10SRE, 
10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1282.eqiad.wmnet` - m... [17:30:31] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10RobH) So for best practices, I tend to upload a new idrac firmware before anything else. Since idrac handles the firmware updates of other things, it just seems safest. That being stated, this host is taking f... [17:31:33] (03CR) 10Andrew Bogott: [C: 03+2] Add grafana-cloud.{wm.o,d.wmnet} to replace labs [dns] - 10https://gerrit.wikimedia.org/r/684099 (owner: 10Majavah) [17:31:37] volans: fyi, it fails because DNS part already removed mgmt and then it tries to downtime that non-existing host in Icinga [17:31:39] (03PS5) 10Andrew Bogott: Add grafana-cloud.{wm.o,d.wmnet} to replace labs [dns] - 10https://gerrit.wikimedia.org/r/684099 (owner: 10Majavah) [17:35:07] volans: I found the switch from netbox and ran 'homer "asw*eqiad*" diff [17:41:09] !log homer "asw2-a*eqiad*" commit "decom mw1282 - T280203" [17:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:16] T280203: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 [17:44:03] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10RobH) This has gone exceedingly poorly. I updated the idrac firmware, and idrac is accessible via SSH but not via HTTPS, as its self signed cert changed to one unsupported (even with bypass option via advanced... 
[17:44:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1283.eqiad.wmnet [17:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:52] !log mw1283 - decom - powered off by cookbook [17:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:29] (03CR) 10Majavah: metricsinfra: Add HAProxy for distributing http traffic (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [17:49:52] (03PS2) 10Jcrespo: Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - 10https://gerrit.wikimedia.org/r/704185 [17:50:20] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - 10https://gerrit.wikimedia.org/r/704185 (owner: 10Jcrespo) [17:52:26] (03PS3) 10Jcrespo: Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - 10https://gerrit.wikimedia.org/r/704185 [17:52:58] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - 10https://gerrit.wikimedia.org/r/704185 (owner: 10Jcrespo) [17:55:30] (03PS4) 10Jcrespo: Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - 10https://gerrit.wikimedia.org/r/704185 [17:57:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1283.eqiad.wmnet [17:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:29] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1283.eqiad.wmnet` - m... 
[17:57:42] SRE, ops-eqiad, DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (RobH) I downloaded the 2.80.80.80 idrac firmware file, but the idrac was 2.4.something. It seems it had to increment it from the 2.4.x up to 2.61.61.61 and that version also introduced the https error. I was a... [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T1800) [18:00:54] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) [18:00:59] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:02:18] mutante: sorry was afk, yes that works [18:02:27] * volans off for now [18:02:33] volans: yep, all is good. cya [18:12:35] SRE, ops-eqiad, DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (RobH) a:RobH→LSobanski I am at a loss on how to proceed for this, without connecting a crash cart and rolling back the idrac version via crash cart. Basically the latest idrac firmware made the https id...
[18:13:32] (03CR) 10Jcrespo: "This is the new full diff:" [puppet] - 10https://gerrit.wikimedia.org/r/704185 (owner: 10Jcrespo) [18:15:02] (03CR) 10Jcrespo: Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704185 (owner: 10Jcrespo) [18:18:34] (03PS3) 10Effie Mouzeli: profile::osm_master: add tilenator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) [18:23:48] (03PS4) 10Ladsgroup: arclamp: Migrate crons to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) [18:23:49] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [18:24:54] (03PS1) 10Jdlrobson: links is flat array [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) [18:25:36] (03CR) 10Ahmon Dancy: "Thanks!" 
[core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: Jdlrobson) [18:26:44] (CR) Jdlrobson: "Just waiting for confirmation this addresses the issue on the beta cluster at which time I will remove the -1,and then Dancy you should fe" [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: Jdlrobson) [18:27:45] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm) [18:34:38] (PS4) Effie Mouzeli: profile::osm_master: add tilenator users for kubepod subnets [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) [18:35:32] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm) cloudstore servers are both in the same rack. They are a cluster, and it will simply be offline. We will make a task to verify that it come...
[18:39:45] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.006929 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [18:41:05] (03CR) 10jerkins-bot: [V: 04-1] links is flat array [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: 10Jdlrobson) [18:42:12] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) [18:46:59] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) [18:47:52] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Kormat) @RobH : as the host is still accessible, i'm going to repool it in the meantime. [18:48:11] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) Switching cloudmetrics to just eat the brief outage. I don't think it will be a big deal. We can just check it after. [18:49:40] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) The WMCS-owned dbproxy1018 and 1019 make up the entire cluster, so that's just an outage no matter what for wikireplicas. It will just need do... [18:57:20] (03CR) 10Jdlrobson: [C: 03+1] "okay I can confirm this addresses the issue." 
[core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: 10Jdlrobson) [18:57:47] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Bstorm) @Gehel I'm not sure who to tag in for cloudelastic here. If there are 2 of them in this row, that could be something that requires attention for redundancy.... [18:58:55] (03CR) 10Jdlrobson: [C: 03+1] "recheck" [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: 10Jdlrobson) [19:00:04] dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T1900). [19:02:18] !log razzi@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [19:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:07] (03CR) 10Abijeet Patro: "This change is ready for review." [extensions/Translate] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704405 (https://phabricator.wikimedia.org/T285830) (owner: 10Abijeet Patro) [19:04:13] (03CR) 10Abijeet Patro: "This change is ready for review." [extensions/Translate] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704404 (https://phabricator.wikimedia.org/T285830) (owner: 10Abijeet Patro) [19:08:10] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Bstorm) [19:08:38] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Gehel) Ryan should be around tomorrow to double check, but cloudelastic should be resilient to a row failure. Worst case the service will be down for the duration of... 
[19:20:40] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Bstorm) [19:29:49] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30203/console" [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [19:29:51] (03Abandoned) 10Ottomata: refine: rename EventLoggingSanitization to RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/675936 (https://phabricator.wikimedia.org/T273789) (owner: 10Razzi) [19:30:19] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) [19:31:04] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Finalize eventlogging_legacy gobblin job migration [puppet] - 10https://gerrit.wikimedia.org/r/704161 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [19:34:03] (03PS1) 10Ottomata: Ensure camus eventlogging job is absent [puppet] - 10https://gerrit.wikimedia.org/r/704412 (https://phabricator.wikimedia.org/T271232) [19:35:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:53] (03CR) 10Ottomata: [C: 03+2] Ensure camus eventlogging job is absent [puppet] - 10https://gerrit.wikimedia.org/r/704412 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [19:36:07] (03CR) 10Ahmon Dancy: [C: 03+2] links is flat array [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: 10Jdlrobson) [19:39:33] (03CR) 10Effie Mouzeli: "PCC looks ok https://puppet-compiler.wmflabs.org/compiler1003/30202/maps1009.eqiad.wmnet/index.html" [puppet] - 
10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [19:46:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:33] !log dancy@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.14 [19:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:02] (03Merged) 10jenkins-bot: links is flat array [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704368 (https://phabricator.wikimedia.org/T286040) (owner: 10Jdlrobson) [20:06:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) I confirmed backup1004 already had raid0 ssd set to bootable, it did, and rebooted into the installer, where it worked.... I have no idea what kind of race condition is... [20:07:23] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` backup1004.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[20:07:52] (03PS1) 10Andrew Bogott: profile::wmcs::prometheus: Don't pass $site to prometheus::class_config [puppet] - 10https://gerrit.wikimedia.org/r/704417 [20:09:32] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::prometheus: Don't pass $site to prometheus::class_config [puppet] - 10https://gerrit.wikimedia.org/r/704417 (owner: 10Andrew Bogott) [20:19:29] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.14 (duration: 31m 56s) [20:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:58] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) Ok, so in the installer, the error is: ` Unable to install GRUB in /dev/sdb Executing 'grub-install /dev/sdb' failed. This is a fatal error. `... [20:20:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['backup1004.eqiad.wmnet'] ` [20:23:12] (03PS2) 10Jbond: profile::wmcs::prometheus: Don't pass $site to prometheus::class_config [puppet] - 10https://gerrit.wikimedia.org/r/704417 (owner: 10Andrew Bogott) [20:24:37] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::prometheus: Don't pass $site to prometheus::class_config [puppet] - 10https://gerrit.wikimedia.org/r/704417 (owner: 10Andrew Bogott) [20:24:39] (03CR) 10Jbond: profile::wmcs::prometheus: Don't pass $site to prometheus::class_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704417 (owner: 10Andrew Bogott) [20:24:57] (03PS1) 10Urbanecm: SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704371 (https://phabricator.wikimedia.org/T286587) [20:25:09] (03PS1) 10Urbanecm: SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector 
[extensions/GrowthExperiments] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704372 (https://phabricator.wikimedia.org/T286587) [20:26:46] !log dancy@deploy1002 Pruned MediaWiki: 1.37.0-wmf.9 (duration: 04m 21s) [20:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:34] (03PS3) 10Jbond: profile::wmcs::prometheus: Don't pass $site to prometheus::class_config [puppet] - 10https://gerrit.wikimedia.org/r/704417 (owner: 10Andrew Bogott) [20:30:54] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.14/includes/skins/Skin.php: Backport: [[gerrit:704368|links is flat array (T286040)]] (duration: 02m 07s) [20:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:01] T286040: "InvalidArgumentException: $key must be a string or an array" on beta login and vote wikis - https://phabricator.wikimedia.org/T286040 [20:32:11] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::prometheus: Don't pass $site to prometheus::class_config [puppet] - 10https://gerrit.wikimedia.org/r/704417 (owner: 10Andrew Bogott) [20:44:44] (03PS1) 10Andrew Bogott: cloudmetrics: add grafana-cloud alias alongside grafana-labs [puppet] - 10https://gerrit.wikimedia.org/r/704419 [20:45:24] (03CR) 10Andrew Bogott: [C: 03+2] cloudmetrics: add grafana-cloud alias alongside grafana-labs [puppet] - 10https://gerrit.wikimedia.org/r/704419 (owner: 10Andrew Bogott) [20:49:04] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [20:53:43] !log razzi@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [20:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:15] (03PS1) 10RLazarus: icinga: Performance improvements to icinga-status [puppet] - 10https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) [20:59:21] (03CR) 10BryanDavis: [C: 03+1] jessie 
deprecation: don't build jessie containers when rebuilding (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/703005 (owner: 10Bstorm) [21:11:56] (03CR) 10Bstorm: jessie deprecation: don't build jessie containers when rebuilding (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/703005 (owner: 10Bstorm) [21:12:48] (03CR) 10Bstorm: [C: 03+2] jessie deprecation: don't build jessie containers when rebuilding [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/703005 (owner: 10Bstorm) [21:13:27] (03Merged) 10jenkins-bot: jessie deprecation: don't build jessie containers when rebuilding [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/703005 (owner: 10Bstorm) [21:14:31] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10jcrespo) The partitioning is working "as expected" (it is not a partman problem), the issue is with disks- I can only see an sda of "SSD" size and sdb of "HD size", while I wou... [21:28:51] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Quiddity) >>! In T286371#7208826, @Ladsgroup wrote: > Oh it is: https://wikitech.wikimedia.org/wiki/Mailman#Disable_or_re-enable_a_maili... [21:30:28] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10jcrespo) I confirm the issue is that there is a difference on the setup of the eqiad and codfw hosts. The eqiad ones have configured the SSDs as a virtual RAID disk on hardware... 
[22:04:23] (03PS1) 10Legoktm: Use Score with lilypond's safe mode only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704433 [22:04:48] jouncebot: next [22:04:48] In 0 hour(s) and 55 minute(s): Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T2300) [22:06:02] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 2719 MB (10% inode=95%): /tmp 2719 MB (10% inode=95%): /var/tmp 2719 MB (10% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [22:06:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` backup1004.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [22:08:41] (03CR) 10Tim Starling: [C: 03+1] Use Score with lilypond's safe mode only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704433 (owner: 10Legoktm) [22:08:51] (03CR) 10Legoktm: [C: 03+2] Use Score with lilypond's safe mode only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704433 (owner: 10Legoktm) [22:09:32] (03Merged) 10jenkins-bot: Use Score with lilypond's safe mode only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704433 (owner: 10Legoktm) [22:11:22] dancy: um, is there a reason "testwikis wikis to 1.37.0-wmf.14" is on the deploy1002 git repo but not in Gerrit? [22:12:23] hmm. Lemme look at that. [22:15:05] dancy: if it's OK with you, I'd like to rebase so I can sync out my config change now; your testwikis patch would still be on top [22:15:07] OK. I had cancelled the operation to push that commit so that I could fix my ssh-agent setup. 
When I re-ran, the script didn't think it needed to push. I'll look into that. [22:15:24] OK with me to rebase. [22:16:27] ack, syncing now [22:17:59] > ssh: connect to host mw2383.codfw.wmnet port 22: Connection timed out [22:18:06] expected. ignore [22:18:25] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Use Score with lilypond's safe mode only (duration: 02m 06s) [22:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:34] * dancy https://phabricator.wikimedia.org/T286463 [22:18:53] I'll push wikiversions.json now. [22:19:00] thanks and thanks :) [22:19:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1004.eqiad.wmnet with reason: REIMAGE [22:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:28] if it seems like mw2383 is going to be out for a while, we should remove it from scap/dsh [22:19:59] It's been offline all day from my perspective, so I'm in favor. [22:20:59] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704434 [22:21:01] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704434 (owner: 10Ahmon Dancy) [22:21:41] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704434 (owner: 10Ahmon Dancy) [22:22:13] `/srv/mediawiki-staging` is in a clean state now. 
[22:22:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1004.eqiad.wmnet with reason: REIMAGE [22:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:04] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [22:29:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1004.eqiad.wmnet'] ` and were **ALL** successful. [22:32:00] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) >>! In T277327#7210496, @jcrespo wrote: > I confirm the issue is that there is a difference on the setup of the eqiad and codfw hosts. The eqiad ones have configured the... [22:32:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [22:34:17] (03CR) 10Ppchelko: [C: 03+1] Update src/defines.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701467 (owner: 10Tim Starling) [22:44:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet', 'backup1007.eqiad.wm... [22:47:28] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) >>! In T277327#7210496, @jcrespo wrote: > ... > Admittedly, this is a "weird" decision, as it has several drawbacks: no hot-plugging, overhead/performance, and mixed OS a... 
[22:50:10] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for an Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210713T2300). [23:00:04] Urbanecm: A patch you scheduled for Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:15] I'll self-serve [23:00:29] (03CR) 10Urbanecm: [C: 03+2] SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704371 (https://phabricator.wikimedia.org/T286587) (owner: 10Urbanecm) [23:00:31] (03CR) 10Urbanecm: [C: 03+2] SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704372 (https://phabricator.wikimedia.org/T286587) (owner: 10Urbanecm) [23:01:45] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1007.eqiad.wmnet with reason: REIMAGE [23:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:24] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:03:58] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1007.eqiad.wmnet with reason: REIMAGE [23:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:43] 10SRE, 10ops-codfw: mgmt on logstash2021 inaccessible - 
https://phabricator.wikimedia.org/T286274 (10wiki_willy) a:03Papaul [23:07:32] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1005.eqiad.wmnet with reason: REIMAGE [23:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:45] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1006.eqiad.wmnet with reason: REIMAGE [23:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:40] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1005.eqiad.wmnet with reason: REIMAGE [23:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:42] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1006.eqiad.wmnet with reason: REIMAGE [23:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:17] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1007.eqiad.wmnet', 'backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet'] ` and were **ALL** successful. [23:18:58] (03Merged) 10jenkins-bot: SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704371 (https://phabricator.wikimedia.org/T286587) (owner: 10Urbanecm) [23:19:00] (03Merged) 10jenkins-bot: SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704372 (https://phabricator.wikimedia.org/T286587) (owner: 10Urbanecm) [23:21:25] (03CR) 10Juan90264: [C: 03+1] "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: 10Juan90264) [23:22:36] 10SRE: Update email address for samtar - https://phabricator.wikimedia.org/T286624 (10Samtar) [23:23:00] works fine at mwdebug, syncing [23:24:58] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/GrowthExperiments/includes/Specials/SpecialCreateAccountCampaign.php: f3627361ff558c89d4a4452ff24b3457f46a4f46: SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector (T286587) (duration: 02m 07s) [23:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:05] T286587: Donors to newcomers: campaign Special:CreateAccount& redirects to regular Special:CreateAccount after clicking on a lang link - https://phabricator.wikimedia.org/T286587 [23:27:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [23:27:24] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.14/extensions/GrowthExperiments/includes/Specials/SpecialCreateAccountCampaign.php: f3627361ff558c89d4a4452ff24b3457f46a4f46: SpecialCreateAccountCampaign: Ignore $wgLoginLanguageSelector (T286587) (duration: 02m 08s) [23:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) [23:28:57] (03PS1) 10Samtar: Update email address for samtar [puppet] - 10https://gerrit.wikimedia.org/r/704440 (https://phabricator.wikimedia.org/T286624) [23:29:14] can someone remove mw2383 from scap group, until T286463 is resolved? 
scap works, and the warning was known to me, but...it doubles the sync time, as scap waits 60 secs for mw2383 to timeout [23:29:14] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [23:30:35] 10SRE, 10Patch-For-Review: Update email address for samtar in ldap users - https://phabricator.wikimedia.org/T286624 (10Samtar) [23:32:34] ^ well that was nerve-racking, if anyone would like to have a quick look at T286624 and make sure I've not done anything silly, I'd really appreciate it (first patch in a *long* time) [23:32:34] T286624: Update email address for samtar in ldap users - https://phabricator.wikimedia.org/T286624 [23:32:56] 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10RobH) [23:33:25] (nb. https://www.mediawiki.org/wiki/Gerrit/Tutorial#Prepare_to_work_with_Gerrit is *really* well written, thank you!) [23:33:40] (03CR) 10Dave Pifke: [C: 03+1] "$ sudo puppet agent -t" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [23:35:02] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (10RobH) 05Open→03Resolved backup1006 has a hw failure and has been placed into failure in netbox. While this task is resolving, the hw failure task T286625 has been filed fo... 
[23:35:56] 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10RobH) [23:36:22] (03CR) 10Urbanecm: [C: 03+1] "patch syntax LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704440 (https://phabricator.wikimedia.org/T286624) (owner: 10Samtar) [23:36:38] tn: not a SRE, but it looks correct in terms of syntax [23:36:57] awesome, thank you urbanecm, as long as I'm not going to set fire to anything :) [23:37:17] uploading a patch rarely causes bugs. it's merging it that can :D [23:37:19] 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10RobH) a:03Cmjohnson Assigning this to Chris for him to pop this chassis open and investigate if everything is seated. He returns from hi... [23:38:02] > uploading a patch rarely causes bugs - I like a challenge >:) [23:38:20] hehe [23:53:45] (03PS1) 10Razzi: druid: Add option to roll restart druid workers [cookbooks] - 10https://gerrit.wikimedia.org/r/704443 (https://phabricator.wikimedia.org/T283067) [23:55:31] (03PS1) 10Tim Starling: Invalidate the conf cache when Defines.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704444 [23:56:17] (03PS2) 10Razzi: druid: Add option to roll restart test druid worker java processes [cookbooks] - 10https://gerrit.wikimedia.org/r/704443 (https://phabricator.wikimedia.org/T283067)