[00:00:00] (03PS1) 10Jdlrobson: Update scroll observer to allow event logging [skins/Vector] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743013 (https://phabricator.wikimedia.org/T292586) [00:00:04] RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T0000). [00:00:04] No Gerrit patches in the queue for this window AFAICS. [00:00:45] urbanecm: if you are around i'd love to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/743013 but no worries if not [00:00:49] same goes to RoanKattouw [00:00:50] sure [00:00:55] Okay I'll add to the calendar [00:00:58] thanks [00:01:14] (03CR) 10Urbanecm: [C: 03+2] Update scroll observer to allow event logging [skins/Vector] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743013 (https://phabricator.wikimedia.org/T292586) (owner: 10Jdlrobson) [00:05:11] {{done}} [00:06:53] (03CR) 10Cwhite: "This is the last automated step in provisioning OpenSearch." [puppet] - 10https://gerrit.wikimedia.org/r/743049 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [00:07:35] Jdlrobson: great. I'm waiting on CI [00:07:43] (if you have any config, i can do it in the meantime) [00:07:47] (03CR) 10Cwhite: "We'll want to clean up the old cron entry from beta-logs post-merge." [puppet] - 10https://gerrit.wikimedia.org/r/743047 (owner: 10Cwhite) [00:09:26] urbanecm: just this one. 
I could do some wmf config clean up though while I'm waiting :) [00:09:38] up2you :) [00:18:43] (03CR) 10Dzahn: "sorry, I know nothing about the filter syntax nor the context this is for" [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [00:19:41] (03Merged) 10jenkins-bot: Update scroll observer to allow event logging [skins/Vector] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743013 (https://phabricator.wikimedia.org/T292586) (owner: 10Jdlrobson) [00:21:02] Jdlrobson: available at mwdebug1001, can you test? [00:21:28] yep [00:21:36] just finished my config cleanup patch :) [00:22:00] great [00:22:26] (03CR) 10Dzahn: [C: 03+2] wmf-beta-update-databases.py: Print error in a better way [puppet] - 10https://gerrit.wikimedia.org/r/742519 (owner: 10Ahmon Dancy) [00:22:33] it works! [00:22:56] (03PS1) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 [00:23:18] great, syncing [00:23:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:41] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/skins/Vector/: a7586cd4a2559248ea1fd29cf74de535de016501: Update scroll observer to allow event logging (T292586) (duration: 00m 57s) [00:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:45] here you go [00:24:46] T292586: Sticky Header: Create schema to track returning to the top of the page - https://phabricator.wikimedia.org/T292586 [00:24:54] Jdlrobson: https://gerrit.wikimedia.org/r/743051 is the next one? [00:24:58] We can skip https://gerrit.wikimedia.org/r/743051 for another day. It's not super urgent. 
I'll use remaining time to check the data coming in [00:25:06] Unless you really like the look of it :) [00:25:53] well, the patch won't work :D [00:25:56] I'll leave some notes [00:26:00] definitely let's skip it for now [00:26:23] Sounds good :-) [00:29:45] (03CR) 10Urbanecm: [C: 04-1] "Will not work: would require addition to MWConfigCacheGenerator::$dbLists." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [00:29:52] ttyl Jdlrobson :) [00:29:53] * urbanecm off [00:33:22] (03PS1) 10Ebernhardson: rdf-streaming-updater: Provide the namespace to be updated [puppet] - 10https://gerrit.wikimedia.org/r/743052 [00:33:24] (03PS1) 10Ebernhardson: rdf-streaming-updater: Add configuration for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) [00:34:00] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: Provide the namespace to be updated [puppet] - 10https://gerrit.wikimedia.org/r/743052 (owner: 10Ebernhardson) [00:36:22] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:36:48] PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:44] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10Jclark-ctr) [00:47:22] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:47:22] RECOVERY - Check systemd state on deploy1002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:46] RECOVERY - Check systemd state on an-worker1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:52:16] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:52:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:38] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:46] PROBLEM - snapshot of s2 in eqiad on alert1001 is CRITICAL: snapshot for s2 at eqiad taken more than 3 days ago: Most recent backup 2021-11-29 00:23:07 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:58:18] PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:59:28] PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:59:33] (03PS1) 10Andrew Bogott: Convert cloudvirt1028 into a local storage hypervisor [puppet] - 
10https://gerrit.wikimedia.org/r/743055 (https://phabricator.wikimedia.org/T296790) [01:00:04] twentyafterfour: I, the Bot under the Fountain, call upon thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T0100). [01:00:40] !log T280001 About to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/742841 to bring `wcqs` into state `lvs_setup`, after which I'll perform a rolling restart of pybal [01:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:44] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [01:01:17] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: move back to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/742841 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [01:01:47] !log T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/742841 [01:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:53] !log T280001 `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'` [01:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:15] (After this puppet run completes, we may have some expected alerts pop up that will need to be acked) [01:06:58] RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:07:45] !log T280001 Restarting pybal on low-traffic backups: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'` [01:07:48] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [01:07:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:50] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [01:08:08] RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:32] !log T280001 Sanity check of `sudo ipvsadm -L -n` on backup `lvs2010` and `lvs1016` looks good (for ex `lvs1016` has `TCP 10.2.2.67:443 wrr`) [01:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:38] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 76 connections established with conf1004.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [01:08:52] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [01:09:04] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 68 connections established with conf2004.codfw.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [01:09:24] There's the expected alerts, acking [01:10:27] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) Ryan Kemper phabricator.wikimedia.org/T280001 https://wikitech.wikimedia.org/wiki/PyBal [01:10:27] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 76 connections established with conf1004.eqiad.wmnet:4001 (min=77) Ryan Kemper phabricator.wikimedia.org/T280001 https://wikitech.wikimedia.org/wiki/PyBal [01:10:27] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) Ryan Kemper phabricator.wikimedia.org/T280001 https://wikitech.wikimedia.org/wiki/PyBal [01:10:27] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 68 
connections established with conf2004.codfw.wmnet:4001 (min=69) Ryan Kemper phabricator.wikimedia.org/T280001 https://wikitech.wikimedia.org/wiki/PyBal [01:10:27] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) Ryan Kemper phabricator.wikimedia.org/T280001 https://wikitech.wikimedia.org/wiki/PyBal [01:11:30] !log T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts `lvs2009` and `lvs1015` [01:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:07] !log T280001 Restarting pybal on low-traffic primaries `lvs2009` and `lvs1015`: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'` [01:12:09] ryankemper: Failed to log message to wiki. Somebody should check the error logs. [01:12:17] !log T280001 Restarting pybal on low-traffic primaries `lvs2009` and `lvs1015`: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'` [01:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:00] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [01:14:50] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 77 connections established with conf1004.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [01:15:16] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 69 connections established with conf2004.codfw.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [01:16:45] !log T280001 Pooled `wcqs200[1-3]` (had been left unpooled from when we last removed wcqs from production) [01:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:50] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [01:19:32] 
RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [01:21:06] !log T280001 Rolling restart of low-traffic pybal hosts complete. All of `wcqs` is pooled and the pybal / ipvs related alerts have cleared [01:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:02] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:33:18] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:43:36] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:45:44] (03CR) 10Andrew Bogott: [C: 03+2] Convert cloudvirt1028 into a local storage hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/743055 (https://phabricator.wikimedia.org/T296790) (owner: 10Andrew Bogott) [01:52:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [01:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:57] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:59:01] RECOVERY - snapshot of s2 in eqiad on alert1001 is OK: Last snapshot for s2 at eqiad (db1102.eqiad.wmnet:3312) taken on 2021-12-01 23:47:50 (1053 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:01:55] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, 
args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:14:54] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster [02:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [02:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:16] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster [02:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [02:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:12] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:50:19] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [02:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:22] (03CR) 10Winston Sung: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [04:35:59] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) @herron So, the summary is they can have 2 accounts, one as volunteer and one as staff, that's ok. 
The volunteer account should be in 'nda' (you just fixed that, done) and the staff account shou... [04:42:20] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) a:05Daimona→03None [04:42:30] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) 05Open→03In progress [04:45:48] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) p:05Triage→03Medium @Daimona I'm gonna call it 'Medium' because I _hope_ there isn't actually much difference between what the 'nda' group and the 'wmf' group give you. But if you are curren... [04:49:46] (03CR) 10Reedy: [C: 04-1] "This is going to result in some potential non zero logspam on wikis that don't have CentralAuth installed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: 10Jsn.sherman) [05:05:41] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:17] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [05:48:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [06:10:31] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 
jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:03:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2010.codfw.wmnet with OS buster [08:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:13] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster [08:18:02] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) I have good and bad news :-) The good news is that the firmware update made the Buster installer work \o/. The bad news is that we have nine more servers of that... [08:29:31] !log restarting blazegraph on wdqs1007 (jvm stuck for 4h) [08:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:12] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:31:24] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:34:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2010.codfw.wmnet with OS buster [08:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:30] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster completed: - ganeti2010 (**PASS**) - Removed from Puppet... 
[08:44:05] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10geraki) Note: Pages whose title contain multiple semicolons (;) ar... [08:45:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [08:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [08:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:45] (03CR) 1020after4: [C: 03+1] "I <3 this change. 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 (owner: 10Ahmon Dancy) [08:53:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:54:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:22] (03PS1) 10Inductiveload: Wikisource: enable proofreading change-tagging for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) [09:16:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1096.eqiad.wmnet with reason: Maintenance T277354 [09:16:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1096.eqiad.wmnet with reason: Maintenance T277354 [09:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:28] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:16:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17960 and previous config saved to /var/cache/conftool/dbconfig/20211202-091629-marostegui.json [09:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17961 and previous config saved to /var/cache/conftool/dbconfig/20211202-091753-marostegui.json [09:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:52] (03PS2) 10Inductiveload: Wikisource: enable proofreading change-tagging for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) [09:27:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2010.codfw.wmnet to ganeti01.svc.codfw.wmnet [09:27:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:32] (03CR) 10Tpt: [C: 03+1] Wikisource: enable proofreading change-tagging for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [09:27:34] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2010.codfw.wmnet to ganeti01.svc.codfw.wmnet [09:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:13] 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Volans) I think this might be the same of T296856, and the host in need of a firmware upgrade. Adding DCOps. [09:32:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17962 and previous config saved to /var/cache/conftool/dbconfig/20211202-093257-marostegui.json [09:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:03] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:41:42] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743118 [09:42:27] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/743118 (owner: 10Muehlenhoff) [09:43:50] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:36] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743118 (owner: 10Muehlenhoff) [09:45:47] (03PS1) 10Giuseppe Lavagetto: mediawiki: run rsyslog as www-data [deployment-charts] - 10https://gerrit.wikimedia.org/r/743119 
[09:48:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17963 and previous config saved to /var/cache/conftool/dbconfig/20211202-094802-marostegui.json [09:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:08] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:49:38] (03PS2) 10Matthias Mullie: Explicitly disable references support on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737370 (https://phabricator.wikimedia.org/T230315) [09:49:46] (03CR) 10Matthias Mullie: [C: 03+2] Explicitly disable references support on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737370 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [09:50:31] (03Merged) 10jenkins-bot: Explicitly disable references support on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737370 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [09:52:33] !log draining primary/secondary instances off ganeti2009 T296622 [09:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:37] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [09:52:46] (03CR) 10Giuseppe Lavagetto: "The patch is correct; however I have a UX doubt: do we want to independently change these settings or we just need two setting groups, tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 (owner: 10Ahmon Dancy) [10:03:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17964 and previous config saved to /var/cache/conftool/dbconfig/20211202-100307-marostegui.json [10:03:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with 
reason: Maintenance T277354 [10:03:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance T277354 [10:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:12] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Maintenance T277354 [10:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Maintenance T277354 [10:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2075.codfw.wmnet with reason: Maintenance T277354 [10:15:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2075.codfw.wmnet with reason: Maintenance T277354 [10:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:21] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17966 and previous config saved to /var/cache/conftool/dbconfig/20211202-101522-marostegui.json [10:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:55] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17967 and previous config saved to /var/cache/conftool/dbconfig/20211202-101555-marostegui.json [10:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:28] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [10:20:56] (03PS1) 10Ladsgroup: auto_schema: Only comment in the ticket in first and last repool [software] - 10https://gerrit.wikimedia.org/r/743120 [10:24:02] PROBLEM - exim queue on mx2001 is CRITICAL: CRITICAL: 4013 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [10:29:11] lots of phabricator and otrs mails being delivered (but maybe that is the normal state) - queue size is going down [10:31:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17968 and previous config saved to /var/cache/conftool/dbconfig/20211202-103100-marostegui.json [10:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:09] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:32:58] (03CR) 10Marostegui: [C: 03+1] auto_schema: Only comment in the ticket in first and last repool [software] - 10https://gerrit.wikimedia.org/r/743120 (owner: 10Ladsgroup) [10:46:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17969 and previous config saved to /var/cache/conftool/dbconfig/20211202-104606-marostegui.json [10:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:11] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:51:31] (03PS1) 10Ladsgroup: auto_schema: Detect 
depooling [software] - 10https://gerrit.wikimedia.org/r/743125 [10:51:50] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Only comment in the ticket in first and last repool [software] - 10https://gerrit.wikimedia.org/r/743120 (owner: 10Ladsgroup) [10:52:24] (03Merged) 10jenkins-bot: auto_schema: Only comment in the ticket in first and last repool [software] - 10https://gerrit.wikimedia.org/r/743120 (owner: 10Ladsgroup) [11:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T1100). [11:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17970 and previous config saved to /var/cache/conftool/dbconfig/20211202-110110-marostegui.json [11:01:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2089.codfw.wmnet with reason: Maintenance T277354 [11:01:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2089.codfw.wmnet with reason: Maintenance T277354 [11:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2089:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17971 and previous config saved to /var/cache/conftool/dbconfig/20211202-110120-marostegui.json [11:01:20] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2089:3315 (T277354)', diff saved to 
https://phabricator.wikimedia.org/P17972 and previous config saved to /var/cache/conftool/dbconfig/20211202-110157-marostegui.json [11:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:32] (03PS2) 10Ladsgroup: auto_schema: Detect depooling [software] - 10https://gerrit.wikimedia.org/r/743125 [11:10:18] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:14:26] (03CR) 10Elukey: [C: 03+2] install_server: set test reuse recipe for kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/742969 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [11:14:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: run rsyslog as www-data [deployment-charts] - 10https://gerrit.wikimedia.org/r/743119 (owner: 10Giuseppe Lavagetto) [11:14:45] (03PS1) 10Elukey: admin: reserve kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743130 (https://phabricator.wikimedia.org/T296641) [11:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2089:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17973 and previous config saved to /var/cache/conftool/dbconfig/20211202-111702-marostegui.json [11:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:07] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:19:19] (03Merged) 10jenkins-bot: mediawiki: run rsyslog as www-data [deployment-charts] - 10https://gerrit.wikimedia.org/r/743119 (owner: 10Giuseppe Lavagetto) [11:19:32] (03PS1) 10Ladsgroup: auto_schema: Panic only when not in dry-mode [software] - 10https://gerrit.wikimedia.org/r/743133 [11:21:18] (03PS2) 10Ladsgroup: auto_schema: Panic only when not in dry-mode [software] - 10https://gerrit.wikimedia.org/r/743133 [11:21:35] !log draining primary/secondary instances 
off ganeti2022 T296622 [11:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:39] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [11:25:02] PROBLEM - ganeti-mond running on ganeti2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:25:40] PROBLEM - ganeti-confd running on ganeti2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:26:28] PROBLEM - ganeti-noded running on ganeti2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:28:22] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:27] <_joe_> jouncebot: next [11:28:28] In 0 hour(s) and 31 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T1200) [11:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2089:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17974 and previous config saved to /var/cache/conftool/dbconfig/20211202-113206-marostegui.json [11:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:12] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:38:25] (03PS1) 10Muehlenhoff: Add migration dates for o11y stretch systems [puppet] - 10https://gerrit.wikimedia.org/r/743138 [11:39:06] (03CR) 10Marostegui: [C: 03+1] auto_schema: Panic only when not in dry-mode [software] - 10https://gerrit.wikimedia.org/r/743133 (owner: 10Ladsgroup) [11:41:24] (03PS2) 10Muehlenhoff: Add migration dates for 
o11y stretch systems [puppet] - 10https://gerrit.wikimedia.org/r/743138 [11:43:58] (03PS1) 10Ladsgroup: auto_schema: Avoid reusing sql variable in replica for loop [software] - 10https://gerrit.wikimedia.org/r/743141 (https://phabricator.wikimedia.org/T288235) [11:46:18] (03CR) 10Ladsgroup: "The logs: https://www.irccloud.com/pastebin/Xybo4mcd/" [software] - 10https://gerrit.wikimedia.org/r/743141 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:47:10] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2089:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17975 and previous config saved to /var/cache/conftool/dbconfig/20211202-114711-marostegui.json [11:47:13] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:47:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance T277354 [11:47:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance T277354 [11:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:16] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance T277354 [11:47:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance T277354 [11:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T277354)', diff saved to https://phabricator.wikimedia.org/P17976 and previous config saved to /var/cache/conftool/dbconfig/20211202-114755-marostegui.json [11:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2111 (T277354)', diff saved to https://phabricator.wikimedia.org/P17977 and previous config saved to /var/cache/conftool/dbconfig/20211202-114833-marostegui.json [11:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:02] (03CR) 10Jbond: [C: 04-1] "see comment" [puppet] - 10https://gerrit.wikimedia.org/r/743130 
(https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [11:54:52] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:26] PROBLEM - exim queue on mx2001 is CRITICAL: CRITICAL: 4042 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [11:58:52] (03CR) 10Elukey: admin: reserve kafka uid/gid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743130 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [12:00:05] Amir1, Lucas_WMDE, and apergos: That opportune time is upon us again. Time for a UTC morning backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T1200). [12:00:05] inductiveload: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] a heads up to urbanecm and a reminder to Amir1: we have a trainee for today's backport window :-) there is one patch scheduled, a config patch. [12:00:13] o/ [12:00:24] (03PS2) 10Elukey: admin: reserve kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743130 (https://phabricator.wikimedia.org/T296641) [12:00:30] I'm around but can't be at the meeting :( [12:00:43] hey inductiveload, do you have deployment rights, i.e. will we be talking you through how to do this, or are you waiting to get them, and you will be observing? [12:00:55] ok Amir1, no worries [12:00:59] I could join a meeting for half an hour, then I need to make lunch ^^ [12:01:08] *finds the calendar event* [12:01:38] \o [12:01:39] Tpt: ping? [12:01:39] umm, not sure that's for me? 
[12:01:58] i'm not the trainee, I just would like a config change deployed [12:03:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2009.codfw.wmnet with OS buster [12:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:16] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster [12:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2111 (T277354)', diff saved to https://phabricator.wikimedia.org/P17978 and previous config saved to /var/cache/conftool/dbconfig/20211202-120338-marostegui.json [12:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:44] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:07:48] inductiveload: we have a trainee in the google meet [12:08:03] they will be here in the channel shortly [12:08:34] 💯 [12:09:22] We will be talking through the deployment procedure with the person being trained, but they will just be observing here in the channel [12:09:30] and following along in the documentation [12:10:14] Are you doing your own deploy or do you need one of us to do it for you? [12:10:36] i need someone to do it, please [12:10:40] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/743130 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [12:10:50] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:11:18] okay! [12:11:24] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) 05Resolved→03Open >>! 
In T296856#7542925, @MoritzMuehlenhoff wrote: > I have good and bad news :-) The good news is that the firmware update made the Buster... [12:11:49] 10SRE, 10ops-codfw: Installation issues on ganeti2009/2010 with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [12:11:56] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.21% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:12:17] apergos: i see you pinged me, can i be of any help? [12:12:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2009.codfw.wmnet with OS buster [12:12:22] can't join the meeting though [12:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:24] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:24] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster executed with errors: - ganeti2009 (**FAIL**) - Downtimed... 
[12:13:58] ok, I’ll deploy (reviewing the change now) [12:14:45] (03PS1) 10Elukey: Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [12:15:12] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [12:15:17] (03PS3) 10Lucas Werkmeister (WMDE): Wikisource: enable proofreading change-tagging for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [12:15:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Wikisource: enable proofreading change-tagging for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [12:15:59] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Tranve) [12:16:04] (03PS2) 10Elukey: Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [12:16:33] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Tranve) ESNI has been superseded by the ECH, hence updating the task. 
[12:16:48] (03Merged) 10jenkins-bot: Wikisource: enable proofreading change-tagging for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743116 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [12:16:56] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32780/console" [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [12:18:02] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2111 (T277354)', diff saved to https://phabricator.wikimedia.org/P17979 and previous config saved to /var/cache/conftool/dbconfig/20211202-121843-marostegui.json [12:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:49] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:19:00] inductiveload: the change is on mwdebug1001, can you test it? [12:19:56] that's working [12:20:23] alright, syncing then [12:21:23] (03PS1) 10Ladsgroup: auto_schema: Fix depool in for loops [software] - 10https://gerrit.wikimedia.org/r/743154 [12:21:49] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) >>! In T295993#7542800, @Dzahn wrote: > @Daimona I'm gonna call it 'Medium' because I _hope_ there isn't actually much difference between what the 'nda' group and the 'wmf' group give you. But... 
[12:22:31] (03PS1) 10Btullis: Re-apply spark.local.dir setting for stat servers [puppet] - 10https://gerrit.wikimedia.org/r/743155 (https://phabricator.wikimedia.org/T295346) [12:22:56] (03CR) 10Jbond: "LGTM, see nit" [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [12:23:29] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743116|Wikisource: enable proofreading change-tagging for all Wikisources (T289140)]] (duration: 00m 57s) [12:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:33] T289140: ProofreadPage: Enable change-tag status system on Wikisources - https://phabricator.wikimedia.org/T289140 [12:26:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/743130 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [12:26:14] that is now working at enWS and frWS at least, thank you [12:27:18] yay [12:27:49] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743118 [12:27:51] !log UTC morning backport+config window done [12:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:35] (03PS1) 10KartikMistry: Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743158 (https://phabricator.wikimedia.org/T285842) [12:31:37] !log installing NSS security updates [12:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2111 (T277354)', diff saved to https://phabricator.wikimedia.org/P17980 and previous config saved to /var/cache/conftool/dbconfig/20211202-123348-marostegui.json [12:33:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 
db2113.codfw.wmnet with reason: Maintenance T277354 [12:33:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2113.codfw.wmnet with reason: Maintenance T277354 [12:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:54] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2113 (T277354)', diff saved to https://phabricator.wikimedia.org/P17981 and previous config saved to /var/cache/conftool/dbconfig/20211202-123356-marostegui.json [12:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2113 (T277354)', diff saved to https://phabricator.wikimedia.org/P17982 and previous config saved to /var/cache/conftool/dbconfig/20211202-123435-marostegui.json [12:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:03] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ayounsi) From https://netbox.wikimedia.org/extras/reports/network.Network/ > ge-6/0/26... 
[12:49:00] 10SRE, 10ops-codfw: Installation issues on ganeti2009/2010 with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [12:49:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2113 (T277354)', diff saved to https://phabricator.wikimedia.org/P17983 and previous config saved to /var/cache/conftool/dbconfig/20211202-124940-marostegui.json [12:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:45] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:50:11] 10SRE, 10ops-codfw: Installation issues on ganeti2009/2010 with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) I prefer to just reuse/extend the task by adding the nodes in the description like I did for ganeti2009 and 2010 so we keep better track. Thanks [12:51:02] training complete, and of course the window is closed... see folks next time! [12:52:31] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) @ayounsi thank you [12:58:35] 10SRE, 10ops-codfw: Installation issues on ganeti2009/2010 with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) >>! In T296856#7543451, @Papaul wrote: > I prefer to just reuse/extend the task by adding the nodes in the description like I did for ganeti2009 and... [13:03:31] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10jcrespo) unsubing, as I think I was added to this ticket by mistake. This is traffic/traffic security expertise, and they have already triaged and are aware of the task. 
[13:04:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2113 (T277354)', diff saved to https://phabricator.wikimedia.org/P17985 and previous config saved to /var/cache/conftool/dbconfig/20211202-130444-marostegui.json [13:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:50] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:06:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM for role::prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/743138 (owner: 10Muehlenhoff) [13:13:38] 10ops-codfw: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Papaul) [13:19:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2113 (T277354)', diff saved to https://phabricator.wikimedia.org/P17986 and previous config saved to /var/cache/conftool/dbconfig/20211202-131949-marostegui.json [13:19:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2094,2128].codfw.wmnet with reason: Maintenance T277354 [13:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2094,2128].codfw.wmnet with reason: Maintenance T277354 [13:19:54] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T277354)', diff saved to https://phabricator.wikimedia.org/P17987 and previous config saved to /var/cache/conftool/dbconfig/20211202-131959-marostegui.json [13:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:34] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2128 (T277354)', diff saved to https://phabricator.wikimedia.org/P17988 and previous config saved to /var/cache/conftool/dbconfig/20211202-132034-marostegui.json [13:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:15] 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Marostegui) [13:30:45] 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Marostegui) p:05Triage→03Medium Roles: db2074 -> replica (sanitarium master) db2078 -> replica db2101 -> replica (backup source) db2130 -> replica dbproxy2004 -> m5 proxy (m5 in codfw isn't in use) @Papa... [13:31:30] (03PS1) 10Muehlenhoff: Add Cumin alias for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/743162 [13:35:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2128 (T277354)', diff saved to https://phabricator.wikimedia.org/P17989 and previous config saved to /var/cache/conftool/dbconfig/20211202-133538-marostegui.json [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:45] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:37:47] !log roll-restarting tilerator,tileratorui,kartotherian in codfw [13:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:08] 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10jcrespo) Sadly, I won't be around on the 7th. There is no issue regarding the move (backups should have finished by that time, ip changes should not affect backups), but either the date has to be moved, or som... 
[13:38:56] (03PS3) 10Elukey: Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [13:39:21] 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Marostegui) I can stop it, no issue. But @Kormat will need to bring it back up the following day (or wait till 9th for you). [13:39:23] (03CR) 10Elukey: [V: 03+1] Add profile::kafka::user to Kafka Brokers and Mirror Makers instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [13:41:16] (03PS1) 10Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [13:41:31] 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10jcrespo) > But @Kormat will need to bring it back up the following day (or wait till 9th for you). Both will work. For shutdown, the usual db procedure will work (minus the need for mw depool). [13:42:18] (03PS3) 10Muehlenhoff: Add migration dates for o11y stretch systems [puppet] - 10https://gerrit.wikimedia.org/r/743138 [13:43:08] (03CR) 10Muehlenhoff: Add migration dates for o11y stretch systems (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743138 (owner: 10Muehlenhoff) [13:46:10] (03PS4) 10Elukey: Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [13:46:12] (03PS2) 10Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [13:46:21] 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Kormat) >>! 
In T296930#7543594, @Marostegui wrote: > I can stop it, no issue. But @Kormat will need to bring it back up the following day (or wait till 9th for you). Can do. [13:46:52] (03CR) 10jerkins-bot: [V: 04-1] Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [13:47:02] 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Marostegui) \o/ Cool, so @Papaul let's go ahead as you've initially planned it. [13:49:07] (03CR) 10Filippo Giunchedi: [C: 03+1] Add Cumin alias for Grafana [puppet] - 10https://gerrit.wikimedia.org/r/743162 (owner: 10Muehlenhoff) [13:49:10] !log roll-restarting tilerator,tileratorui,kartotherian in eqiad [13:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2128 (T277354)', diff saved to https://phabricator.wikimedia.org/P17990 and previous config saved to /var/cache/conftool/dbconfig/20211202-135043-marostegui.json [13:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:48] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:51:26] (03PS5) 10Elukey: Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [13:51:28] (03PS3) 10Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [13:54:29] (03PS6) 10Elukey: Add profile::kafka::user to Kafka Brokers and Mirror Makers instances [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [13:54:31] (03PS4) 10Elukey: Move kafka test to fixed gid/uid for user kafka 
[puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [13:57:10] (03CR) 10Elukey: [C: 04-1] "My bad, I didn't see the kafka user declaration in confluent::kafka::common, weird, checking it now." [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [14:05:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2128 (T277354)', diff saved to https://phabricator.wikimedia.org/P17992 and previous config saved to /var/cache/conftool/dbconfig/20211202-140548-marostegui.json [14:05:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance T277354 [14:05:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance T277354 [14:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:55] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17993 and previous config saved to /var/cache/conftool/dbconfig/20211202-140557-marostegui.json [14:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2137:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17994 and previous config saved to /var/cache/conftool/dbconfig/20211202-140636-marostegui.json [14:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:05] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias 
for Grafana [puppet] - https://gerrit.wikimedia.org/r/743162 (owner: Muehlenhoff) [14:09:22] (PS7) Elukey: profile::kafka::broker: allow to specify kafka uid/gid [puppet] - https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [14:09:24] (PS5) Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [14:10:05] (CR) jerkins-bot: [V: -1] profile::kafka::broker: allow to specify kafka uid/gid [puppet] - https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: Elukey) [14:10:22] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32785/console" [puppet] - https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: Elukey) [14:15:41] (PS8) Elukey: profile::kafka::broker: allow to specify kafka uid/gid [puppet] - https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [14:15:43] (PS6) Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [14:16:40] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32786/console" [puppet] - https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: Elukey) [14:18:05] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32787/console" [puppet] - https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: Elukey) [14:18:26] jbond: sorry I had to redo everything from scratch :D --^ [14:21:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2137:3315 (T277354)', diff saved to 
https://phabricator.wikimedia.org/P17995 and previous config saved to /var/cache/conftool/dbconfig/20211202-142141-marostegui.json [14:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:30:03] (CR) Elukey: [C: +1] Re-apply spark.local.dir setting for stat servers [puppet] - https://gerrit.wikimedia.org/r/743155 (https://phabricator.wikimedia.org/T295346) (owner: Btullis) [14:34:59] (PS6) Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [14:36:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2137:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17996 and previous config saved to /var/cache/conftool/dbconfig/20211202-143646-marostegui.json [14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:51] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:40:57] SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (fkaelin) At first we would like to use the swift credentials from yarn containers, both from spark and skein based applications. This will mostly used for write operations using the [[ http... 
[14:45:36] SRE, DNS, Domains, Traffic-Icebox, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Tranve) Open→Declined [14:47:18] SRE, DNS, Domains, Traffic-Icebox, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Tranve) Closing this task due to the fact that Wikipedia has been blocked in all languages since Mar 2019, which leaves this task meaningless. [14:47:34] uhm [14:50:18] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff) [14:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2137:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17997 and previous config saved to /var/cache/conftool/dbconfig/20211202-145151-marostegui.json [14:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:57] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:52:08] SRE, DNS, Domains, Traffic-Icebox, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (ssingh) Declined→Open Please don't close this task pending further discussion. Thank you. [14:56:50] (CR) Ayounsi: [C: +1] "Goal and implementation lgtm, +1 once Riccardo's comments are addressed." 
[software/homer] - https://gerrit.wikimedia.org/r/742942 (owner: Cathal Mooney) [14:58:21] (CR) Marostegui: [C: +1] auto_schema: Fix depool in for loops [software] - https://gerrit.wikimedia.org/r/743154 (owner: Ladsgroup) [15:01:01] (CR) Ladsgroup: [C: +2] auto_schema: Detect depooling [software] - https://gerrit.wikimedia.org/r/743125 (owner: Ladsgroup) [15:01:06] (CR) Ladsgroup: [C: +2] auto_schema: Panic only when not in dry-mode [software] - https://gerrit.wikimedia.org/r/743133 (owner: Ladsgroup) [15:01:14] (CR) Ladsgroup: [C: +2] auto_schema: Avoid reusing sql variable in replica for loop [software] - https://gerrit.wikimedia.org/r/743141 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup) [15:01:18] (CR) Ladsgroup: [C: +2] auto_schema: Fix depool in for loops [software] - https://gerrit.wikimedia.org/r/743154 (owner: Ladsgroup) [15:01:40] (Merged) jenkins-bot: auto_schema: Detect depooling [software] - https://gerrit.wikimedia.org/r/743125 (owner: Ladsgroup) [15:01:42] (Merged) jenkins-bot: auto_schema: Panic only when not in dry-mode [software] - https://gerrit.wikimedia.org/r/743133 (owner: Ladsgroup) [15:01:47] (Merged) jenkins-bot: auto_schema: Avoid reusing sql variable in replica for loop [software] - https://gerrit.wikimedia.org/r/743141 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup) [15:01:50] (Merged) jenkins-bot: auto_schema: Fix depool in for loops [software] - https://gerrit.wikimedia.org/r/743154 (owner: Ladsgroup) [15:15:32] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/743118 (owner: Muehlenhoff) [15:15:47] SRE, ops-codfw, DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (Papaul) Thanks guys [15:22:15] (PS3) Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - https://gerrit.wikimedia.org/r/743118 [15:25:58] (CR) 
Muehlenhoff: [C: +2] sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - https://gerrit.wikimedia.org/r/743118 (owner: Muehlenhoff) [15:28:36] (PS1) Giuseppe Lavagetto: rsyslog: allow the spool directory to be world-writable [docker-images/production-images] - https://gerrit.wikimedia.org/r/743189 [15:38:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2022.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [15:38:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2022.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [15:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:35] (PS1) Muehlenhoff: Fix date [puppet] - https://gerrit.wikimedia.org/r/743200 [15:49:54] SRE, MediaWiki-Uploading: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (aborrero) Open→Resolved a:aborrero For the record, I was unable to upload the video using the commons web form: https://commons.wikimedia.org/wiki/File:Dactylopterus_Volitans.webm It too... [15:50:34] (CR) Muehlenhoff: [C: +2] Fix date [puppet] - https://gerrit.wikimedia.org/r/743200 (owner: Muehlenhoff) [15:51:26] SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (Ottomata) Hm, I haven't attempted to access swift using the S3 protocol. How does swift auth work there? I was about to render an env file you could use for the swift CLI or python client... 
[15:52:43] SRE-swift-storage, Data-Engineering, Data-Engineering-Kanban: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (Ottomata) [15:54:56] (PS1) Jbond: ldap: Add support for read/write operations [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) [15:54:58] (PS1) Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [15:56:54] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32789/console" [puppet] - https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [15:57:16] (PS2) Ebernhardson: rdf-streaming-updater: Provide the namespace to be updated [puppet] - https://gerrit.wikimedia.org/r/743052 [15:57:18] (PS2) Ebernhardson: rdf-streaming-updater: Add configuration for wcqs [puppet] - https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) [15:59:16] SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (MatthewVernon) For `S3`, you need three things - access key, secret key, endpoint. For thanos, these are: access key: the username secret key: the passphrase endpoint: https://thanos-swif... 
[16:05:03] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:07:30] (PS1) Filippo Giunchedi: profile: add exim4 blackhole configuration [puppet] - https://gerrit.wikimedia.org/r/743207 (https://phabricator.wikimedia.org/T296373) [16:09:32] (CR) jerkins-bot: [V: -1] profile: add exim4 blackhole configuration [puppet] - https://gerrit.wikimedia.org/r/743207 (https://phabricator.wikimedia.org/T296373) (owner: Filippo Giunchedi) [16:10:56] (CR) Filippo Giunchedi: [C: +1] "LGTM overall" [puppet] - https://gerrit.wikimedia.org/r/743049 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite) [16:11:18] (CR) Filippo Giunchedi: [C: +1] hiera: enable opensearch compatibility mode [puppet] - https://gerrit.wikimedia.org/r/743046 (owner: Cwhite) [16:11:55] (CR) Filippo Giunchedi: [C: +1] opensearch: use systemd timer for gc log rotation [puppet] - https://gerrit.wikimedia.org/r/743047 (owner: Cwhite) [16:12:31] (CR) Filippo Giunchedi: [C: +1] profile: allow logstash checker to query opensearch [puppet] - https://gerrit.wikimedia.org/r/743048 (owner: Cwhite) [16:12:34] (CR) Muehlenhoff: ldap: Add support for read/write operations (2 comments) [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:12:41] (PS5) Hnowlan: cassandra: load grants files upon change [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [16:13:12] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:13:25] (CR) Filippo Giunchedi: hiera: use site-local ldap for opensearch in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743045 (owner: Cwhite) [16:13:44] (CR) jerkins-bot: [V: -1] 
cassandra: load grants files upon change [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: Hnowlan) [16:14:15] (CR) Filippo Giunchedi: [C: +1] "ack, thanks" [puppet] - https://gerrit.wikimedia.org/r/743138 (owner: Muehlenhoff) [16:14:20] (PS2) Ahmon Dancy: modules/beta/files/wmf-beta-update-databases.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670922 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [16:15:17] (CR) Ahmon Dancy: [C: +1] "Rebased after https://gerrit.wikimedia.org/r/c/operations/puppet/+/742519" [puppet] - https://gerrit.wikimedia.org/r/670922 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [16:15:23] (CR) Filippo Giunchedi: [C: +1] "Can't say I 100% understand the nitty gritty details but LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/725045 (owner: Jbond) [16:15:25] (PS2) Jbond: ldap: Add support for read/write operations [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) [16:16:23] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32790/console" [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:16:28] (CR) LMata: [C: +2] Add migration dates for o11y stretch systems [puppet] - https://gerrit.wikimedia.org/r/743138 (owner: Muehlenhoff) [16:16:50] (PS6) Hnowlan: cassandra: load grants files upon change [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [16:18:23] (CR) LMata: [C: +1] "ack prioritized" [puppet] - https://gerrit.wikimedia.org/r/743138 (owner: Muehlenhoff) [16:18:45] (CR) jerkins-bot: [V: -1] cassandra: load grants files upon change [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 
Hnowlan) [16:19:07] (PS3) Ryan Kemper: rdf-streaming-updater: Provide the namespace to be updated [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:19:18] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:19:58] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32791/console" [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:20:22] (PS7) Hnowlan: cassandra: load grants files upon change [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [16:22:49] (PS2) Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [16:24:07] (PS3) Jbond: ldap: Add support for read/write operations [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) [16:25:01] (CR) Jbond: "updated thanks. 
Take note i have added a few more files since the first review as it took a little more to pass the write config through" [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:25:10] (PS3) Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [16:25:50] (PS4) Ryan Kemper: rdf-streaming-updater: Provide the namespace to be updated [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:25:59] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:29:57] (PS5) Ahmon Dancy: mediawiki 0.0.41: Define php.devel_mode [deployment-charts] - https://gerrit.wikimedia.org/r/742773 [16:31:44] (CR) Ahmon Dancy: mediawiki 0.0.41: Define php.devel_mode (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/742773 (owner: Ahmon Dancy) [16:35:37] (PS1) Ottomata: Deploy research_poc thanos swift auth env file to hadoop [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) [16:35:43] (CR) Muehlenhoff: ldap: Add support for read/write operations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:36:10] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32792/console" [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:37:28] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32793/console" [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: Ottomata) [16:37:51] (CR) David Caro: [C: +2] cli: add --fail-fast flag and behavior (1 comment) 
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro) [16:37:59] (CR) David Caro: [C: +2] tests: move to pytest [software/puppet-compiler] - https://gerrit.wikimedia.org/r/742112 (https://phabricator.wikimedia.org/T296481) (owner: David Caro) [16:38:43] (CR) Jbond: ldap: Add support for read/write operations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:39:46] (Merged) jenkins-bot: cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro) [16:39:48] (Merged) jenkins-bot: tests: move to pytest [software/puppet-compiler] - https://gerrit.wikimedia.org/r/742112 (https://phabricator.wikimedia.org/T296481) (owner: David Caro) [16:39:51] (PS2) Ottomata: Deploy research_poc thanos swift auth env file to hadoop [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) [16:40:13] (CR) Ryan Kemper: [V: +1] "here's the new ExecStart line:" [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:40:58] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32794/console" [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: Ottomata) [16:41:54] (CR) Ottomata: Deploy research_poc thanos swift auth env file to hadoop [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: Ottomata) [16:42:06] (CR) Ottomata: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/32794/an-master1001.eqiad.wmnet/fulldiff.html" [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: Ottomata) 
[16:43:07] (CR) Muehlenhoff: ldap: Add support for read/write operations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: Jbond) [16:44:06] (CR) Ryan Kemper: [V: +1 C: +2] rdf-streaming-updater: Provide the namespace to be updated [puppet] - https://gerrit.wikimedia.org/r/743052 (owner: Ebernhardson) [16:46:44] (CR) Herron: [C: +1] Add migration dates for o11y stretch systems [puppet] - https://gerrit.wikimedia.org/r/743138 (owner: Muehlenhoff) [16:52:33] (CR) Herron: [C: +1] profile: allow logstash checker to query opensearch [puppet] - https://gerrit.wikimedia.org/r/743048 (owner: Cwhite) [16:54:23] (CR) Herron: [C: +1] opensearch: use systemd timer for gc log rotation [puppet] - https://gerrit.wikimedia.org/r/743047 (owner: Cwhite) [16:54:58] (CR) Herron: [C: +1] hiera: enable opensearch compatibility mode [puppet] - https://gerrit.wikimedia.org/r/743046 (owner: Cwhite) [16:57:32] (CR) Herron: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/743049 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite) [17:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:31] puppet window done :) [17:02:18] (CR) DCausse: rdf-streaming-updater: Add configuration for wcqs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [17:04:00] (CR) DCausse: rdf-streaming-updater: Add configuration for wcqs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [17:07:03] SRE, ops-eqiad, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (Andrew) btw this host is currently out of service and invisible to icinga so it can be rebooted whenever. [17:07:18] (PS1) DCausse: [wdqs] cleanup blazegraph jvm options [puppet] - https://gerrit.wikimedia.org/r/743216 [17:08:09] (PS2) Jsn.sherman: Enable TheWikipediaLibrary on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) [17:08:11] (PS1) Andrew Bogott: wmcs: added admin script to produce text of annual purge wiki page [puppet] - https://gerrit.wikimedia.org/r/743217 [17:08:14] (CR) DCausse: [C: +1] rdf-streaming-updater: Add configuration for wcqs [puppet] - https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [17:08:28] (CR) Ebernhardson: rdf-streaming-updater: Add configuration for wcqs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [17:08:58] (CR) jerkins-bot: [V: -1] wmcs: added admin script to produce text of annual purge wiki page [puppet] - https://gerrit.wikimedia.org/r/743217 (owner: Andrew Bogott) [17:09:16] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/743216 (owner: DCausse) [17:10:19] (CR) Ryan Kemper: [C: +2] cirrussearch: fix grafana dashboard links [puppet] - 
https://gerrit.wikimedia.org/r/740708 (owner: Ryan Kemper) [17:10:24] (CR) Jsn.sherman: [C: -1] Enable TheWikipediaLibrary on all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: Jsn.sherman) [17:12:08] (PS2) Andrew Bogott: wmcs: added admin script to produce text of annual purge wiki page [puppet] - https://gerrit.wikimedia.org/r/743217 [17:15:03] (CR) Jsn.sherman: [C: -1] "I accidentally dropped the -1; keeping it here." [mediawiki-config] - https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: Jsn.sherman) [17:20:22] (CR) Andrew Bogott: [C: +2] wmcs: added admin script to produce text of annual purge wiki page [puppet] - https://gerrit.wikimedia.org/r/743217 (owner: Andrew Bogott) [17:24:06] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:24:16] (PS2) Ryan Kemper: wdqs: cleanup blazegraph jvm options [puppet] - https://gerrit.wikimedia.org/r/743216 (owner: DCausse) [17:25:52] (PS3) Ryan Kemper: wdqs: cleanup blazegraph jvm options [puppet] - https://gerrit.wikimedia.org/r/743216 (owner: DCausse) [17:28:39] (CR) Cwhite: hiera: use site-local ldap for opensearch in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/743045 (owner: Cwhite) [17:35:01] (CR) Ryan Kemper: [C: +2] wdqs: cleanup blazegraph jvm options [puppet] - https://gerrit.wikimedia.org/r/743216 (owner: DCausse) [17:38:33] !log [WDQS] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/743216/; as a result of the fix `'-Dwdqs.throttling-filter.time-bucket-capacity-in-seconds=240', '-Dwdqs.throttling-filter.time-bucket-refill-amount-in-seconds=120', 
'-Dwdqs.throttling-filter.ban-duration-in-minutes=60'` will now be in the `extra_jvm_opts` for `wdqs-internal` hosts [17:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:32] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:51:21] !log puppet disabled on cp3064 to manually increase number of maxconns in HAProxy - T296874 [17:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:26] T296874: HAProxy fails to reuse connections under some conditions - https://phabricator.wikimedia.org/T296874 [17:56:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [18:00:05] chrisalbon and accraze: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T1800). [18:03:17] (CR) Ryan Kemper: [C: +2] rdf-streaming-updater: Add configuration for wcqs [puppet] - https://gerrit.wikimedia.org/r/743053 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [18:04:34] SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (Ottomata) @MatthewVernon endpoint should just be the host URL, without the /auth/v1.0 path? [18:05:14] SRE, ops-eqiad, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (RobH) updated the firmware of the bios, both network cards, raid controller, and backplane. system booted back into the existing os. 
[18:05:22] SRE, Traffic: HAProxy fails to reuse connections under some conditions - https://phabricator.wikimedia.org/T296874 (Vgutierrez) As pointed out by Willy Tarreau on https://github.com/haproxy/haproxy/issues/1472#issuecomment-984745445 it's a matter of FD usage and increasing maxconn seems to solve it: {F34... [18:08:35] SRE, LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (herron) In progress→Resolved a:herron >>! In T295993#7542796, @Dzahn wrote: > @herron So, the summary is they can have 2 accounts, one as volunteer and one as staff, that's ok. The voluntee... [18:12:17] (PS1) Herron: admin: add user eleoni to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/743220 (https://phabricator.wikimedia.org/T296957) [18:12:39] (PS1) Vgutierrez: haproxy:tls_terminator: Increase maxconn to 200k [puppet] - https://gerrit.wikimedia.org/r/743221 (https://phabricator.wikimedia.org/T296874) [18:13:44] (PS3) Ottomata: Deploy research_poc thanos swift auth env file to hadoop [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) [18:14:19] !log Started Wikibase rebuildItemsPerSite on mwmaint1002 for wikidatawiki. Can be killed at any time, if necessary. 
[18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:17] (CR) jerkins-bot: [V: -1] Deploy research_poc thanos swift auth env file to hadoop [puppet] - https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: Ottomata) [18:18:43] (CR) Vgutierrez: [C: +2] haproxy:tls_terminator: Increase maxconn to 200k [puppet] - https://gerrit.wikimedia.org/r/743221 (https://phabricator.wikimedia.org/T296874) (owner: Vgutierrez) [18:19:29] !log re-enable puppet on cp3064 - T296874 [18:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:34] T296874: HAProxy fails to reuse connections under some conditions - https://phabricator.wikimedia.org/T296874 [18:19:44] (PS1) Ebernhardson: Provide a specific user agent when checking servers [debs/pybal] - https://gerrit.wikimedia.org/r/743222 [18:20:37] (CR) jerkins-bot: [V: -1] Provide a specific user agent when checking servers [debs/pybal] - https://gerrit.wikimedia.org/r/743222 (owner: Ebernhardson) [18:21:25] (PS1) Ryan Kemper: Switch WCQS to profile::base::linux510 [puppet] - https://gerrit.wikimedia.org/r/743223 (https://phabricator.wikimedia.org/T294961) [18:21:52] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/743223 (https://phabricator.wikimedia.org/T294961) (owner: Ryan Kemper) [18:22:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [18:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:39] SRE, Traffic, Patch-For-Review: HAProxy fails to reuse connections under some conditions - https://phabricator.wikimedia.org/T296874 (Vgutierrez) Open→Resolved p:Triage→Medium a:Vgutierrez [18:22:44] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (Vgutierrez) [18:27:02] (PS1) Jbond: C:puppet_compiler: add uploader class [puppet] - https://gerrit.wikimedia.org/r/743224 [18:27:37] (CR) jerkins-bot: [V: -1] C:puppet_compiler: add uploader class [puppet] - https://gerrit.wikimedia.org/r/743224 (owner: Jbond) [18:27:48] (PS2) Jbond: C:puppet_compiler: add uploader class [puppet] - https://gerrit.wikimedia.org/r/743224 [18:28:23] (CR) jerkins-bot: [V: -1] C:puppet_compiler: add uploader class [puppet] - https://gerrit.wikimedia.org/r/743224 (owner: Jbond) [18:29:02] (PS3) Jbond: C:puppet_compiler: add uploader class [puppet] - https://gerrit.wikimedia.org/r/743224 [18:29:37] (CR) jerkins-bot: [V: -1] C:puppet_compiler: add uploader class [puppet] - https://gerrit.wikimedia.org/r/743224 (owner: Jbond) [18:31:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [18:36:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [18:37:29] (PS2) Ebernhardson: Provide a specific user agent when checking servers [debs/pybal] - https://gerrit.wikimedia.org/r/743222 [18:40:03] (CR) Ebernhardson: "I think this will work, per the twisted docs, but not entirely sure how to test." 
[debs/pybal] - https://gerrit.wikimedia.org/r/743222 (owner: Ebernhardson) [18:40:36] (Abandoned) Jforrester: Drop the 'inactive' user rights grant, no longer around post-DisableAccount [mediawiki-config] - https://gerrit.wikimedia.org/r/462592 (https://phabricator.wikimedia.org/T158594) (owner: Jforrester) [18:41:18] (PS7) Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) [18:45:01] SRE, ops-eqiad, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (Andrew) Volans suggests that this might be related to cloudgw changes. @aborrero, @cmooney, @ayounsi would you expect dhcp to still work normally on cloudvirts? [18:45:41] !log uploaded scap 4.1.0 to apt.wm.o (T296867) [18:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:46] T296867: Deploy Scap version 4.1.0 - https://phabricator.wikimedia.org/T296867 [18:46:48] SRE, ops-eqiad, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (Volans) The host keep failing and from the logs on the install server the pxelinux is never offered. Looking at the DHCP logs I found them not following the usual offer... 
[18:46:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [18:50:11] (03PS1) 10Clare Ming: Update scroll instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743227 (https://phabricator.wikimedia.org/T294246) [18:53:41] (03CR) 10Nray: [C: 03+1] Update scroll instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743227 (https://phabricator.wikimedia.org/T294246) (owner: 10Clare Ming) [18:56:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/743223 (https://phabricator.wikimedia.org/T294961) (owner: 10Ryan Kemper) [18:56:37] !log upgraded scap to 4.1.0 on A:mw-canary, A:parsoid-canary, A:mw-jobrunner-canary (T296867) [18:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:41] T296867: Deploy Scap version 4.1.0 - https://phabricator.wikimedia.org/T296867 [18:56:44] (03CR) 10Ryan Kemper: [C: 03+2] Switch WCQS to profile::base::linux510 [puppet] - 10https://gerrit.wikimedia.org/r/743223 (https://phabricator.wikimedia.org/T294961) (owner: 10Ryan Kemper) [19:00:04] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T1900). [19:00:04] tgr, James_F, and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:09] * James_F waves. [19:00:11] hey! I can deploy today if someone more experienced is around (and you don't want to self-serve) [19:00:27] o/ [19:00:36] majavah: Happy to superintend, sure. [19:00:40] majavah: This'll be your first? 
[19:01:01] I did a few config patches yesterday, so not the first but pretty close [19:01:07] o/ [19:01:12] * urbanecm is around too [19:01:35] tgr: hi! do you want to self-serve? [19:01:59] majavah: if you prefer, works for me either way [19:02:21] majavah: note there's a wmf.9 backport by cjming, you might want to +2 that now to save CI waiting time [19:02:30] (just a pro tip :)) [19:02:31] tgr: in that case I'd prefer to do it, just to get the experience [19:02:36] (03PS5) 10Majavah: GrowthExperiments configuration fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza) [19:03:13] (03CR) 10Majavah: [C: 03+2] "deploying" [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743227 (https://phabricator.wikimedia.org/T294246) (owner: 10Clare Ming) [19:03:19] urbanecm: thanks, good idea [19:03:27] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza) [19:04:26] (03Merged) 10jenkins-bot: GrowthExperiments configuration fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza) [19:05:05] tgr: your patch is live on mwdebug1001, please test [19:05:55] (03PS4) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [19:06:13] (03Merged) 10jenkins-bot: Update scroll instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743227 (https://phabricator.wikimedia.org/T294246) (owner: 10Clare Ming) [19:06:18] * urbanecm looking too [19:07:02] wfm [19:07:03] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) @ayounsi Which junos version do you need?
[19:07:20] cool, waiting for tgr too before syncing [19:07:21] (03PS5) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [19:07:27] verified by `new mw.Api().saveOption('growthexperiments-homepage-variant', 'invalid')` and comparing `ge.utils.getUserVariant()` at mwdebug1001 and outside debug srv [19:08:15] majavah: sure, good idea :) [19:08:34] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10Cmjohnson) @jcrespo Can we schedule this for Friday or would Monday be better fo... [19:10:51] eh, testing A/B tested features is annoying [19:11:00] I'll just assume I got unlucky and it works [19:11:13] ok, syncing [19:11:19] thanks majavah! [19:12:00] tgr: AFAIK wgGEHomepageDefaultVariant is only used when the variant is no longer valid, which is impossible to do by acc creation [19:12:08] (that's why i was just tampering with the variant pref) [19:12:20] !log taavi@deploy1002 Synchronized wmf-config: Config: [[gerrit:739032|GrowthExperiments configuration fixes (T294737)]] (duration: 00m 57s) [19:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:26] T294737: Add an image: experiment - https://phabricator.wikimedia.org/T294737 [19:13:19] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [19:13:38] I'm guessing I won't need to sync the tests/ directory? [19:13:55] yup, just IS.php is enough [19:14:39] yeah, in theory none of this configuration is ever used. 
I just wanted to do a safety check on registration, but didn't manage to hit the 40% variant in five attempts. The patch is not risky, so I'm not worried about it. [19:15:07] yup yup [19:16:03] * majavah patiently waits for Zuul to merge the patch [19:16:57] (03PS8) 10Majavah: Drop old config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [19:17:10] (after calculating the chances, slightly worried about it :) [19:17:11] (03CR) 10Majavah: [C: 03+2] Drop old config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [19:17:18] let's see if that helps [19:17:35] most of the time it helps [19:17:44] (I'm doing it automatically when merging in this repo, tbh) [19:19:06] (03Merged) 10jenkins-bot: Drop old config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [19:19:11] finally [19:19:41] James_F: can you test on mwdebug1001 please? [19:20:30] majavah: Sure. [19:20:54] majavah: Well, the site's still up, so go for it. [19:21:26] thanks majavah and James_F for changing the wording in this feature, btw [19:21:33] majavah: Yeah, never sync more than you 'need'; production hosts will get to eventual consistency when the next `scap sync-world` happens, but test directories etc. don't matter. [19:22:04] urbanecm: Happy to make the little improvements. Credit to R.eedy and others for driving us on this process.
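[editor's note] The "calculating the chances" remark at 19:17:10 can be reproduced with a short sketch. It assumes each registration attempt lands in the variant independently with probability 0.4; the independence assumption and the function name `miss_probability` are illustrative and do not come from the log.

```python
def miss_probability(p_hit: float, attempts: int) -> float:
    """Chance of never landing in a variant across independent attempts.

    p_hit    -- assumed per-attempt probability of getting the variant
    attempts -- number of independent tries (hypothetical parameter name)
    """
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Each miss has probability (1 - p_hit); independent misses multiply.
    return (1.0 - p_hit) ** attempts


if __name__ == "__main__":
    # 40% variant, five registration attempts, as described in the log.
    print(f"{miss_probability(0.4, 5):.4f}")
```

Under that assumption, five straight misses happen roughly 7.8% of the time (0.6^5 = 0.07776): uncommon, but not strong evidence that variant assignment is broken.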
[19:22:14] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:720363|Drop old config names for CentralAuth denylist controls (T277932)]] (duration: 00m 56s) [19:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:18] T277932: Address Voice and Tone issues in CentralAuth - https://phabricator.wikimedia.org/T277932 [19:22:28] thanks to them too :) [19:23:36] cjming: your patch is live on mwdebug1001, can you test please? [19:24:50] majavah: lgtm [19:24:55] thanks, syncing [19:25:02] (03CR) 10Dzahn: "I noticed, in corp LDAP, that they are listed as contractor, not FTE. That means we need to ask for expiry_date and expiry_contact and add" [puppet] - 10https://gerrit.wikimedia.org/r/743220 (https://phabricator.wikimedia.org/T296957) (owner: 10Herron) [19:26:10] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikimediaEvents/modules/ext.wikimediaEvents/webUIScroll.js: Backport: [[gerrit:743227|Update scroll instrument (T294246)]] (duration: 00m 56s) [19:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:14] T294246: Sticky header: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 [19:26:36] majavah: thank you \o/ [19:26:46] thanks all [19:26:54] !log UTC evening deploys done [19:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:10] thanks majavah for the deployment! 
[19:27:51] (03CR) 10Dzahn: [C: 03+2] modules/beta/files/wmf-beta-update-databases.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670922 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:28:29] (03Abandoned) 10Ebernhardson: Perform weekly dumps of all public media urls [puppet] - 10https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520) (owner: 10Ebernhardson) [19:29:20] !log upgrading wikitech-static deb packages as well as moving to mediawiki 1.37.0 [19:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:38] (03PS6) 10Legoktm: mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [19:35:03] !log installing yaml PHP extension on canaries [19:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:49] (03CR) 10Legoktm: [C: 03+2] mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [19:36:18] mutante: I'm going to puppet merge your beta Python 3 change [19:37:02] legoktm: sorry, got sidetracked by another chat window, thanks, please do [19:38:11] no worries, done :) [19:42:53] (03PS1) 10Herron: admin: add aminalhazwani to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743246 (https://phabricator.wikimedia.org/T296816) [19:48:36] is there a reason mw1414 is depooled? 
[19:51:19] (03PS6) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [19:52:39] (03PS1) 10Ssingh: test_dns: add a DoH check against all doh* hosts [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/743249 [19:53:19] (03CR) 10Andrew Bogott: [C: 03+1] striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [19:53:44] (03CR) 10Ssingh: [C: 03+2] test_dns: add a DoH check against all doh* hosts [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/743249 (owner: 10Ssingh) [19:53:50] seems like it was depooled around the 24th [19:54:51] ah, it was j.oe [20:05:22] !log re-pooling mw1414 following testing [20:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:14] (03PS7) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:17:51] (03PS8) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:18:05] (03CR) 10Herron: [C: 03+2] striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [20:22:43] (03PS1) 10JHathaway: icinga: authorize myself, jhathaway, to run commands [puppet] - 10https://gerrit.wikimedia.org/r/743254 [20:49:06] (03PS1) 10Herron: striker: send logs to logstash pipeline via local rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/743257 (https://phabricator.wikimedia.org/T151422) [20:50:04] (03PS2) 10Herron: striker: send logs to logstash pipeline via local rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/743257 (https://phabricator.wikimedia.org/T151422) [20:51:00] (03CR) 10Herron: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/743257 (https://phabricator.wikimedia.org/T151422) (owner: 10Herron) [21:14:18] (03Abandoned) 10Herron: striker: send logs to logstash pipeline via local rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/743257 (https://phabricator.wikimedia.org/T151422) (owner: 10Herron) [21:15:20] (03PS1) 10Herron: Revert "striker: send dev logs to logstash pipeline via localhost" [puppet] - 10https://gerrit.wikimedia.org/r/743174 [21:16:29] (03CR) 10Herron: [C: 03+2] Revert "striker: send dev logs to logstash pipeline via localhost" [puppet] - 10https://gerrit.wikimedia.org/r/743174 (owner: 10Herron) [21:22:28] (03PS1) 10Herron: striker: switch cloudweb dev to cee logging handler [puppet] - 10https://gerrit.wikimedia.org/r/743261 (https://phabricator.wikimedia.org/T151422) [21:23:12] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/743261 (https://phabricator.wikimedia.org/T151422) (owner: 10Herron) [21:26:32] 10SRE, 10ops-codfw: Installation issues on ganeti2009/2010 with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [21:32:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10wiki_willy) Hi @akosiaris or @Joe, @Dzahn, @jijiki - just following up to see if we could get the racking task info filled out for this in the task description. Much... 
[21:35:01] (03CR) 10Herron: [C: 03+2] "please see rationale in https://phabricator.wikimedia.org/T151422#7544661" [puppet] - 10https://gerrit.wikimedia.org/r/743261 (https://phabricator.wikimedia.org/T151422) (owner: 10Herron) [21:44:31] (03CR) 10Cwhite: [C: 03+2] hiera: enable opensearch compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/743046 (owner: 10Cwhite) [21:44:52] (03CR) 10Cwhite: [C: 03+2] hiera: use site-local ldap for opensearch in codfw [puppet] - 10https://gerrit.wikimedia.org/r/743045 (owner: 10Cwhite) [21:46:11] (03CR) 10Cwhite: [C: 03+2] opensearch: use systemd timer for gc log rotation [puppet] - 10https://gerrit.wikimedia.org/r/743047 (owner: 10Cwhite) [21:46:36] (03CR) 10Cwhite: [C: 03+2] profile: allow logstash checker to query opensearch [puppet] - 10https://gerrit.wikimedia.org/r/743048 (owner: 10Cwhite) [21:46:44] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:47:14] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:47:35] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:47:46] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:48:19] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:48:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10wiki_willy) [21:48:46] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:48:55] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for 
expansion - https://phabricator.wikimedia.org/T290899 (10wiki_willy) [21:48:57] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:49:17] 10SRE, 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10wiki_willy) [21:49:19] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:49:41] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10wiki_willy) [21:49:44] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:50:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10wiki_willy) [21:50:09] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:50:23] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:52:17] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [21:52:50] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:36] 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [22:13:35] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eleoni - https://phabricator.wikimedia.org/T296957 (10Dzahn) Hey @Daimona sorry for this being complicated, but.. 
could you clarify whether you are a contractor or a full-time employee? [22:26:28] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:28:51] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) one case of mw2252.mgmt right now [22:33:36] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) @AntiCompositeNumber Not for sure but it seems likely that it could also b... [22:36:50] ACKNOWLEDGEMENT - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:56] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:27:32] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:36:11] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eleoni - https://phabricator.wikimedia.org/T296957 (10Daimona) I'm currently a full-time employee. [23:44:12] (03PS1) 10Gergő Tisza: Avoid references to TemplateCollectionFeature [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743178