[00:00:04] brennen: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211203T0000). [00:00:05] tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:01:14] will do [00:05:21] (03CR) 10Gergő Tisza: [C: 03+2] Avoid references to TemplateCollectionFeature [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743178 (owner: 10Gergő Tisza) [00:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:01] (03PS1) 10Papaul: Add new restbase202[456] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/743275 (https://phabricator.wikimedia.org/T294377) [00:21:12] (03CR) 10Papaul: [C: 03+2] Add new restbase202[456] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/743275 (https://phabricator.wikimedia.org/T294377) (owner: 10Papaul) [00:24:05] (03PS1) 10Papaul: Add new restbsse202[456] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/743276 (https://phabricator.wikimedia.org/T294377) [00:27:30] (03CR) 10Papaul: [C: 03+2] Add new restbsse202[456] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/743276 (https://phabricator.wikimedia.org/T294377) (owner: 10Papaul) [00:28:12] (03Merged) 10jenkins-bot: Avoid references to TemplateCollectionFeature [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743178 (owner: 10Gergő Tisza) [00:30:17] (03CR) 10Gergő Tisza: [C: 03+2] Add an image: Add test version of GEInfoboxTemplates [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743177 (https://phabricator.wikimedia.org/T291232) (owner: 10Gergő Tisza) 
[00:33:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2024.codfw.wmnet with OS buster [00:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:44] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, and 2 others: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2024.codfw.wmnet with OS buster [00:36:15] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/Config/Validation/GrowthConfigValidation.php: Backport: [[gerrit:743178|Avoid references to TemplateCollectionFeature]] step 1 (duration: 00m 56s) [00:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:45] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes: Backport: [[gerrit:743178|Avoid references to TemplateCollectionFeature]] step2 (duration: 00m 56s) [00:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:38] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/python3-imagecatalog/imagecatalog_0.0.1-1_amd64.changes [00:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:47] (03CR) 10Razzi: [C: 03+1] "LGTM, thanks for keeping an eye on these small but important details @elukey" [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [00:50:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:50:46] (03CR) 10Razzi: "Hmm am I missing something? Looks to me like the uid and gid of kafka are 499. 
Is the idea to make 916 the new uid / gid, and if so, how'd" [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [00:52:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:52:54] (03Merged) 10jenkins-bot: Add an image: Add test version of GEInfoboxTemplates [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/743177 (https://phabricator.wikimedia.org/T291232) (owner: 10Gergő Tisza) [01:00:57] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:743177|Add an image: Add test version of GEInfoboxTemplates (T291232)]] (duration: 00m 57s) [01:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:02] T291232: Add an image: exclude certain articles - https://phabricator.wikimedia.org/T291232 [01:01:41] !log UTC late deploys done [01:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2024.codfw.wmnet with OS buster [01:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:35] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, and 2 others: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2024.codfw.wmnet with OS buster comp... 
[01:06:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS buster [01:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:55] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, and 2 others: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2025.codfw.wmnet with OS buster [01:19:52] (03CR) 10RLazarus: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/742574 (owner: 10RLazarus) [01:37:59] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:39:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with OS buster [01:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:09] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 fo... [01:42:21] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:55] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) [02:27:11] (03CR) 10Legoktm: [C: 03+2] "Thanks! Let's give this a try..." 
[puppet] - 10https://gerrit.wikimedia.org/r/742236 (https://phabricator.wikimedia.org/T293055) (owner: 10Majavah) [02:31:31] (03PS1) 10Legoktm: extdist: Only enable labs::lvm::srv on pre-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/743281 (https://phabricator.wikimedia.org/T293055) [02:31:51] (03CR) 104nn1l2: [C: 03+1] "Let's deploy this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [02:37:51] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32796/console" [puppet] - 10https://gerrit.wikimedia.org/r/743281 (https://phabricator.wikimedia.org/T293055) (owner: 10Legoktm) [02:38:24] (03CR) 10Legoktm: [V: 03+1 C: 03+2] extdist: Only enable labs::lvm::srv on pre-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/743281 (https://phabricator.wikimedia.org/T293055) (owner: 10Legoktm) [02:46:45] (03PS1) 10Legoktm: extdist: Fix conditional to use lt, not le [puppet] - 10https://gerrit.wikimedia.org/r/743282 [02:47:08] (03CR) 10Legoktm: [V: 03+2 C: 03+2] extdist: Fix conditional to use lt, not le [puppet] - 10https://gerrit.wikimedia.org/r/743282 (owner: 10Legoktm) [03:59:52] RECOVERY - IPMI Sensor Status on htmldumper1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [04:18:35] (03CR) 10Dzahn: [C: 03+1] admin: add user eleoni to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743220 (https://phabricator.wikimedia.org/T296957) (owner: 10Herron) [04:19:32] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eleoni - https://phabricator.wikimedia.org/T296957 (10Dzahn) Ok, thanks! based on that the patch is +1 , Herron [05:16:57] (03CR) 10Razzi: "Late to the party here, but I wonder if we could add some sort of linter to catch invalid configuration like this. 
It's a bit tricky becau" [puppet] - 10https://gerrit.wikimedia.org/r/737932 (https://phabricator.wikimedia.org/T295312) (owner: 10Btullis) [05:29:18] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1113.eqiad.wmnet with reason: Maintenance T277354 [05:30:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1113.eqiad.wmnet with reason: Maintenance T277354 [05:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:32] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [05:30:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P18000 and previous config saved to /var/cache/conftool/dbconfig/20211203-053032-marostegui.json [05:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P18001 and previous config saved to /var/cache/conftool/dbconfig/20211203-053457-marostegui.json [05:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:48:06] (03PS1) 10Marostegui: install_server: Reimage db1125 deleting /srv [puppet] - 10https://gerrit.wikimedia.org/r/743285 (https://phabricator.wikimedia.org/T295965) [05:49:27] (03CR) 10Marostegui: [C: 
03+2] install_server: Reimage db1125 deleting /srv [puppet] - 10https://gerrit.wikimedia.org/r/743285 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [05:50:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18002 and previous config saved to /var/cache/conftool/dbconfig/20211203-055001-marostegui.json [05:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:47] (03PS2) 10KartikMistry: Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743158 (https://phabricator.wikimedia.org/T285842) [06:02:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [06:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18003 and previous config saved to /var/cache/conftool/dbconfig/20211203-060506-marostegui.json [06:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P18004 and previous config saved to /var/cache/conftool/dbconfig/20211203-062011-marostegui.json [06:20:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance T277354 [06:20:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance T277354 [06:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:17] T277354: "chemical" major mime type was never added to production database - 
https://phabricator.wikimedia.org/T277354 [06:20:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P18005 and previous config saved to /var/cache/conftool/dbconfig/20211203-062019-marostegui.json [06:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bullseye [06:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:17] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:44:48] (03PS1) 10Marostegui: site.pp: Add testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/743288 (https://phabricator.wikimedia.org/T295965) [06:45:30] (03CR) 10Marostegui: [C: 03+2] site.pp: Add testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/743288 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [06:48:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P18006 and previous config saved to /var/cache/conftool/dbconfig/20211203-064850-marostegui.json [06:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:55] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [06:53:11] (03CR) 10Elukey: [V: 03+1] Move kafka test to fixed gid/uid for user kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [06:53:21] (03CR) 10Elukey: [C: 
03+2] admin: reserve kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743130 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [06:55:10] (03CR) 10Elukey: [V: 03+1] "To make things simpler to reason about, I'll change the hiera parameter to boolean, and hardcode the 916 value + comments in the profile." [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:00:09] (03PS9) 10Elukey: profile::kafka::broker: allow to specify kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [07:00:11] (03PS7) 10Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [07:00:45] (03CR) 10jerkins-bot: [V: 04-1] profile::kafka::broker: allow to specify kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:02:40] (03PS10) 10Elukey: profile::kafka::broker: allow to specify kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) [07:02:42] (03PS8) 10Elukey: Move kafka test to fixed gid/uid for user kafka [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) [07:03:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32798/console" [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18007 and previous config saved to /var/cache/conftool/dbconfig/20211203-070355-marostegui.json [07:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 
3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32799/console" [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:06:06] (03CR) 10Elukey: Move kafka test to fixed gid/uid for user kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18008 and previous config saved to /var/cache/conftool/dbconfig/20211203-071900-marostegui.json [07:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:49] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::broker: allow to specify kafka uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743150 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:29:41] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move kafka test to fixed gid/uid for user kafka [puppet] - 10https://gerrit.wikimedia.org/r/743163 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [07:34:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P18009 and previous config saved to /var/cache/conftool/dbconfig/20211203-073404-marostegui.json [07:34:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance T277354 [07:34:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance T277354 [07:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:10] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [07:34:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:20] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:39:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1110.eqiad.wmnet with reason: Maintenance T277354 [07:39:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1110.eqiad.wmnet with reason: Maintenance T277354 [07:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T277354)', diff saved to https://phabricator.wikimedia.org/P18010 and previous config saved to /var/cache/conftool/dbconfig/20211203-073910-marostegui.json [07:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:11] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [07:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1110 (T277354)', diff saved to https://phabricator.wikimedia.org/P18011 and previous config saved to /var/cache/conftool/dbconfig/20211203-074334-marostegui.json [07:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:52] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:40] (03PS1) 10Marostegui: host.py: Slightly change a message [software] - 10https://gerrit.wikimedia.org/r/743347 (https://phabricator.wikimedia.org/T288235) [07:51:40] hi, is there an incident report anywhere for 
2021-11-29 outage? [07:52:16] mykhal: nothing public to my knowledge [07:54:07] hm, sh*t happens... (?) [07:54:43] It's just mostly stuff that can't be said at the moment [07:54:57] Whether it ever will be, I don't know [07:56:12] If there's something specific you want to know, it would be best to ask that [07:56:14] I've seen some "disk full" log items later, is it related? or was it some attack? [07:56:37] (03CR) 10Ladsgroup: [C: 03+2] host.py: Slightly change a message [software] - 10https://gerrit.wikimedia.org/r/743347 (https://phabricator.wikimedia.org/T288235) (owner: 10Marostegui) [07:57:10] (03Merged) 10jenkins-bot: host.py: Slightly change a message [software] - 10https://gerrit.wikimedia.org/r/743347 (https://phabricator.wikimedia.org/T288235) (owner: 10Marostegui) [07:57:27] because... the outage seemed very notable to me [07:57:51] mykhal: It is not public at the moment [07:58:18] ok, thanks [07:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18012 and previous config saved to /var/cache/conftool/dbconfig/20211203-075839-marostegui.json [07:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:09] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Adding some Data-Engineering considerations, with the assumption that the new `sflow` stream would be comparable to... [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211203T0800) [08:00:24] marostegui: disk full on what? 
[08:00:29] But I'd imagine no [08:01:01] Again, the causes aren't public [08:10:22] (03PS4) 10Muehlenhoff: Add migration dates for o11y stretch systems [puppet] - 10https://gerrit.wikimedia.org/r/743138 [08:13:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18013 and previous config saved to /var/cache/conftool/dbconfig/20211203-081343-marostegui.json [08:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:16] (03CR) 10Muehlenhoff: [C: 03+2] Add migration dates for o11y stretch systems [puppet] - 10https://gerrit.wikimedia.org/r/743138 (owner: 10Muehlenhoff) [08:15:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743220 (https://phabricator.wikimedia.org/T296957) (owner: 10Herron) [08:17:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743246 (https://phabricator.wikimedia.org/T296816) (owner: 10Herron) [08:23:01] (03CR) 10Muehlenhoff: ldap: Add support for read/write operations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [08:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1110 (T277354)', diff saved to https://phabricator.wikimedia.org/P18014 and previous config saved to /var/cache/conftool/dbconfig/20211203-082848-marostegui.json [08:28:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: Maintenance T277354 [08:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:54] T277354: "chemical" major mime type was never added to production database - 
https://phabricator.wikimedia.org/T277354 [08:28:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: Maintenance T277354 [08:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T277354)', diff saved to https://phabricator.wikimedia.org/P18015 and previous config saved to /var/cache/conftool/dbconfig/20211203-082859-marostegui.json [08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2009.codfw.wmnet with OS buster [08:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:14] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster [08:30:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1161 (T277354)', diff saved to https://phabricator.wikimedia.org/P18016 and previous config saved to /var/cache/conftool/dbconfig/20211203-083023-marostegui.json [08:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:07] (03PS1) 10Elukey: role::kafka::main: use fixed uid/gid in the codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/743351 (https://phabricator.wikimedia.org/T296982) [08:32:26] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32800/console" [puppet] - 10https://gerrit.wikimedia.org/r/743351 (https://phabricator.wikimedia.org/T296982) (owner: 10Elukey) [08:32:39] 
(03CR) 10Giuseppe Lavagetto: [C: 03+2] "Excellent thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 (owner: 10Ahmon Dancy) [08:36:09] (03Merged) 10jenkins-bot: mediawiki 0.0.41: Define php.devel_mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 (owner: 10Ahmon Dancy) [08:43:14] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:25] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18017 and previous config saved to /var/cache/conftool/dbconfig/20211203-084528-marostegui.json [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:44] (03PS1) 10Filippo Giunchedi: prometheus: change blackbox jobs to use / as separator [puppet] - 10https://gerrit.wikimedia.org/r/743352 [08:58:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2009.codfw.wmnet with OS buster [08:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:00] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster completed: - ganeti2009 (**PASS**) - Removed from Puppet... 
[09:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18018 and previous config saved to /var/cache/conftool/dbconfig/20211203-090033-marostegui.json [09:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:31] (03PS1) 10Muehlenhoff: profile::openldap::client: Write out a separate /etc/ldap/wmf-ldap.conf for tools [puppet] - 10https://gerrit.wikimedia.org/r/743353 [09:08:08] (03CR) 10jerkins-bot: [V: 04-1] profile::openldap::client: Write out a separate /etc/ldap/wmf-ldap.conf for tools [puppet] - 10https://gerrit.wikimedia.org/r/743353 (owner: 10Muehlenhoff) [09:09:15] (03PS2) 10Muehlenhoff: Write out a separate /etc/ldap/wmf-ldap.conf for tools [puppet] - 10https://gerrit.wikimedia.org/r/743353 [09:13:20] (03PS1) 10Matthias Mullie: Enable references support on beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743354 (https://phabricator.wikimedia.org/T230315) [09:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1161 (T277354)', diff saved to https://phabricator.wikimedia.org/P18019 and previous config saved to /var/cache/conftool/dbconfig/20211203-091537-marostegui.json [09:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:43] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:18:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [09:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [09:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2009.codfw.wmnet to ganeti01.svc.codfw.wmnet 
[09:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2009.codfw.wmnet to ganeti01.svc.codfw.wmnet [09:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:15] (03CR) 10Matthias Mullie: [C: 03+2] Enable references support on beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743354 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [09:26:56] (03Merged) 10jenkins-bot: Enable references support on beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743354 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [09:38:10] !log draining primary/secondary instances off ganeti2011 T296622 [09:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:15] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [09:38:30] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 [09:48:16] (03PS1) 10Filippo Giunchedi: wmflib: add 'probes' to service::catalog type [puppet] - 10https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946) [09:48:20] (03PS1) 10Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) [09:53:46] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [09:59:17] (03PS2) 10Filippo Giunchedi: profile: add exim4 blackhole configuration [puppet] - 10https://gerrit.wikimedia.org/r/743207 (https://phabricator.wikimedia.org/T296373) [09:59:28] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32801/console" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:00:09] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:01:07] (03PS2) 10Filippo Giunchedi: wmflib: add 'probes' to service::catalog type [puppet] - 10https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946) [10:01:09] (03PS2) 10Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) [10:01:11] (03PS2) 10Filippo Giunchedi: prometheus: change blackbox jobs to use / as separator [puppet] - 10https://gerrit.wikimedia.org/r/743352 [10:03:21] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32802/console" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:16:30] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10jcrespo) 05Resolved→03Open ` from: SYSTEMDTIMER to: root@cumin2001.codfw.wmnet Output of systemd t... 
[10:17:06] (03CR) 10David Caro: C:puppet_compiler: add uploader class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743224 (owner: 10Jbond) [10:35:49] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) [10:41:59] (03PS3) 10Jcrespo: mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) [10:46:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10jcrespo) @Cmjohnson Today would be preferred, as Monday I will be off, and won't... [10:53:13] (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [10:55:51] (03PS4) 10Jcrespo: exim: Reenable regular flash sale offers on wmf address [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T132324) [10:55:53] (03PS2) 10Jcrespo: mariadb: Replace deprecated wmflib require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724943 [10:56:14] (03Abandoned) 10Jcrespo: mariadb: Replace deprecated wmflib require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724943 (owner: 10Jcrespo) [10:58:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo) [10:59:18] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo) [11:01:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - 
https://phabricator.wikimedia.org/T272559 (10jcrespo) [11:01:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [11:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [11:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:42] (03PS3) 10Jbond: Write out a separate /etc/ldap/wmf-ldap.conf for tools [puppet] - 10https://gerrit.wikimedia.org/r/743353 (owner: 10Muehlenhoff) [11:06:44] !log stop and shutdown db1102 T296546 [11:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:49] T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 [11:08:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS buster [11:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:19] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster [11:08:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743353 (owner: 10Muehlenhoff) [11:08:59] (03Abandoned) 10Jbond: ldap: Add support for read/write operations [puppet] - 10https://gerrit.wikimedia.org/r/743204 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [11:09:34] (03CR) 10Muehlenhoff: Write out a separate /etc/ldap/wmf-ldap.conf for tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743353 (owner: 10Muehlenhoff) 
[11:12:55] (03PS4) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [11:13:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10jcrespo) The host is **down and ready to be serviced**, please let us know if st... [11:15:06] (03PS5) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [11:15:32] (03CR) 10Jbond: "Patch has been updated based on 743353" [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [11:16:46] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [11:21:53] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [11:23:09] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) And one more, added to the task description and ready to power off any time. I should have made an advent calendar out of these, one su... 
[11:24:17] (03CR) 10Muehlenhoff: [C: 03+2] Write out a separate /etc/ldap/wmf-ldap.conf for tools [puppet] - 10https://gerrit.wikimedia.org/r/743353 (owner: 10Muehlenhoff) [11:27:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2011.codfw.wmnet with OS buster [11:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:35] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster executed with errors: - ganeti2011 (**FAIL**) - Downtimed... [11:29:59] (03PS3) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:30:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2022.codfw.wmnet with OS buster [11:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:41] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS buster [11:31:45] (03CR) 10jerkins-bot: [V: 04-1] Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:37:04] (03PS4) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:40:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32803/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:45:21] (03PS5) 10Jbond: Switch 
profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:45:23] (03PS1) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [11:46:19] (03PS2) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [11:46:39] (03PS6) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:47:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32804/console" [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [11:49:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32805/console" [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:07:10] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:40] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2022.codfw.wmnet with OS buster [12:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:05] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS buster completed: - ganeti2022 (**PASS**) - Downtimed on Icinga... 
[12:16:36] (03PS3) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [12:17:07] (03CR) 10Muehlenhoff: [C: 03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [12:18:24] (03PS7) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:19:04] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:19:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32806/console" [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:20:28] (03CR) 10Muehlenhoff: Switch profile::openldap::management to use profile::openldap::client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:22:36] (03PS8) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:24:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32807/console" [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:26:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [12:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [12:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2022.codfw.wmnet to 
ganeti01.svc.codfw.wmnet [12:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2022.codfw.wmnet to ganeti01.svc.codfw.wmnet [12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:12] (03PS4) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [12:37:15] !log draining primary/secondary instances off ganeti2007 T296622 [12:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:19] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [12:37:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32808/console" [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [12:37:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32809/console" [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:37:55] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [12:38:40] (03PS5) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [12:38:51] (03PS9) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:39:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32810/console" [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [12:40:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32811/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [12:41:52] (03PS6) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [12:43:27] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [12:43:29] (03PS7) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [12:44:58] (03PS1) 10David Caro: pcc: add possibility to fail fast [puppet] - 10https://gerrit.wikimedia.org/r/743379 (https://phabricator.wikimedia.org/T296984) [12:45:00] (03PS1) 10David Caro: pcc: Autoformat with black+isort [puppet] - 10https://gerrit.wikimedia.org/r/743380 [12:45:13] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [12:45:57] (03CR) 10jerkins-bot: [V: 04-1] pcc: Autoformat with black+isort [puppet] - 10https://gerrit.wikimedia.org/r/743380 (owner: 10David Caro) [12:52:53] (03PS8) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [12:53:28] !log installing nss security updates on stretch [12:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:00] (03PS9) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [12:55:56] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:03:48] (03PS10) 10Jbond: modify-mfa.py: update 
modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:05:39] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:12:24] (03PS11) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:13:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32813/console" [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:14:07] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:15:38] (03PS12) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:15:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32814/console" [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:16:58] (03PS13) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:17:22] (03PS14) 10Jbond: modify-mfa.py: update modify-mfa and add-ldap-group to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:18:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32815/console" [puppet] - 10https://gerrit.wikimedia.org/r/743205 
(https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:19:52] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa and add-ldap-group to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:24:48] (03PS15) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:26:32] (03CR) 10jerkins-bot: [V: 04-1] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [13:33:53] (03PS1) 10Btullis: Refactor superset caching to enable dual caches [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) [13:34:32] (03PS1) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) [13:35:52] (03PS16) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [13:36:17] (03PS1) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) [13:36:28] (03CR) 10jerkins-bot: [V: 04-1] C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [13:36:46] (03PS2) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) [13:48:09] (03PS2) 10Herron: admin: add user eleoni to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743220 (https://phabricator.wikimedia.org/T296957) [13:48:16] (03PS2) 10Herron: admin: add aminalhazwani to ldap_only_users [puppet] - 
10https://gerrit.wikimedia.org/r/743246 (https://phabricator.wikimedia.org/T296816) [13:48:46] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32816/console" [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) (owner: 10Btullis) [13:49:43] (03CR) 10Btullis: Refactor superset caching to enable dual caches [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) (owner: 10Btullis) [14:06:34] (03CR) 10Btullis: partman: add reuse partman profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [14:08:40] (03CR) 10Volans: [C: 04-1] "I don't think would work as is" [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 (owner: 10Muehlenhoff) [14:10:42] !log jelto@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab-runner2001.codfw.wmnet [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:11] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner2001.codfw.wmnet [14:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:11] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [14:27:02] (03CR) 10Herron: [C: 03+2] admin: add user eleoni to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743220 (https://phabricator.wikimedia.org/T296957) (owner: 10Herron) [14:27:55] (03CR) 10Herron: [C: 03+2] admin: add aminalhazwani to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743246 (https://phabricator.wikimedia.org/T296816) (owner: 10Herron) [14:34:55] (03CR) 10Awight: "Ready for review :-)" [puppet] - 10https://gerrit.wikimedia.org/r/742148 
(https://phabricator.wikimedia.org/T296512) (owner: 10Awight) [14:35:15] (03PS2) 10Awight: Maps are invariant to revid parameter [puppet] - 10https://gerrit.wikimedia.org/r/742148 (https://phabricator.wikimedia.org/T296512) [14:36:17] (03PS1) 10Filippo Giunchedi: team-sre: port node-exporter textfile stale alert [alerts] - 10https://gerrit.wikimedia.org/r/743394 (https://phabricator.wikimedia.org/T288726) [14:36:40] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to WMF for Amin Al Hazwani - https://phabricator.wikimedia.org/T296816 (10herron) 05Open→03Resolved a:03herron Hi @aminalhazwani, your account has been added to the 'wmf' ldap group. I'll transition this to resolved now but please re-open... [14:39:43] (03PS1) 10Filippo Giunchedi: prometheus: remove textfile stale alert [puppet] - 10https://gerrit.wikimedia.org/r/743395 (https://phabricator.wikimedia.org/T288726) [14:40:12] (03PS1) 10Jelto: site and install_server: add gitlab-runner2001 [puppet] - 10https://gerrit.wikimedia.org/r/743396 (https://phabricator.wikimedia.org/T295481) [14:41:29] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eleoni - https://phabricator.wikimedia.org/T296957 (10herron) 05Open→03Resolved a:03herron Ldap account eleoni is now a member of ldap group 'wmf' Phab account @Daimona is already present in #wmf-nda. I'm not aware of an alte... [14:42:07] 10SRE, 10SRE-Access-Requests, 10WMF-NDA-Requests: Add EJoseph to #wmf-nda - https://phabricator.wikimedia.org/T293326 (10herron) 05Open→03Stalled [14:47:57] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (10herron) >>! In T296654#7539033, @Aklapper wrote: > It's human and understandable that folks stick to their procedural knowledge, but it also makes it hard to improve processes if they aren't looked... 
[14:57:58] (03CR) 10Jelto: [C: 03+2] site and install_server: add gitlab-runner2001 [puppet] - 10https://gerrit.wikimedia.org/r/743396 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:59:37] 10SRE, 10procurement: eqiad: request tools for data center cage - https://phabricator.wikimedia.org/T297013 (10Jclark-ctr) [15:00:19] 10SRE, 10procurement: eqiad: request tools for data center cage - https://phabricator.wikimedia.org/T297013 (10Jclark-ctr) p:05Triage→03Medium [15:13:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:37:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [15:50:48] (03PS1) 10Elukey: admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) [16:03:14] (03CR) 10Btullis: [C: 03+2] Re-apply spark.local.dir setting for stat servers [puppet] - 10https://gerrit.wikimedia.org/r/743155 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [16:22:34] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) > To be noted that the host is behind one of the cloudsw switch. @ayounsi , @cmooney do you think that this could be the culprit and maybe option 82 has some i... 
[16:27:05] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:35:24] (03PS1) 10Ssingh: dnsdist: add SpoofRawAction() for a Wikidough easter egg [puppet] - 10https://gerrit.wikimedia.org/r/743442 [16:36:38] (03CR) 10Ssingh: [C: 03+2] dnsdist: add SpoofRawAction() for a Wikidough easter egg [puppet] - 10https://gerrit.wikimedia.org/r/743442 (owner: 10Ssingh) [16:39:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [16:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [16:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [16:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:22] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [16:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [16:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:33] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [16:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [16:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:18] 10SRE, 
10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) Looks like the DHCP server is returning what it should. But the host keeps sending further DHCP requests, so not sure if the replies are making it. PCAP file... [17:13:39] (03PS1) 10Joal: Update aqs druid datasource for 2021-11 [puppet] - 10https://gerrit.wikimedia.org/r/743458 [17:15:09] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Andrew) @robh @wiki_willy sounds like this is back in your court. Cathal and I are out of ideas :( [17:17:28] (03PS1) 10Jelto: site: use gitlab_runner role on gitlab-runner2001 [puppet] - 10https://gerrit.wikimedia.org/r/743459 (https://phabricator.wikimedia.org/T295481) [17:18:49] (03CR) 10Razzi: [C: 03+2] Update aqs druid datasource for 2021-11 [puppet] - 10https://gerrit.wikimedia.org/r/743458 (owner: 10Joal) [17:22:59] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [17:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:12] (03CR) 10Jbond: C:puppet_compiler: add uploader class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743224 (owner: 10Jbond) [17:35:13] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[17:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:41] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [17:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:51] (03CR) 10JHathaway: P:openldap::client: Add ldap::client::utils (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [17:35:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [17:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [17:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:53] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [17:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:37] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) Ok the traffic is being relayed by cr1-eqiad at least: ` cmooney@re0.cr1-eqiad> monitor traffic interface xe-3/0/4.1118 matching "port 67 or port 68" no-resolv... [17:56:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [17:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:14] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) Not sure if this has any bearing on the above, but trying to ping each CR from install1003 I note that cr2-eqiad is blocking the ping: ` cmooney@install1003:~$... 
[18:08:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Jclark-ctr) [18:09:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Jclark-ctr) Servers added to netbox [18:14:18] (03CR) 10Dzahn: [C: 03+1] "yep, with the Hieradata in "common" it should just work" [puppet] - 10https://gerrit.wikimedia.org/r/743459 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [18:17:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Jclark-ctr) [18:17:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Jclark-ctr) added server to netbox [18:21:04] (03CR) 10Kipod: [C: 03+1] hewiki: add "templateeditor" permission group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) (owner: 104nn1l2) [18:23:05] (03CR) 10Kipod: [C: 03+1] hewiki: add "templateeditor" permission group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) (owner: 104nn1l2) [18:25:07] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) [18:28:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) Servers added to netbox [18:29:23] RECOVERY - 
SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:31:45] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:01] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Andrew) @Dzahn points out that it could be that dhcp is working but preseed is failing [18:36:05] (03CR) 10Andrew Bogott: [C: 03+2] wikireplicas: remove dependency on meta_p for user_properties_anon view [puppet] - 10https://gerrit.wikimedia.org/r/739680 (https://phabricator.wikimedia.org/T294652) (owner: 10BryanDavis) [18:36:11] (03PS2) 10Andrew Bogott: wikireplicas: remove dependency on meta_p for user_properties_anon view [puppet] - 10https://gerrit.wikimedia.org/r/739680 (https://phabricator.wikimedia.org/T294652) (owner: 10BryanDavis) [18:38:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) icinga::nsca::client is used in fundraising. 
so there are special cases that can be in use but this audit script doesn't see [18:40:02] (03PS1) 10Andrew Bogott: wiki replicas: depool clouddb1013-clouddb1016 [puppet] - 10https://gerrit.wikimedia.org/r/743472 (https://phabricator.wikimedia.org/T291404) [18:40:07] (03PS1) 10Andrew Bogott: wiki replicas: repool clouddb1013-clouddb1016, depool clouddb1017-1020 [puppet] - 10https://gerrit.wikimedia.org/r/743473 (https://phabricator.wikimedia.org/T291404) [18:40:12] (03PS1) 10Andrew Bogott: wiki replicas: repool clouddb1017-1020 [puppet] - 10https://gerrit.wikimedia.org/r/743474 (https://phabricator.wikimedia.org/T291404) [18:40:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) something like role::gerrit::migration is not in use but still has value, it is there so you can apply it during the next migration when it's time to... [18:41:09] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:42:24] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) 05Open→03In progress a:05MMandere→03RobH stealing to update firmware on these hosts, with the following notes: * dns600[12] are the only hosts in service. ensure updates ha... 
[18:43:47] (03PS1) 10Dzahn: requesttracker: delete plugins class [puppet] - 10https://gerrit.wikimedia.org/r/743482 (https://phabricator.wikimedia.org/T272559) [18:44:18] (03PS2) 10Dzahn: requesttracker: delete plugins class [puppet] - 10https://gerrit.wikimedia.org/r/743482 (https://phabricator.wikimedia.org/T272559) [18:45:09] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [18:45:45] (03CR) 10Andrew Bogott: [C: 03+2] wiki replicas: depool clouddb1013-clouddb1016 [puppet] - 10https://gerrit.wikimedia.org/r/743472 (https://phabricator.wikimedia.org/T291404) (owner: 10Andrew Bogott) [18:46:04] 10SRE, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10sbassett) [18:46:27] 10SRE, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10sbassett) 05Resolved→03Invalid Huh, given that both relevant patches were never merged, I'm going to set this to invalid for now, since it defini... [18:47:33] 10SRE, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10sbassett) 05Open→03Invalid Closing as invalid for now, as this is quite dated and much of this task would need to be re-reviewed at this point. 
[18:47:45] (03PS1) 10WQuarshie: example-node-api chart creating chart for the exampl-node-api Bug:T288134 [deployment-charts] - 10https://gerrit.wikimedia.org/r/743483 [18:49:19] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Jclark-ctr) [18:49:30] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Jclark-ctr) Servers added to netbox [18:50:18] (03CR) 10Dzahn: [C: 03+2] requesttracker: delete plugins class [puppet] - 10https://gerrit.wikimedia.org/r/743482 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [18:51:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) [18:51:45] ACKNOWLEDGEMENT - configured eth on ganeti6001 is CRITICAL: public reporting no carrier. rhalsell T286507 firmware updates https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:51:45] ACKNOWLEDGEMENT - configured eth on ganeti6002 is CRITICAL: public reporting no carrier. rhalsell T286507 firmware updates https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:51:45] ACKNOWLEDGEMENT - configured eth on ganeti6003 is CRITICAL: public reporting no carrier. rhalsell T286507 firmware updates https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:51:45] ACKNOWLEDGEMENT - configured eth on ganeti6004 is CRITICAL: public reporting no carrier. 
rhalsell T286507 firmware updates https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:54:04] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [18:54:08] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10KFrancis) @herron I am confirming the NDA is complete. Please proceed with the access request. Thanks! [18:54:56] 10SRE, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10Dzahn) The unused classes diamond::collector::servicestats diamond::collector::servicestats_lib still exist and pop up in T272559 [18:55:29] PROBLEM - Host ganeti6003 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:29] PROBLEM - Host ganeti6002 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:31] PROBLEM - Host ganeti6001 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:43] PROBLEM - Host ganeti6004 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:14] ^ expected? [18:56:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) re: diamond classes T210993 T212231 [18:57:13] those are all drmrs [18:57:17] yes [18:59:03] (03CR) 10Andrew Bogott: [C: 03+2] wiki replicas: repool clouddb1013-clouddb1016, depool clouddb1017-1020 [puppet] - 10https://gerrit.wikimedia.org/r/743473 (https://phabricator.wikimedia.org/T291404) (owner: 10Andrew Bogott) [18:59:23] RECOVERY - Host ganeti6001 is UP: PING OK - Packet loss = 0%, RTA = 85.27 ms [18:59:29] RECOVERY - Host ganeti6002 is UP: PING OK - Packet loss = 0%, RTA = 85.17 ms [18:59:37] RECOVERY - Host ganeti6004 is UP: PING OK - Packet loss = 0%, RTA = 85.20 ms [18:59:37] RECOVERY - Host ganeti6003 is UP: PING OK - Packet loss = 0%, RTA = 85.24 ms [18:59:51] hmm [18:59:57] guessing network? 
[19:00:47] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:00:50] ah, there was an ack a few lines above [19:01:09] (03PS1) 10Herron: admin: add ollieshotton to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743484 (https://phabricator.wikimedia.org/T296715) [19:01:37] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10herron) [19:01:59] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:10] (03PS1) 10Majavah: httpbb: fix doc tests [puppet] - 10https://gerrit.wikimedia.org/r/743485 (https://phabricator.wikimedia.org/T247653) [19:05:53] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:07:13] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Improve mailman3 queue alerting - https://phabricator.wikimedia.org/T295805 (10herron) 05Open→03Resolved a:03herron I'll give this a soft close since the check interval has been extended, and that satisfies the ask in the description . If there's any... 
[19:08:15] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10herron) p:05Triage→03Medium [19:08:32] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10herron) p:05Triage→03Medium [19:09:03] 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10herron) p:05Triage→03Medium [19:10:02] (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/743485 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [19:11:59] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:12:36] (03CR) 10Andrew Bogott: [C: 03+2] wiki replicas: repool clouddb1017-1020 [puppet] - 10https://gerrit.wikimedia.org/r/743474 (https://phabricator.wikimedia.org/T291404) (owner: 10Andrew Bogott) [19:12:48] (03PS2) 10Andrew Bogott: wiki replicas: repool clouddb1017-1020 [puppet] - 10https://gerrit.wikimedia.org/r/743474 (https://phabricator.wikimedia.org/T291404) [19:12:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:57] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [19:17:07] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:36:16] (03PS1) 10Ssingh: durum: update landing page text [puppet] - 10https://gerrit.wikimedia.org/r/743490 
[19:39:31] (03PS2) 10Ssingh: durum: update landing page text [puppet] - 10https://gerrit.wikimedia.org/r/743490 [19:41:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32817/console" [puppet] - 10https://gerrit.wikimedia.org/r/743490 (owner: 10Ssingh) [19:41:42] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:42:03] 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 3 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10DStrine) Hey all, Bumping this @Ppena just saw a TY page in app and an in-app banner appear over it. This w... [19:44:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: update landing page text [puppet] - 10https://gerrit.wikimedia.org/r/743490 (owner: 10Ssingh) [20:43:34] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Peachey88) [20:51:04] (03CR) 10Eigyan: "I just wanted to point out that though this patch overwrites the performance survey in beta - it will not override production performance " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [20:52:01] (03PS9) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:52:10] (03PS10) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:52:14] (03PS1) 10Ebernhardson: rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 [20:55:18] 10SRE, 10Infrastructure-Foundations, 10Mail: MX 
record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Majavah) [20:58:23] (03PS3) 10Majavah: set up tls termination on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) [21:11:16] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10herron) Hi @bcampbell do you have any examples of the bounce messages including the full raw headers? With that we could trace the messages through the mail logs. A private pa... [21:16:11] 10SRE, 10Traffic-Icebox: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350 (10odimitrijevic) [21:25:22] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10bcampbell) Hey @herron thanks. I think I uploaded the eml file privately and added you as a subscriber, but let me know if you don't see it. [21:51:39] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10herron) Thanks, looking into this I see in the private message: ` Subject: Warning: message 1msZEc-006308-GG delayed 24 hours This message was created automatically by mail de... 
[21:54:29] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:55:51] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:23] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:03] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:24:39] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [22:29:33] 10SRE, 10Analytics-Clusters, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10odimitrijevic) @razzi is this still relevant? 
[22:29:46] 10SRE, 10Analytics-Clusters, 10Data-Engineering, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10odimitrijevic) [22:32:00] 10SRE, 10Analytics-Clusters, 10Analytics-Radar: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10odimitrijevic) [22:33:22] (03PS1) 10Herron: Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [puppet] - 10https://gerrit.wikimedia.org/r/743423 [22:34:21] (03CR) 10Legoktm: [C: 03+1] Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [puppet] - 10https://gerrit.wikimedia.org/r/743423 (owner: 10Herron) [22:35:04] (03CR) 10Herron: [C: 03+2] Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [puppet] - 10https://gerrit.wikimedia.org/r/743423 (owner: 10Herron) [22:37:28] (03PS1) 10Dzahn: Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [puppet] - 10https://gerrit.wikimedia.org/r/743424 [22:37:40] 10SRE, 10Analytics-Clusters, 10Data-Engineering, 10Infrastructure-Foundations, 10netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (10odimitrijevic) [22:39:53] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Tue 04 Jan 2022 11:55:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [22:43:54] 10SRE, 10Analytics-Clusters, 10Data-Engineering, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10razzi) Yes, superset still uses firejail: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/pro... 
[23:50:22] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10eliza) Hi Herron What may be the next steps as ITS is receiving more tickets related to this issue. Thanks Eliza [23:57:50] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) 05Open→03In progress p:05Triage→03High [23:59:24] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx1001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10faidon) @eliza we're looking into this - next update in 15mins.