[00:00:05] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T0000). [00:05:39] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:05:42] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284710 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [00:05:42] aid_Information_Gathering [00:05:46] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284710 (10ops-monitoring-bot) [00:23:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:18] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [01:21:10] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - 
Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [01:21:13] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284713 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [01:21:13] aid_Information_Gathering [01:21:17] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284713 (10ops-monitoring-bot) [01:31:04] RECOVERY - Long running screen/tmux on an-launcher1002 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [01:42:50] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [02:17:20] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:36:44] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [02:36:47] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: 
Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284714 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [02:36:47] aid_Information_Gathering [02:36:51] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284714 (10ops-monitoring-bot) [02:51:30] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:58:18] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [03:52:10] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:57:15] (03PS1) 10Marostegui: Revert "db1130: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/699022 [05:03:22] (03CR) 10Marostegui: [C: 03+2] Revert "db1130: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/699022 (owner: 10Marostegui) [05:05:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repool db1130 after upgrade', diff saved to https://phabricator.wikimedia.org/P16366 and previous config saved to /var/cache/conftool/dbconfig/20210610-050526-root.json [05:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:12] (03PS2) 10Marostegui: dbbackups: Switchover eqiad s5 backups from db1145 to db1150 (buster) [puppet] - 
10https://gerrit.wikimedia.org/r/698157 (https://phabricator.wikimedia.org/T283235) (owner: 10Jcrespo) [05:10:55] (03CR) 10Marostegui: [C: 03+2] dbbackups: Switchover eqiad s5 backups from db1145 to db1150 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/698157 (https://phabricator.wikimedia.org/T283235) (owner: 10Jcrespo) [05:20:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16367 and previous config saved to /var/cache/conftool/dbconfig/20210610-052017-marostegui.json [05:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repool db1130 after upgrade', diff saved to https://phabricator.wikimedia.org/P16368 and previous config saved to /var/cache/conftool/dbconfig/20210610-052030-root.json [05:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:56] (03PS1) 10KartikMistry: Add support for Elia MT to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/699089 (https://phabricator.wikimedia.org/T275803) [05:23:44] I need to deploy MT API key for above change. Whom should I ping for it? 
[05:23:55] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 469 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:23:57] (Puppet private repository) [05:24:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16369 and previous config saved to /var/cache/conftool/dbconfig/20210610-052421-marostegui.json [05:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:32] Those fatals are due to db1096 going crazy, I have depooled it [05:25:59] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:32:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P16370 and previous config saved to /var/cache/conftool/dbconfig/20210610-053255-root.json [05:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16371 and previous config saved to /var/cache/conftool/dbconfig/20210610-053259-root.json [05:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repool db1130 after upgrade', diff saved to https://phabricator.wikimedia.org/P16372 and previous config saved to 
/var/cache/conftool/dbconfig/20210610-053534-root.json [05:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:08] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:39:10] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284718 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [05:39:10] aid_Information_Gathering [05:39:15] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284718 (10ops-monitoring-bot) [05:41:14] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/699091 [05:41:59] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/699091 (owner: 10Marostegui) [05:44:18] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1017" [puppet] - 10https://gerrit.wikimedia.org/r/699024 [05:44:54] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1017" [puppet] - 10https://gerrit.wikimedia.org/r/699024 (owner: 10Marostegui) [05:47:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P16373 and previous config saved to /var/cache/conftool/dbconfig/20210610-054759-root.json [05:48:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16374 and previous config saved to /var/cache/conftool/dbconfig/20210610-054802-root.json [05:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:52] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:50:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repool db1130 after upgrade', diff saved to https://phabricator.wikimedia.org/P16375 and previous config saved to /var/cache/conftool/dbconfig/20210610-055037-root.json [05:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16376 and previous config saved to /var/cache/conftool/dbconfig/20210610-055327-marostegui.json [05:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:42] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:57:32] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repool db1096:3315 
after schema change', diff saved to https://phabricator.wikimedia.org/P16377 and previous config saved to /var/cache/conftool/dbconfig/20210610-060302-root.json [06:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16378 and previous config saved to /var/cache/conftool/dbconfig/20210610-060405-root.json [06:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P16379 and previous config saved to /var/cache/conftool/dbconfig/20210610-061806-root.json [06:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16380 and previous config saved to /var/cache/conftool/dbconfig/20210610-061909-root.json [06:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:16] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:22:20] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - 
Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284719 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha [06:22:20] aid_Information_Gathering [06:22:24] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284719 (10ops-monitoring-bot) [06:27:05] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10RhinosF1) [06:27:09] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284719 (10RhinosF1) [06:27:12] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284709 (10RhinosF1) [06:27:15] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284713 (10RhinosF1) [06:27:17] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T284710 (10RhinosF1) [06:27:25] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10RhinosF1) Merged all open tasks of same title [06:34:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16381 and previous config saved to /var/cache/conftool/dbconfig/20210610-063412-root.json [06:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16382 and previous config saved to /var/cache/conftool/dbconfig/20210610-063745-marostegui.json [06:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:26] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:36] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10fgiunchedi) Thanks folks for merging the duplicate tasks. I've temporarily disabled the event handler on icinga so no further tasks will be open. @papaul this BBU is flipping between recharging a... [06:43:54] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:49:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16383 and previous config saved to /var/cache/conftool/dbconfig/20210610-064916-root.json [06:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699057 (https://phabricator.wikimedia.org/T284647) (owner: 10Volans) [06:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 25%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16384 and previous config saved to /var/cache/conftool/dbconfig/20210610-065217-root.json [06:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:56] (03PS26) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [06:54:58] (03PS11) 10Elukey: Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 
(https://phabricator.wikimedia.org/T278194) [06:55:00] (03PS9) 10Elukey: Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) [06:55:02] (03PS8) 10Elukey: Add Jetstack's cert-manager base go images. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) [06:55:11] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) >>! In T265435#7146748, @wiki_willy wrote: > Thanks @Papaul. So in terms of feedback for Raritan, so far it's: > > - convert PDU to one row of plug... [06:56:36] (03CR) 10Elukey: "Fixed the nobody comments, I totally forgot to follow up about what kube-rbac-proxy does, going to dig a bit into kfserving's internals an" (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [07:07:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16385 and previous config saved to /var/cache/conftool/dbconfig/20210610-070720-root.json [07:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:39] (03CR) 10JMeybohm: [C: 03+1] Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [07:13:48] (03PS4) 10Muehlenhoff: Enable apt* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698960 [07:22:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16386 and previous 
config saved to /var/cache/conftool/dbconfig/20210610-072224-root.json [07:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:05] (03PS1) 10Marostegui: wmnet: Promote db1130 to s5 master [dns] - 10https://gerrit.wikimedia.org/r/699136 (https://phabricator.wikimedia.org/T284529) [07:23:24] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/699136 (https://phabricator.wikimedia.org/T284529) (owner: 10Marostegui) [07:30:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698960 (owner: 10Muehlenhoff) [07:31:59] (03CR) 10Jgiannelos: [C: 04-1] osm: create missing imposm directories, add mirror support to import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [07:37:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16387 and previous config saved to /var/cache/conftool/dbconfig/20210610-073727-root.json [07:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:28] (03CR) 10Muehlenhoff: [C: 03+2] Enable apt* hosts for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/698960 (owner: 10Muehlenhoff) [07:43:58] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.53 [software/spicerack] - 10https://gerrit.wikimedia.org/r/699138 [07:44:27] !log retrying s6 snapshots on eqiad, acking demon failure [07:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:57] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.53 [software/spicerack] - 10https://gerrit.wikimedia.org/r/699138 (owner: 10Volans) [07:49:16] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 455 gt 100 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:49:54] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 578 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:51:45] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1974 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [07:51:52] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9365 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:52:01] * volans here [07:52:07] uh? 
[07:52:12] <_joe_> uhm [07:52:19] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.53 [software/spicerack] - 10https://gerrit.wikimedia.org/r/699138 (owner: 10Volans) [07:52:20] around [07:52:23] here [07:52:34] seems to be one host in s6 [07:52:35] depooling [07:52:47] <_joe_> marostegui: ah ok I was about to ask you to check the dbs [07:52:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16388 and previous config saved to /var/cache/conftool/dbconfig/20210610-075247-marostegui.json [07:52:48] same pattern that happened around 5:22 [07:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:01] <_joe_> yeah it was just barely less severe [07:53:03] root@db1098:~# w [07:53:03] 07:52:59 up 43 days, 22 min, 1 user, load average: 173.61, 138.39, 86.69 [07:53:07] but larger this time [07:53:12] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.7385 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:53:32] <_joe_> so things should recover rather quickly if you kill all connections on that server marostegui [07:54:03] _joe_: I can't, i can barely type on the host :) [07:54:11] <_joe_> we're not in an outage anyways, we have a slight increase in the 5xx we serve to the edge caches [07:54:33] <_joe_> marostegui: sudo kill -9 1 would do the trick though [07:54:38] <_joe_> :P [07:54:43] the host is now healthy [07:54:48] it only has 70 connections [07:54:52] the graph show signs of recovery [07:54:56] <_joe_> we're back to normal with most metrics too [07:55:16] <_joe_> we should get a recovery pretty soon [07:55:25] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5775 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [07:55:27] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=8&orgId=1&from=1623290119647&to=1623311719647&var-site=eqiad&var-group=core&var-shard=All&var-role=All [07:55:35] ^ :-O [07:56:01] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=43&from=now-15m&orgId=1&to=now&var-cluster=api_appserver&var-datasource=eqiad%20prometheus%2Fops this is missing s6 it seems? [07:56:08] jynus: that's probably not real, there was a schema change done [07:56:12] ah, ok [07:56:16] jynus: go back to your holidays!!!!!!!!!! [07:56:35] "My troops were merely passing by" :-) [07:56:50] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:57:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317', diff saved to https://phabricator.wikimedia.org/P16389 and previous config saved to /var/cache/conftool/dbconfig/20210610-075702-marostegui.json [07:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:14] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:57:20] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:57:31] !log reset-failed on cumin1001 after backup rerun [07:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:10] 10SRE, 10User-MoritzMuehlenhoff: Improve keytab management with CI - https://phabricator.wikimedia.org/T284720 (10MoritzMuehlenhoff) [07:58:18] 10SRE, 10User-MoritzMuehlenhoff: Improve keytab management with CI - https://phabricator.wikimedia.org/T284720 (10MoritzMuehlenhoff) p:05Triage→03Low [07:58:24] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:58:36] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:01] (03PS1) 10Marostegui: db1098: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699139 [07:59:11] (03PS1) 10Volans: Upstream release v0.0.53 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/699140 [07:59:23] marostegui: need a hand for db1098? [07:59:36] volans: no, I am going to check HW errors and give it a reboot :) [07:59:37] thanks! 
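The Icinga alerts in this stretch report a measured value, a comparator, and a threshold (for example `455 gt 100` or `0.1974 lt 0.3`). A minimal sketch of reading those fields back out of a log line follows. It assumes the `is CRITICAL: <value> <gt|lt> <threshold>` shape seen above; the regex and the `parse_alert` helper are hypothetical names for illustration, not part of any Icinga tooling.

```python
import operator
import re

# Comparator words as they appear in the alert text of this log.
COMPARATORS = {"gt": operator.gt, "lt": operator.lt}

# Shape inferred from the alert lines in this log (an assumption, not a documented format).
ALERT_RE = re.compile(
    r"is CRITICAL: (?P<value>[\d.]+) (?P<op>gt|lt) (?P<threshold>[\d.]+)"
)


def parse_alert(line):
    """Return (value, comparator, threshold, breached), or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    value = float(match["value"])
    threshold = float(match["threshold"])
    breached = COMPARATORS[match["op"]](value, threshold)
    return value, match["op"], threshold, breached


if __name__ == "__main__":
    samples = [
        "PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver "
        "on alert1001 is CRITICAL: 455 gt 100",
        "PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver "
        "at eqiad #page on alert1001 is CRITICAL: 0.1974 lt 0.3",
    ]
    for sample in samples:
        print(parse_alert(sample))
```

Alerts with a different body, such as the HP RAID or systemd checks, do not match this pattern and return `None`.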
[07:59:45] this was a thing thing, however: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=1623290361470&to=1623311961471&var-site=eqiad&var-group=core&var-shard=All&var-role=All [07:59:46] ack [08:00:14] jynus: yes, the first schema change caused a spike [08:00:16] different from the schema change, and not only on s6 [08:00:42] side note: stacked metrics are hard to read IMHO [08:00:58] (CR) Marostegui: [C: +2] db1098: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/699139 (owner: Marostegui) [08:01:01] volans, yeah, but all at the same time is harder! [08:01:12] I just click the title of the section for individual examination [08:01:36] that is more of "something's happening than the details" kinda dashboard [08:01:58] :) [08:05:57] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (ayounsi) {F34489185} See the two cloudsw2 on the right of the diagram. If you have any spare 40G DACs feel free to use them, length at DCops discretion. Otherwise let me know... 
[08:06:21] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.53 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/699140 (owner: 10Volans) [08:12:35] (03Merged) 10jenkins-bot: Upstream release v0.0.53 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/699140 (owner: 10Volans) [08:15:29] (03PS1) 10Marostegui: Revert "db1098: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/699025 [08:16:14] (03CR) 10Marostegui: [C: 03+2] Revert "db1098: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/699025 (owner: 10Marostegui) [08:16:56] (03PS27) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [08:16:58] (03PS12) 10Elukey: Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) [08:17:01] (03PS10) 10Elukey: Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) [08:17:03] (03PS9) 10Elukey: Add Jetstack's cert-manager base go images. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) [08:17:45] !log Drop several grants from labswiki (wikitech) T282074 [08:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:56] T282074: Audit labswiki grants - https://phabricator.wikimedia.org/T282074 [08:18:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 5%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16391 and previous config saved to /var/cache/conftool/dbconfig/20210610-081828-root.json [08:18:30] (03CR) 10Elukey: "After a chat with Joe I have refactored a bit the build Dockerfile to avoid ENV as much as possible (that in theory can be overridden at r" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [08:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 5%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16392 and previous config saved to /var/cache/conftool/dbconfig/20210610-081834-root.json [08:18:35] (03CR) 10Elukey: "After a chat with Joe I have refactored a bit the build Dockerfile to avoid ENV as much as possible (that in theory can be overridden at r" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [08:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:40] (03CR) 10Elukey: "After a chat with Joe I have refactored a bit the build Dockerfile to avoid ENV as much as possible (that in theory can be overridden at r" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) 
[08:25:05] !log uploaded spicerack_0.0.53 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [08:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:50] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) Can you possibly explain why our ISP's interface connected to our network (using IP 84.19.61.194) can successfully ping 91.198.174.192 and 91.198.174.208 (used earlier in t... [08:33:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16393 and previous config saved to /var/cache/conftool/dbconfig/20210610-083332-root.json [08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16394 and previous config saved to /var/cache/conftool/dbconfig/20210610-083338-root.json [08:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:57] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10ayounsi) @Cmjohnson @wiki_willy would it be possible to prioritize this (or at least 1 of the 2) for before the next 2 weeks? We would like to test a fix for T284592 before roll... 
[08:46:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699056 (owner: 10Volans) [08:48:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 20%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16395 and previous config saved to /var/cache/conftool/dbconfig/20210610-084835-root.json [08:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 20%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16396 and previous config saved to /var/cache/conftool/dbconfig/20210610-084841-root.json [08:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699057 (https://phabricator.wikimedia.org/T284647) (owner: 10Volans) [08:52:42] (03CR) 10Volans: [C: 03+2] setup.py: fix Django classifier [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699056 (owner: 10Volans) [08:53:38] (03CR) 10JMeybohm: [C: 03+1] Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [08:53:46] (03CR) 10Volans: [C: 03+2] cli: urllib3 backward/forward compatibility [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699057 (https://phabricator.wikimedia.org/T284647) (owner: 10Volans) [08:54:58] (03Merged) 10jenkins-bot: setup.py: fix Django classifier [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699056 (owner: 10Volans) [08:56:10] (03Merged) 10jenkins-bot: cli: urllib3 backward/forward compatibility [software/debmonitor] - 10https://gerrit.wikimedia.org/r/699057 (https://phabricator.wikimedia.org/T284647) (owner: 10Volans) [08:56:16] (03CR) 10JMeybohm: [C: 
03+1] Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [08:57:52] (03Abandoned) 10Jbond: P:cumin::master: change permission of config file [puppet] - 10https://gerrit.wikimedia.org/r/699027 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [09:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 30%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16397 and previous config saved to /var/cache/conftool/dbconfig/20210610-090339-root.json [09:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 30%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16398 and previous config saved to /var/cache/conftool/dbconfig/20210610-090345-root.json [09:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:48] (03PS1) 10Jbond: P:cumin::master: Use sudo to run the check command [puppet] - 10https://gerrit.wikimedia.org/r/699147 (https://phabricator.wikimedia.org/T268211) [09:09:22] (03PS1) 10Jbond: netbox-next: switch to ldap [puppet] - 10https://gerrit.wikimedia.org/r/699148 [09:12:09] (03CR) 10Jbond: [C: 03+2] P:cumin::master: Use sudo to run the check command [puppet] - 10https://gerrit.wikimedia.org/r/699147 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [09:13:08] (03CR) 10Volans: [C: 03+1] "LGTM, potential simplication inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699147 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [09:16:09] (03CR) 10Volans: "reply to comment" (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [09:16:27] (03PS1) 10Jbond: 
Revert "P:cumin::master: Use sudo to run the check command" [puppet] - 10https://gerrit.wikimedia.org/r/699166 [09:16:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:cumin::master: Use sudo to run the check command" [puppet] - 10https://gerrit.wikimedia.org/r/699166 (owner: 10Jbond) [09:17:46] (03PS1) 10Jbond: P:cumin::master: Use sudo to run the check command [puppet] - 10https://gerrit.wikimedia.org/r/699167 (https://phabricator.wikimedia.org/T268211) [09:18:31] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.712e+06 ge 2.592e+05 Hnowlan Side effect of introduction of maps2009 as second master. https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [09:18:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 40%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16399 and previous config saved to /var/cache/conftool/dbconfig/20210610-091842-root.json [09:18:44] (03PS1) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [09:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:16] (03CR) 10jerkins-bot: [V: 04-1] P:cumin::master: Use sudo to run the check command [puppet] - 10https://gerrit.wikimedia.org/r/699167 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [09:20:12] (03CR) 10jerkins-bot: [V: 04-1] WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (owner: 10Effie Mouzeli) [09:22:03] (03CR) 10David Caro: ceph: add cookbooks to reboot osds (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [09:22:11] (03PS2) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [09:22:47] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16401 and previous config saved to /var/cache/conftool/dbconfig/20210610-092246-marostegui.json [09:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:47] (03CR) 10jerkins-bot: [V: 04-1] WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (owner: 10Effie Mouzeli) [09:24:44] (03PS3) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [09:29:38] (03CR) 10Hnowlan: osm: create missing imposm directories, add mirror support to import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [09:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316', diff saved to https://phabricator.wikimedia.org/P16402 and previous config saved to /var/cache/conftool/dbconfig/20210610-093003-marostegui.json [09:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:29] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:57] 10SRE, 10CAS-SSO, 10User-jbond: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725 (10jbond) p:05Triage→03Medium [09:33:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16404 and previous config saved to /var/cache/conftool/dbconfig/20210610-093346-root.json [09:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:42] 10SRE, 10CAS-SSO, 10User-jbond: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725 (10Volans) If I may 
add to the wish list, support multiple tokens for those that have more than one for added redundancy. [09:43:37] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [09:44:49] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:45:38] (03PS4) 10David Caro: ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) [09:45:40] (03PS2) 10David Caro: wmcs: Fixed docstring on CephController.get_nodes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698988 [09:45:42] (03PS2) 10David Caro: wmcs.ceph: Add cookbook to reboot mons [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698989 (https://phabricator.wikimedia.org/T281248) [09:45:44] (03PS1) 10David Caro: wmcs.ceph.reboot_node: use newer icinga_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/699154 [09:46:30] (03CR) 10David Caro: [C: 03+2] ceph: add cookbooks to reboot osds (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [09:46:43] 10SRE, 10Wikimedia-Mailing-lists: Please close the wmfkids@ mailing list - https://phabricator.wikimedia.org/T284683 (10Volans) p:05Triage→03Medium [09:48:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 60%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16405 and previous config saved to 
/var/cache/conftool/dbconfig/20210610-094851-root.json [09:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:21] (03PS1) 10Giuseppe Lavagetto: Add base debian directory [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/699155 (https://phabricator.wikimedia.org/T284417) [09:50:39] (03Merged) 10jenkins-bot: ceph: add cookbooks to reboot osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698819 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [09:51:19] (03CR) 10MSantos: osm: create missing imposm directories, add mirror support to import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [09:55:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698989 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [09:57:13] (03PS1) 10Elukey: profile::docker::builder: add the istio use case to image builders [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) [09:58:15] (03PS2) 10Elukey: profile::docker::builder: add the istio use case to image builders [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) [09:58:24] 10SRE, 10CAS-SSO, 10User-jbond: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725 (10MoritzMuehlenhoff) >>! In T284725#7148119, @Volans wrote: > If I may add to the wish list, support multiple tokens for those that have more than one for added redundancy. For U2F that's curre... [09:59:43] (03CR) 10jerkins-bot: [V: 04-1] profile::docker::builder: add the istio use case to image builders [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [10:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T1000). [10:00:14] (03PS2) 10Jbond: P:cumin::master: Use sudo to run the check command [puppet] - 10https://gerrit.wikimedia.org/r/699167 (https://phabricator.wikimedia.org/T268211) [10:00:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29854/console" [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [10:02:17] (03PS1) 10Urbanecm: Fix call to renamed var [extensions/WikiEditor] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699168 (https://phabricator.wikimedia.org/T284716) [10:03:36] (03PS3) 10Elukey: profile::docker::builder: add the istio use case to image builders [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) [10:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16406 and previous config saved to /var/cache/conftool/dbconfig/20210610-100355-root.json [10:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:24] (03CR) 10Jbond: [C: 03+2] P:cumin::master: Use sudo to run the check command [puppet] - 10https://gerrit.wikimedia.org/r/699167 (https://phabricator.wikimedia.org/T268211) (owner: 10Jbond) [10:15:05] 10Puppet, 10SRE, 10SRE-tools, 10User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10JMeybohm) > I had a file that initially failed yamllint, but when I fixed it, I forgot to stage the change, so I didn't actually commit t... 
[10:17:04] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [extensions/WikiEditor] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699168 (https://phabricator.wikimedia.org/T284716) (owner: 10Urbanecm) [10:18:12] (03PS2) 10Kormat: mariadb: Promote db1157 as s3 primary [puppet] - 10https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648) [10:18:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: Repool db1098:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P16407 and previous config saved to /var/cache/conftool/dbconfig/20210610-101858-root.json [10:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:05] (03CR) 10Kormat: [C: 03+1] wmnet: Promote db1130 to s5 master [dns] - 10https://gerrit.wikimedia.org/r/699136 (https://phabricator.wikimedia.org/T284529) (owner: 10Marostegui) [10:19:10] (03PS4) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [10:19:20] (03PS1) 10Jbond: Revert "P:cumin::master: Use sudo to run the check command" [puppet] - 10https://gerrit.wikimedia.org/r/699169 [10:21:32] 10Puppet, 10SRE, 10SRE-tools, 10User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10Volans) @JMeybohm Just to clarify the chain of events: - you did change the file locally with the typo - staged for commit - commit faile... [10:21:46] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
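For reference, the `!log` entries above show db1098:3317 being repooled at 5, 10, 20, 30, 40, 50, 60, 75 and 100 percent, roughly 15 minutes apart (08:18 to 10:18). The sketch below only reproduces that timetable; `repool_schedule` and `REPOOL_STEPS` are hypothetical names used for illustration, not part of dbctl or any Wikimedia tooling, and the fixed 15-minute interval is an assumption read off the timestamps.

```python
from datetime import datetime, timedelta

# Percentages copied from the !log entries for db1098:3317 in this log.
REPOOL_STEPS = (5, 10, 20, 30, 40, 50, 60, 75, 100)


def repool_schedule(start, interval=timedelta(minutes=15), steps=REPOOL_STEPS):
    """Return (time, percentage) pairs for a stepped repool starting at `start`.

    Hypothetical helper: it only computes the timetable and does not
    talk to dbctl or any other tool.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    # Start time rounded from the first 5% entry in the log (08:18:29).
    for when, pct in repool_schedule(datetime(2021, 6, 10, 8, 18)):
        print(f"{when:%H:%M} pool at {pct}%")
```

With nine steps and a 15-minute interval the last step falls two hours after the first, which matches the 100% entry logged at 10:18:59.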
[10:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:01] (03PS1) 10Jbond: P:cumin:monitoring_agentrun: sudo rule cant have sudo in it [puppet] - 10https://gerrit.wikimedia.org/r/699162 [10:23:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:cumin:monitoring_agentrun: sudo rule cant have sudo in it [puppet] - 10https://gerrit.wikimedia.org/r/699162 (owner: 10Jbond) [10:24:29] 10Puppet, 10SRE, 10SRE-tools, 10User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10JMeybohm) >>! In T278187#7148190, @Volans wrote: > Is that correct? Absolutely. [10:25:01] (03PS1) 10David Caro: wmcs.ceph: rename reboot->roll_reboot [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/699163 [10:25:33] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:12] (03CR) 10Jbond: [C: 03+2] netbox-next: switch to ldap [puppet] - 10https://gerrit.wikimedia.org/r/699148 (owner: 10Jbond) [10:27:27] (03PS1) 10Jbond: Revert "netbox-next: switch to ldap" [puppet] - 10https://gerrit.wikimedia.org/r/699170 [10:27:35] (03PS1) 10Muehlenhoff: Grant datacenter-ops access to cuminunpriv [puppet] - 10https://gerrit.wikimedia.org/r/699164 [10:27:42] 10Puppet, 10SRE, 10SRE-tools, 10User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10Volans) Ok, that explains what happened, thanks! As for the solution I think we should just pass the staged files to yamllint instead of t... 
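The T278187 thread above concludes that the hook should look at what is staged rather than at the current state of the folder. A minimal sketch of that idea, assuming a Python hook: the function names are hypothetical, and this is not the change proposed for the private puppet repo. It relies only on `git diff --cached --name-only --diff-filter=ACM` to list staged paths and `git show :<path>` to read the staged version of a file; how that content is then handed to yamllint is left out.

```python
import subprocess


def staged_yaml_paths(name_only_output):
    """Filter the output of `git diff --cached --name-only` down to YAML files."""
    return [
        path
        for path in name_only_output.splitlines()
        if path.endswith((".yaml", ".yml"))
    ]


def list_staged_yaml():
    """List staged (added, copied or modified) YAML files in the current repo."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return staged_yaml_paths(out)


def staged_content(path):
    """Return the staged (index) version of `path`, not the working-tree copy."""
    return subprocess.run(
        ["git", "show", f":{path}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
```

Reading the file through `git show :<path>` is what avoids the situation described in the task: a fix that exists in the working tree but was never staged would not be seen by the check.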
[10:28:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [10:28:51] !log running optimize tables against pc1009 (pc3) T282761 [10:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:56] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [10:29:15] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:14] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/699164 (owner: 10Muehlenhoff) [10:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P16408 and previous config saved to /var/cache/conftool/dbconfig/20210610-103032-marostegui.json [10:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 5%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16409 and previous config saved to /var/cache/conftool/dbconfig/20210610-103132-root.json [10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] (03PS1) 10Jbond: netbox - cas: only import cas view if we have cas enabled [software/netbox] - 10https://gerrit.wikimedia.org/r/699165 [10:32:59] (03CR) 10Jbond: [C: 03+2] Revert "netbox-next: switch to ldap" [puppet] - 10https://gerrit.wikimedia.org/r/699170 (owner: 10Jbond) [10:37:52] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: Add cookbook to reboot mons (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698989 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) 
[10:38:08] (03CR) 10David Caro: [C: 03+2] wmcs.ceph.reboot_node: use newer icinga_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/699154 (owner: 10David Caro) [10:38:12] (03PS5) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [10:39:05] (03Merged) 10jenkins-bot: Fix call to renamed var [extensions/WikiEditor] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699168 (https://phabricator.wikimedia.org/T284716) (owner: 10Urbanecm) [10:39:40] (03CR) 10jerkins-bot: [V: 04-1] WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (owner: 10Effie Mouzeli) [10:40:16] (03PS6) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [10:41:43] (03CR) 10jerkins-bot: [V: 04-1] WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (owner: 10Effie Mouzeli) [10:41:49] PROBLEM - MariaDB Replica Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 193468.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:41:55] (03CR) 10Muehlenhoff: [C: 03+2] Grant datacenter-ops access to cuminunpriv [puppet] - 10https://gerrit.wikimedia.org/r/699164 (owner: 10Muehlenhoff) [10:42:11] PROBLEM - MariaDB Replica Lag: s5 on db2111 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 193492.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:42:12] checking db2075 [10:42:15] uh? 
[10:42:25] PROBLEM - MariaDB Replica Lag: s5 on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 193504.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:42:29] Ah I think I know what it is [10:43:01] PROBLEM - MariaDB Replica Lag: s5 on db2113 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 193542.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:43:07] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/WikiEditor/modules/jquery.wikiEditor.js: 8a17c43c5470b84ba58239bb2cf947dbebf1979f: Fix call to renamed var (T284716) (duration: 01m 25s) [10:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] T284716: Code editor failing to load completely on English Wiktionary - https://phabricator.wikimedia.org/T284716 [10:43:13] PROBLEM - MariaDB Replica Lag: s5 on db2128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 193553.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:43:17] fixed [10:43:26] 👀 [10:43:26] it was the heartbeat that I forgot to start after the reboot [10:43:31] RECOVERY - MariaDB Replica Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:43:34] ah :) [10:43:57] RECOVERY - MariaDB Replica Lag: s5 on db2111 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:03] marostegui: i need to fix that. 
i don't like the current behaviour (even, or especially, because it's my fault) [10:44:09] RECOVERY - MariaDB Replica Lag: s5 on db2137 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:14] hahaha [10:44:45] RECOVERY - MariaDB Replica Lag: s5 on db2113 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:46] yeah, maybe we need to find a way to get it up automatically or something after a reboot [10:44:57] RECOVERY - MariaDB Replica Lag: s5 on db2128 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:45:00] let me fix that now, actually. [10:45:08] <3 [10:46:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] netbox - cas: only import cas view if we have cas enabled [software/netbox] - 10https://gerrit.wikimedia.org/r/699165 (owner: 10Jbond) [10:46:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16410 and previous config saved to /var/cache/conftool/dbconfig/20210610-104635-root.json [10:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:19] !log T283163: Adding "metric-out minimum-igp" to BGP group Confed_eqord on eqiad, codfw and eqdfw CRs. [10:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:24] T283163: BGP Policy on aggregate routes prevents them being created in some circumstances. 
- https://phabricator.wikimedia.org/T283163 [10:48:55] (03PS1) 10Jbond: Update to v2.10.4-wmf4 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/699188 (https://phabricator.wikimedia.org/T244849) [10:52:04] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/699188 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [10:53:40] (03CR) 10Jbond: [V: 03+2 C: 03+2] Update to v2.10.4-wmf4 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/699188 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [10:58:22] 10SRE, 10Wikimedia-Mailing-lists: Please close the wmfkids@ mailing list - https://phabricator.wikimedia.org/T284683 (10Volans) @greg @zeljkofilipin being a private ML, should we just disable the list or fully remove it? What about its archives? [10:58:59] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: rename reboot->roll_reboot [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/699163 (owner: 10David Caro) [10:59:04] (03CR) 10David Caro: [C: 03+2] wmcs: Fixed docstring on CephController.get_nodes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698988 (owner: 10David Caro) [10:59:24] !log jbond@deploy1002 Started deploy [netbox/deploy@e9f2382]: deploy v2.10.4-wmf4 to netbox-next [10:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:55] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for EU Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T1100). 
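On the s5 replica-lag alerts between 10:41 and 10:44: the explanation given in the channel is that the heartbeat was not started after a reboot. The sketch below assumes a pt-heartbeat-style check, where lag is measured as "now minus the newest heartbeat timestamp that has reached the replica"; that mechanism is an assumption, since the log does not describe how the check works, and `apparent_lag_seconds` is a hypothetical name.

```python
from datetime import datetime, timedelta


def apparent_lag_seconds(last_heartbeat, now):
    """Lag as a heartbeat-based check would report it.

    Assumption: the check compares the current time with the newest
    heartbeat timestamp seen on the replica. If the heartbeat writer is
    stopped, this value keeps growing even though replication is healthy.
    """
    return (now - last_heartbeat).total_seconds()


if __name__ == "__main__":
    now = datetime(2021, 6, 10, 10, 41, 49)
    # 193468.53 seconds is the value reported for db2075 in the log.
    stale = now - timedelta(seconds=193468.53)
    print(apparent_lag_seconds(stale, now))
```

Under that assumption, a reported lag of about 193,000 seconds (a little over two days) on every s5 replica at once points at a stale heartbeat rather than at the replicas, which is consistent with the alerts clearing within seconds of the fix.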
[11:00:17] (03PS1) 10Jbond: netbox-next: switch to ldap [puppet] - 10https://gerrit.wikimedia.org/r/699172 [11:00:17] !log jbond@deploy1002 Finished deploy [netbox/deploy@e9f2382]: deploy v2.10.4-wmf4 to netbox-next (duration: 00m 53s) [11:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:00] there are no patches scheduled and no one has signed up for this training slot. as such, I will not be joining the google meet for the training. [11:01:01] (03CR) 10Jbond: [C: 03+2] netbox-next: switch to ldap [puppet] - 10https://gerrit.wikimedia.org/r/699172 (owner: 10Jbond) [11:01:22] (03PS1) 10Jbond: Revert "netbox-next: switch to ldap" [puppet] - 10https://gerrit.wikimedia.org/r/699173 [11:01:34] I will however stick around for 0 minutes or so in case a self-service deployer wants to sneak in at the last minute [11:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 20%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16411 and previous config saved to /var/cache/conftool/dbconfig/20210610-110139-root.json [11:01:41] s/0/10/ ;-D [11:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:57] (03Merged) 10jenkins-bot: wmcs: Fixed docstring on CephController.get_nodes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698988 (owner: 10David Caro) [11:01:59] (03Merged) 10jenkins-bot: wmcs.ceph: Add cookbook to reboot mons [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/698989 (https://phabricator.wikimedia.org/T281248) (owner: 10David Caro) [11:02:01] (03Merged) 10jenkins-bot: wmcs.ceph.reboot_node: use newer icinga_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/699154 (owner: 10David Caro) [11:02:46] (03Merged) 10jenkins-bot: wmcs.ceph: rename reboot->roll_reboot [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/699163 (owner: 10David Caro) [11:03:43] 10SRE, 10Traffic, 10netops: BGP Policy on aggregate routes prevents them 
being created in some circumstances. - https://phabricator.wikimedia.org/T283163 (10cmooney) Ok configuration has been added to cr1-eqiad, cr2-eqiad and cr2-codfw (routers with transport links to eqord). Looks to have been successful.... [11:05:51] (03CR) 10Jbond: [C: 03+2] Revert "netbox-next: switch to ldap" [puppet] - 10https://gerrit.wikimedia.org/r/699173 (owner: 10Jbond) [11:12:34] seeing no one, I'm back to my usual work, ping me if needed, etc. [11:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 30%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16412 and previous config saved to /var/cache/conftool/dbconfig/20210610-111643-root.json [11:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:06] (03PS1) 10Ssingh: install_server: add DHCP entries for doh1001 and doh1002 [puppet] - 10https://gerrit.wikimedia.org/r/699191 (https://phabricator.wikimedia.org/T284348) [11:21:55] (03CR) 10Ssingh: "Reference: https://phabricator.wikimedia.org/T284348#7147472" [puppet] - 10https://gerrit.wikimedia.org/r/699191 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [11:25:59] (03CR) 10Ssingh: [C: 03+2] install_server: add DHCP entries for doh1001 and doh1002 [puppet] - 10https://gerrit.wikimedia.org/r/699191 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [11:27:45] (03CR) 10Ssingh: "(Partman configuration is not required as it is taken care by doh*.)" [puppet] - 10https://gerrit.wikimedia.org/r/699191 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [11:31:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 40%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16413 and previous config saved to /var/cache/conftool/dbconfig/20210610-113146-root.json [11:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] (03PS1) 10Muehlenhoff: Check active Kerberos 
ticket when running as non-root [software/cumin] - 10https://gerrit.wikimedia.org/r/699192 [11:42:56] (03PS7) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [11:44:24] (03CR) 10jerkins-bot: [V: 04-1] WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (owner: 10Effie Mouzeli) [11:45:08] (03CR) 10Jgiannelos: [C: 04-1] osm: create missing imposm directories, add mirror support to import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [11:45:48] (03CR) 10jerkins-bot: [V: 04-1] Check active Kerberos ticket when running as non-root [software/cumin] - 10https://gerrit.wikimedia.org/r/699192 (owner: 10Muehlenhoff) [11:46:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16414 and previous config saved to /var/cache/conftool/dbconfig/20210610-114650-root.json [11:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:02] (03PS8) 10Effie Mouzeli: WIP add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 [11:53:18] (03PS2) 10Muehlenhoff: Check active Kerberos ticket when running as non-root [software/cumin] - 10https://gerrit.wikimedia.org/r/699192 [11:59:20] (03CR) 10jerkins-bot: [V: 04-1] Check active Kerberos ticket when running as non-root [software/cumin] - 10https://gerrit.wikimedia.org/r/699192 (owner: 10Muehlenhoff) [12:01:38] (03PS1) 10Jbond: puppetmaster: update the private repo pre-commit hook to check staged [puppet] - 10https://gerrit.wikimedia.org/r/699196 (https://phabricator.wikimedia.org/T278187) [12:01:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 60%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16415 and previous config saved to 
/var/cache/conftool/dbconfig/20210610-120153-root.json [12:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:29] (03PS3) 10Muehlenhoff: Check active Kerberos ticket when running as non-root [software/cumin] - 10https://gerrit.wikimedia.org/r/699192 [12:06:00] (03PS2) 10Jbond: puppetmaster: update the private repo pre-commit hook to check staged [puppet] - 10https://gerrit.wikimedia.org/r/699196 (https://phabricator.wikimedia.org/T278187) [12:06:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29860/console" [puppet] - 10https://gerrit.wikimedia.org/r/699196 (https://phabricator.wikimedia.org/T278187) (owner: 10Jbond) [12:10:43] (03CR) 10jerkins-bot: [V: 04-1] Check active Kerberos ticket when running as non-root [software/cumin] - 10https://gerrit.wikimedia.org/r/699192 (owner: 10Muehlenhoff) [12:13:27] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, feel free to add via homer or let us know if you want us to take care of it." 
[homer/public] - 10https://gerrit.wikimedia.org/r/698971 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [12:13:37] (03CR) 10Jbond: [C: 03+2] concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 (owner: 10Jbond) [12:13:41] (03CR) 10Jbond: [C: 03+2] profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [12:13:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [12:14:03] (03PS23) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) [12:14:16] (03PS7) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 [12:14:28] (03PS24) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (https://phabricator.wikimedia.org/T216088) [12:14:37] (03PS25) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 (https://phabricator.wikimedia.org/T216088) [12:16:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16416 and previous config saved to /var/cache/conftool/dbconfig/20210610-121657-root.json [12:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:05] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02417 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:24:42] * jbond looking ^^ [12:25:45] 10SRE, 
10Traffic, 10observability: Implement SLI measurement for Varnish Frontend - https://phabricator.wikimedia.org/T284576 (10ema) I wrote a varnishlog consumer to see the actual maximum values for each timestamp we're currently getting: ` #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ VarnishMaxTi... [12:26:14] (03PS1) 10Urbanecm: wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) [12:27:02] (03PS1) 10Jbond: Revert "(Test): Example PR demonstrating the contacts profile" [puppet] - 10https://gerrit.wikimedia.org/r/699174 [12:27:04] (03PS1) 10Jbond: Revert "profile::contacts: add a profile and define for adding c..." [puppet] - 10https://gerrit.wikimedia.org/r/699175 [12:27:06] (03PS1) 10Jbond: Revert "concat: Add puppetlabs-concat module" [puppet] - 10https://gerrit.wikimedia.org/r/699176 [12:27:58] (03CR) 10jerkins-bot: [V: 04-1] Revert "profile::contacts: add a profile and define for adding c..." [puppet] - 10https://gerrit.wikimedia.org/r/699175 (owner: 10Jbond) [12:28:05] (03CR) 10jerkins-bot: [V: 04-1] Revert "concat: Add puppetlabs-concat module" [puppet] - 10https://gerrit.wikimedia.org/r/699176 (owner: 10Jbond) [12:28:29] (03PS2) 10Jbond: Revert "concat: Add puppetlabs-concat module" [puppet] - 10https://gerrit.wikimedia.org/r/699176 [12:28:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "(Test): Example PR demonstrating the contacts profile" [puppet] - 10https://gerrit.wikimedia.org/r/699174 (owner: 10Jbond) [12:29:21] (03CR) 10Jbond: [C: 03+2] Revert "concat: Add puppetlabs-concat module" [puppet] - 10https://gerrit.wikimedia.org/r/699176 (owner: 10Jbond) [12:29:37] (03PS2) 10Jbond: Revert "profile::contacts: add a profile and define for adding c..." 
[puppet] - 10https://gerrit.wikimedia.org/r/699175 [12:29:51] (03PS2) 10Jbond: Revert "(Test): Example PR demonstrating the contacts profile" [puppet] - 10https://gerrit.wikimedia.org/r/699174 [12:29:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "(Test): Example PR demonstrating the contacts profile" [puppet] - 10https://gerrit.wikimedia.org/r/699174 (owner: 10Jbond) [12:30:05] (03PS3) 10Jbond: Revert "profile::contacts: add a profile and define for adding c..." [puppet] - 10https://gerrit.wikimedia.org/r/699175 [12:30:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "profile::contacts: add a profile and define for adding c..." [puppet] - 10https://gerrit.wikimedia.org/r/699175 (owner: 10Jbond) [12:30:19] (03PS3) 10Jbond: Revert "concat: Add puppetlabs-concat module" [puppet] - 10https://gerrit.wikimedia.org/r/699176 [12:30:34] (03PS2) 10Urbanecm: wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) [12:32:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16417 and previous config saved to /var/cache/conftool/dbconfig/20210610-123201-root.json [12:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:15] (03PS1) 10Jbond: Revert^2 "concat: Add puppetlabs-concat module" [puppet] - 10https://gerrit.wikimedia.org/r/699177 [12:32:21] (03PS1) 10Jbond: Revert "Revert "profile::contacts: add a profile and define for ..." [puppet] - 10https://gerrit.wikimedia.org/r/699178 [12:32:25] (03PS1) 10Jbond: Revert "Revert "(Test): Example PR demonstrating the contacts pr..." 
[puppet] - 10https://gerrit.wikimedia.org/r/699179 [12:32:54] (03PS2) 10Jbond: oncat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/699177 [12:33:05] (03PS3) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/699177 [12:33:21] (03PS2) 10Jbond: oncat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/699178 [12:33:33] (03PS3) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/699178 (https://phabricator.wikimedia.org/T216088) [12:33:43] (03PS4) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/699178 (https://phabricator.wikimedia.org/T216088) [12:34:10] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "(Test): Example PR demonstrating the contacts pr..." [puppet] - 10https://gerrit.wikimedia.org/r/699179 (owner: 10Jbond) [12:34:22] (03PS2) 10Jbond: ssretest: Add contacts to sretest [puppet] - 10https://gerrit.wikimedia.org/r/699179 (https://phabricator.wikimedia.org/T216088) [12:34:42] (03PS3) 10Jbond: ssretest: Add contacts to sretest [puppet] - 10https://gerrit.wikimedia.org/r/699179 (https://phabricator.wikimedia.org/T216088) [12:39:03] (03CR) 10Volans: [C: 03+1] "I did't test it but looks sane." [puppet] - 10https://gerrit.wikimedia.org/r/699196 (https://phabricator.wikimedia.org/T278187) (owner: 10Jbond) [12:40:34] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10SRE-swift-storage: tegola-vector-tiles load testing and Swift throughput experiments - https://phabricator.wikimedia.org/T284440 (10LSobanski) [12:42:15] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:45:28] 10SRE, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10cmooney) Thus far we have: 1. Validated your IP range or any subset thereof is not on any ban or block lists. 2. Validated we can route from our front-end load-balancer IPs to your... [12:55:16] (03PS1) 10Marostegui: clouddb1015: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699204 [12:55:34] (03PS4) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/699177 [12:55:36] (03CR) 10Kosta Harlan: [C: 03+1] wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) (owner: 10Urbanecm) [12:56:12] (03CR) 10Marostegui: [C: 03+2] clouddb1015: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699204 (owner: 10Marostegui) [12:56:30] (03PS5) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/699178 (https://phabricator.wikimedia.org/T216088) [12:56:40] (03PS4) 10Jbond: ssretest: Add contacts to sretest [puppet] - 10https://gerrit.wikimedia.org/r/699179 (https://phabricator.wikimedia.org/T216088) [12:57:27] (03CR) 10Jbond: [C: 03+2] concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/699177 (owner: 10Jbond) [13:08:41] (03CR) 10Jbond: [C: 03+2] profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/699178 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [13:08:53] (03CR) 10Jbond: [C: 03+2] ssretest: Add contacts to sretest [puppet] - 10https://gerrit.wikimedia.org/r/699179 (https://phabricator.wikimedia.org/T216088) (owner: 10Jbond) [13:14:48] (03PS1) 10Hashar: beta: add warning motd 
and link to term of uses [puppet] - 10https://gerrit.wikimedia.org/r/699207 (https://phabricator.wikimedia.org/T100837) [13:16:14] 10SRE, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10Patch-For-Review: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837 (10hashar) a:03hashar It is never too late to fix a half decade old task. So here is the patch: [[ https://ge... [13:23:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] profile::docker::builder: add the istio use case to image builders (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:25:58] 10SRE, 10Patch-For-Review, 10User-jbond: Filter (if possible) downtimed hosts from check_puppet_run_changes.py's report - https://phabricator.wikimedia.org/T268211 (10jbond) 05Open→03Resolved a:03jbond I have now moved this check to cumin and we use spicerack to ensure we only include hosts which match... 
[13:26:10] (03PS4) 10Elukey: profile::docker::builder: add the istio use case to image builders [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) [13:26:51] (03CR) 10Elukey: "duly noted, I miscopied values :(" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:27:13] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29861/console" [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:31:11] (03CR) 10Ssingh: "> Patch Set 2: Code-Review+1" [homer/public] - 10https://gerrit.wikimedia.org/r/698971 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [13:31:18] (03CR) 10Ssingh: [C: 03+2] Add doh4001 to BGP anycast in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/698971 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [13:32:01] (03Merged) 10jenkins-bot: Add doh4001 to BGP anycast in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/698971 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [13:32:13] (03PS1) 10Ayounsi: Add profile::contact to multiple roles/profiles [puppet] - 10https://gerrit.wikimedia.org/r/699209 [13:34:06] (03CR) 10Elukey: [V: 03+1 C: 03+2] "The diff shows Joe's point being addressed, going to merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/699156 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:36:44] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10ssingh) > INFO:homer.transports.junos:Committing the configuration on cr4-ulsfo.wikimedia.org > INFO:homer:Homer run completed successfully on 2 devices:... 
[13:36:50] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/699209 (owner: 10Ayounsi) [13:38:02] (03PS1) 10Elukey: profile::docker: add ca config to the build istio config [puppet] - 10https://gerrit.wikimedia.org/r/699212 (https://phabricator.wikimedia.org/T278192) [13:39:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29862/console" [puppet] - 10https://gerrit.wikimedia.org/r/699212 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:39:43] (03CR) 10Hnowlan: osm: create missing imposm directories, add mirror support to import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [13:41:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::docker: add ca config to the build istio config [puppet] - 10https://gerrit.wikimedia.org/r/699212 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:43:30] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) I can confirm the 185.71.138.0/24 prefix is now being announced to peers from ulsfo, for example: ` cmooney@cr4-ulsfo> show route advertising-pro... [13:44:47] (03PS3) 10Hnowlan: osm: create missing imposm directories, add mirror support to import [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) [13:48:46] (03PS4) 10Hnowlan: osm: create missing imposm directories, add mirror support to import [puppet] - 10https://gerrit.wikimedia.org/r/699044 (https://phabricator.wikimedia.org/T269582) [13:50:54] (03PS1) 10Kormat: mariadb: Automatically manage pt-heartbeat. [puppet] - 10https://gerrit.wikimedia.org/r/699213 [13:53:10] (03PS2) 10Kormat: mariadb: Automatically manage pt-heartbeat. 
[puppet] - 10https://gerrit.wikimedia.org/r/699213 [13:55:00] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29863/console" [puppet] - 10https://gerrit.wikimedia.org/r/699213 (owner: 10Kormat) [13:55:17] (03PS1) 10David Caro: tools: try to alleviate sudo crashing when triggering oom [puppet] - 10https://gerrit.wikimedia.org/r/699216 (https://phabricator.wikimedia.org/T284130) [13:56:31] (03PS1) 10Ssingh: Add doh1001 and doh1002 to BGP anycast in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/699217 (https://phabricator.wikimedia.org/T283503) [13:56:55] (03CR) 10Elukey: "Interesting: from https://github.com/kubeflow/kfserving/blob/42007c532286a1e43893ef2be03b15e104bfd7a4/config/rbac/kustomization.yaml#L6-8 " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:57:05] (03CR) 10jerkins-bot: [V: 04-1] tools: try to alleviate sudo crashing when triggering oom [puppet] - 10https://gerrit.wikimedia.org/r/699216 (https://phabricator.wikimedia.org/T284130) (owner: 10David Caro) [13:59:09] (03CR) 10Elukey: "I am going to just drop kube-rbac-proxy for the moment, will send a separate code review if needed." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:02:35] (03PS1) 10Wikitrent: Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699218 (https://phabricator.wikimedia.org/T283711) [14:02:37] (03PS11) 10Elukey: Add base kubeflow kfserving images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) [14:02:39] (03PS10) 10Elukey: Add Jetstack's cert-manager base go images. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) [14:03:16] (03CR) 10Giuseppe Lavagetto: WIP add nutcracker pools for kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699150 (owner: 10Effie Mouzeli) [14:03:53] (03CR) 10Elukey: [C: 03+2] "Going to merge the patch due to the +1s added before (simply dropped kube-rbac-proxy)." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:03:58] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add base kubeflow kfserving images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:08:41] (03CR) 10Andrew Bogott: "lgtm!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 (owner: 10David Caro) [14:09:54] (03PS1) 10Ssingh: site: switch doh1001 and doh1002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/699220 (https://phabricator.wikimedia.org/T284348) [14:12:16] (03PS2) 10Wikitrent: Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699218 (https://phabricator.wikimedia.org/T283711) [14:14:59] (03PS9) 10Effie Mouzeli: nutcracker::yaml_defs: add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (https://phabricator.wikimedia.org/T284420) [14:15:36] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Get the VMs running before configuring the routers if they aren't already. Thanks." 
[homer/public] - 10https://gerrit.wikimedia.org/r/699217 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [14:16:28] (03CR) 10jerkins-bot: [V: 04-1] nutcracker::yaml_defs: add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [14:17:16] (03CR) 10Ssingh: "> Patch Set 1: Code-Review+1" [homer/public] - 10https://gerrit.wikimedia.org/r/699217 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [14:23:42] (03CR) 10Ssingh: [C: 03+2] site: switch doh1001 and doh1002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/699220 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [14:23:47] (03PS2) 10David Caro: tools: try to alleviate sudo crashing when triggering oom [puppet] - 10https://gerrit.wikimedia.org/r/699216 (https://phabricator.wikimedia.org/T284130) [14:25:12] (03CR) 10jerkins-bot: [V: 04-1] tools: try to alleviate sudo crashing when triggering oom [puppet] - 10https://gerrit.wikimedia.org/r/699216 (https://phabricator.wikimedia.org/T284130) (owner: 10David Caro) [14:26:35] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (10ssingh) 05Open→03Resolved a:03ssingh doh1001 and doh1002 have been created; closing this task. Thanks for the help! 
[14:27:13] (03CR) 10Ssingh: [C: 03+2] Add doh1001 and doh1002 to BGP anycast in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/699217 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [14:27:49] (03Merged) 10jenkins-bot: Add doh1001 and doh1002 to BGP anycast in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/699217 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [14:29:05] (03PS1) 10Cwhite: rsyslog: add log.level to ecs compatible templates [puppet] - 10https://gerrit.wikimedia.org/r/699222 [14:29:31] (03PS1) 10Ema: varnish: add timing data to varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576) [14:29:42] (03PS5) 10Elukey: Add the custom_deploy.d directory with basic Istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) [14:30:51] (03PS2) 10Ema: varnish: add timing data to varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576) [14:32:50] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10ssingh) > INFO:homer.transports.junos:Committing the configuration on cr2-eqiad.wikimedia.org > INFO:homer:Homer run completed successfully on 2 devices:... 
[14:33:12] (03PS3) 10David Caro: tools: try to alleviate sudo crashing when triggering oom [puppet] - 10https://gerrit.wikimedia.org/r/699216 (https://phabricator.wikimedia.org/T284130) [14:34:42] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) [14:37:22] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10ssingh) Additional confirmation, since I am enjoying the reduced latency of the new Toronto -> eqiad route instead of the old Toronto -> codfw :) ` kdig... [14:38:03] 10SRE, 10Traffic, 10netops, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) Yep! Seeing very nice latency from NY to wikidough now :) ` root@nyc2:~# mtr -b -w -z -c 5 185.71.138.138 Start: 2021-06-10T16:35:02+0200 HOST:... 
[14:39:41] (03PS10) 10Effie Mouzeli: nutcracker::yaml_defs: add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (https://phabricator.wikimedia.org/T284420) [14:41:07] (03CR) 10jerkins-bot: [V: 04-1] nutcracker::yaml_defs: add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [14:41:38] (03CR) 10David Caro: [C: 03+2] "> Patch Set 3:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 (owner: 10David Caro) [14:41:54] (03PS11) 10Effie Mouzeli: nutcracker::yaml_defs: add nutcracker pools for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/699150 (https://phabricator.wikimedia.org/T284420) [14:51:45] PROBLEM - Check systemd state on doh1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:40] (03CR) 10Marostegui: "If for whatever reason we want to stop it manually, is there a way of doing so apart from chmod -x the binary? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/699213 (owner: 10Kormat) [14:54:11] (03CR) 10Kormat: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/699213 (owner: 10Kormat) [14:54:47] (03CR) 10Marostegui: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/699213 (owner: 10Kormat) [15:03:29] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:03:51] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:09:00] power down ms-be2038 for BBU replacement [15:09:11] !power down ms-be2038 for BBU replacement [15:09:24] !log power down ms-be2038 for BBU replacement [15:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [15:10:05] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:11:57] PROBLEM - Host ms-be2038 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:47] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:14:17] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:16:00] jbond: heyo, this patch broke cloud puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/699178, as it expects _role to exist, can we revert it? [15:16:08] (or fix it if you have a quick idea on how) [15:16:14] maybe put a default value? xd [15:16:37] that might actually be good :/ [15:18:20] jbond: I'll try to fix that, let me know when you are around to see if that's ok or should be done differently [15:18:33] RECOVERY - Host ms-be2038 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [15:18:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [15:20:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) a:05RobH→03Cmjohnson @cmjohnson, Please review and test the following servers, as their mgmt is offline. This can be caused by the cable not being prope... 
[15:21:26] 10SRE, 10DC-Ops, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10RobH) [15:24:16] (03PS1) 10David Caro: contacts: don't fail if _role is not defined [puppet] - 10https://gerrit.wikimedia.org/r/699233 [15:24:25] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: add log.level to ecs compatible templates [puppet] - 10https://gerrit.wikimedia.org/r/699222 (owner: 10Cwhite) [15:24:36] jbond: ^ there you go https://gerrit.wikimedia.org/r/c/operations/puppet/+/699233 [15:24:47] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:25:06] 10SRE, 10DC-Ops, 10SRE-tools, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10Volans) I'm not sure yet how the automation side of things will look like, but there is a good chance that it could use redfish. In... [15:25:30] papaul: thank you (re: bbu), I'll keep an eye on it and see if this one does better [15:25:45] (03CR) 10jerkins-bot: [V: 04-1] contacts: don't fail if _role is not defined [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [15:26:54] godog: no problem [15:28:12] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10Papaul) a:05Papaul→03fgiunchedi BBU replaced . Please resolve task when all go. 
Thanks [15:28:29] (03PS2) 10David Caro: contacts: don't fail if _role is not defined [puppet] - 10https://gerrit.wikimedia.org/r/699233 [15:36:33] (03CR) 10Bstorm: contacts: don't fail if _role is not defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [15:37:49] (03CR) 10Bstorm: contacts: don't fail if _role is not defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [15:38:44] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:42:35] (03CR) 10David Caro: contacts: don't fail if _role is not defined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [15:44:23] (03PS3) 10David Caro: contacts: don't fail if _role is not defined on labs realm [puppet] - 10https://gerrit.wikimedia.org/r/699233 [15:44:44] (03CR) 10Bstorm: contacts: don't fail if _role is not defined on labs realm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [15:50:14] (03PS1) 10Jgiannelos: Bump mobileapps image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/699235 [15:56:27] 10SRE, 10Traffic: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (10ssingh) [15:56:35] 10SRE, 10Traffic, 10netops: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10ssingh) 05Open→03Resolved a:03ssingh Marking this as resolved as we have completed all the intended tasks for now and the routers have been configured. On our (Traffic's... [15:56:39] (03CR) 10Bstorm: [C: 03+1] "There's no way this will work in cloud, looking at the file, so I think this is good, personally. 
I don't think this will break anything a" [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [15:58:06] (03CR) 10David Caro: [V: 03+1 C: 03+1] "Tested on toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [16:00:02] 10SRE, 10DC-Ops, 10SRE-tools, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10ayounsi) From IRC conversation: We're going to do a 1 off to ease DCops pain of upgrading a large amount of firmwares. Once those 40... [16:00:03] dcaro: sorry was in a meeting looking now [16:00:04] jbond42 and cdanis: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T1600). [16:05:11] (03PS1) 10Jbond: P:base: only use P:contacts on production [puppet] - 10https://gerrit.wikimedia.org/r/699237 [16:07:16] (03CR) 10Jbond: contacts: don't fail if _role is not defined on labs realm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [16:07:24] (03CR) 10Jbond: [C: 03+2] P:base: only use P:contacts on production [puppet] - 10https://gerrit.wikimedia.org/r/699237 (owner: 10Jbond) [16:08:07] (03PS2) 10Razzi: hadoop: increase the HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733) (owner: 10Elukey) [16:08:14] (03CR) 10David Caro: [C: 03+1] "Tested and working on toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/699237 (owner: 10Jbond) [16:09:14] (03Abandoned) 10David Caro: contacts: don't fail if _role is not defined on labs realm [puppet] - 10https://gerrit.wikimedia.org/r/699233 (owner: 10David Caro) [16:09:26] dcaro: ^^ mine is deployed now so should be fixed sorry about that [16:09:59] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the 
admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [16:10:00] jbond: thanks, testedi it on toolsbeta [16:11:31] yes looks good to me thanks [16:11:47] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [16:12:01] (03Abandoned) 10Ahmon Dancy: Test commit. Disregard [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/698601 (owner: 10Ahmon Dancy) [16:12:08] (03PS2) 10Krinkle: Simplify mc.php (1/7): Fix load order in Beta to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692722 [16:12:29] (03CR) 10Krinkle: [C: 03+2] Simplify mc.php (1/7): Fix load order in Beta to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692722 (owner: 10Krinkle) [16:13:07] (03CR) 10Razzi: [C: 03+2] hadoop: increase the HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733) (owner: 10Elukey) [16:13:14] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install pc2011-pc2014 - https://phabricator.wikimedia.org/T282482 (10Papaul) [16:13:49] (03Merged) 10jenkins-bot: Simplify mc.php (1/7): Fix load order in Beta to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692722 (owner: 10Krinkle) [16:17:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "My main doubt is about metrics gathering:" (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [16:24:35] !log razzi@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [16:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:24] (03PS1) 10Giuseppe Lavagetto: mwdebug: add service proxy 
listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/699243 [16:29:19] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:16] (03PS1) 10Jbond: P:contacts: Add test to ensure role is defined and bail out early if not [puppet] - 10https://gerrit.wikimedia.org/r/699245 [16:30:28] dcaro: ^^ [16:31:18] (03PS5) 10Hnowlan: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) [16:34:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:17] (03CR) 10David Caro: [C: 03+1] P:contacts: Add test to ensure role is defined and bail out early if not [puppet] - 10https://gerrit.wikimedia.org/r/699245 (owner: 10Jbond) [16:37:26] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: no-op for Beta I2a42c222003 (duration: 01m 07s) [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:30] (03CR) 10Jbond: [C: 03+2] P:contacts: Add test to ensure role is defined and bail out early if not [puppet] - 10https://gerrit.wikimedia.org/r/699245 (owner: 10Jbond) [16:37:58] (03PS4) 10David Caro: openstack.cloudvirt.{un}set_maintenance: use current host aggregates [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 [16:41:58] (03CR) 10Jgiannelos: [C: 03+2] Bump mobileapps image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/699235 (owner: 10Jgiannelos) [16:42:15] (03CR) 10David Caro: [C: 03+2] openstack.cloudvirt.{un}set_maintenance: use current host aggregates (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/697930 (owner: 10David Caro) [16:44:22] (03Merged) 10jenkins-bot: Bump mobileapps image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/699235 
(owner: 10Jgiannelos) [16:45:42] (03PS1) 10Phuedx: Fire language change hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699183 (https://phabricator.wikimedia.org/T280770) [16:47:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10Cmjohnson) [16:47:50] 10SRE, 10ops-eqiad, 10DC-Ops: update hostname labels on logstash103[345] & db11[51-76] - https://phabricator.wikimedia.org/T273922 (10Cmjohnson) 05Open→03Resolved Fixed [16:48:17] (03PS1) 10Jgiannelos: Add blubber variant for tile pregeneration image [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/699251 [16:49:09] (03CR) 10Jgiannelos: [C: 04-1] "WIP" [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/699251 (owner: 10Jgiannelos) [16:50:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) Thanks @RobH I will take a look [16:50:21] (03CR) 10Hnowlan: postgres: use remote script on replica to resync (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [16:51:52] !log installing rails security updates [16:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:17] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) [16:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:54] 10SRE, 10Analytics, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10odimitrijevic) a:03odimitrijevic [16:59:52] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr I went to assign this to B4 port 23 but netbox has 
cloudcephosd1017 in that port. Could you please verify the correct port. Thanks [17:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T1700). [17:02:29] (03PS1) 10Cwhite: logstash: add ecs migration config for sampled webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/699254 (https://phabricator.wikimedia.org/T234565) [17:03:21] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:46] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:35] 10SRE, 10DC-Ops, 10SRE-tools, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10RobH) a:03jbond @jbond & @MoritzMuehlenhoff: Would it be ok for me to temp push the Dell firmware files to our install server via... [17:09:05] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:19] !log updating bullseye installer image to latest daily image (kernel ABI changed again) T275873 [17:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:23] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [17:12:37] (03PS2) 10Cwhite: logstash: add ecs migration config for sampled webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/699254 (https://phabricator.wikimedia.org/T234565) [17:14:22] (03PS1) 10Muehlenhoff: Update sudo permission to use run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/699255 [17:20:59] 10SRE, 10DC-Ops, 10SRE-tools, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10MoritzMuehlenhoff) >>! In T283771#7149531, @RobH wrote: > @jbond & @MoritzMuehlenhoff: > > Would it be ok for me to temp push the D... [17:21:04] (03CR) 10Cwhite: logstash: add ecs migration config for sampled webrequest logs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/699254 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:24:13] (03CR) 10Cwhite: [C: 03+2] rsyslog: add log.level to ecs compatible templates [puppet] - 10https://gerrit.wikimedia.org/r/699222 (owner: 10Cwhite) [17:34:22] (03PS1) 10Ebernhardson: Add pool counter for automated search requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699257 (https://phabricator.wikimedia.org/T284479) [17:53:49] 10SRE, 10DC-Ops, 10SRE-tools, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10RobH) >>! In T283771#7149561, @MoritzMuehlenhoff wrote: >>>! In T283771#7149531, @RobH wrote: >> @jbond & @MoritzMuehlenhoff: >> >>... 
[17:59:06] 10SRE, 10DC-Ops, 10SRE-tools, 10netops: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10MoritzMuehlenhoff) >>! In T283771#7149663, @RobH wrote: >> Which size are these files? That's fine, if it's not more than say 5 G, t... [17:59:53] (03PS1) 10Herron: wip [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/699260 [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T1800). [18:00:04] phuedx and urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] i can deploy tgoday [18:00:13] *today [18:00:21] phuedx: hello, around? [18:00:27] (03CR) 10Urbanecm: [C: 03+2] wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) (owner: 10Urbanecm) [18:00:29] (03PS3) 10Urbanecm: wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) [18:00:32] urbanecm: o/ [18:00:34] (03CR) 10Urbanecm: wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) (owner: 10Urbanecm) [18:00:40] (03CR) 10Urbanecm: [C: 03+2] wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) (owner: 10Urbanecm) [18:00:45] urbanecm: Thanks. I know what needs to be tested for the patch [18:00:52] great! 
[18:01:00] (03CR) 10Urbanecm: [C: 03+2] Fire language change hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699183 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [18:01:24] (03Merged) 10jenkins-bot: wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699198 (https://phabricator.wikimedia.org/T284597) (owner: 10Urbanecm) [18:01:30] urbanecm: Also, I have a Beta-Cluster-only patch. I've forgotten whether that can just be merged and pulled on to the deployment host or to do something else [18:01:50] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/699037 [18:01:54] phuedx: if it changes only -labs.php files, yes, just merge and pull to deployment host [18:02:07] if it also changes non-labs files (but is labs-only), then it should be synced [18:02:16] (03CR) 10Urbanecm: [C: 03+2] Drop description on beta labs test survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699037 (https://phabricator.wikimedia.org/T257695) (owner: 10Jdlrobson) [18:02:19] Aha! Thanks. 
Noted [18:03:13] 10SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (10dancy) [18:03:33] (03Merged) 10jenkins-bot: Drop description on beta labs test survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699037 (https://phabricator.wikimedia.org/T257695) (owner: 10Jdlrobson) [18:05:26] the beta only patch should be live soon [18:05:35] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: d26968c1c3b3f3e115ff37a9a138d225cabba25a: wgWelcomeSurveyExperimentalGroups: Use new syntax in CS.php (T284597; T284735) (duration: 01m 08s) [18:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:41] T284735: PHP Warning: in_array() expects parameter 2 to be array, null given - https://phabricator.wikimedia.org/T284735 [18:05:41] T284597: PHP Notice: Undefined index: questions - https://phabricator.wikimedia.org/T284597 [18:08:05] (03PS3) 10Dzahn: static-bugzilla: add config to server gzipped HTML and a test file [container/miscweb] - 10https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538) [18:10:09] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T284751 (10Naike) Hi @Aklapper please feel free to delete this ticket. Thanks! [18:11:06] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T284751 (10Urbanecm) 05Stalled→03Invalid @Naike Tickets cannot be deleted, but I set the status to Invalid. 
Best, Martin [18:11:10] (03PS2) 10Ebernhardson: Add pool counter for automated search requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699257 (https://phabricator.wikimedia.org/T284479) [18:21:54] (03Merged) 10jenkins-bot: Fire language change hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699183 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [18:23:44] phuedx: pulled onto mwdebug1001, please check [18:25:01] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:48] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T284751 (10Naike) @Aklapper do LDAP-Access-Requests need to be assigned to you, or can they be submitted unassigned? [18:27:51] phuedx: ping? [18:28:51] urbanecm: Done. Thanks! [18:29:06] phuedx: does that mean the prod patch works? [18:32:01] phuedx: ping? [18:32:04] urbanecm: Sorry. I meant "on it". No. It's not working and I'm not sure why. The patch doesn't break anything (the ULS is working correctly) but it doesn't fix the underlying issue. I think it should be reverted until I can figure out why [18:32:15] ah, ok [18:33:46] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T284751 (10Aklapper) @Naike: Hi, see https://phabricator.wikimedia.org/project/profile/1564/ for instructions [18:34:10] phuedx: were you testing it at htwiki? [18:35:14] if so, htwiki is still at wmf.7, and your backport is for wmf.9 [18:35:35] *facepalm* [18:35:42] (03PS1) 10Urbanecm: Remove sep11 interwiki link from dumpinterwiki.php [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699184 (https://phabricator.wikimedia.org/T284222) [18:36:18] phuedx: should i still revert it, or are you testing it at another wiki? 
🙂 [18:36:40] urbanecm: Thanks for pointing that out. I think it's time for me to get some more coffee [18:36:52] I've tested it on a wiki that is actually on the correct branch and it LGTM [18:36:59] great [18:37:00] syncing it out [18:38:40] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/UniversalLanguageSelector/resources/js/ext.uls.launch.js: 8aeab139879613782548b20fc11af5e66589e30a: Fire language change hook (T280770) (duration: 01m 07s) [18:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:45] T280770: Instrumentation QA for language switching - https://phabricator.wikimedia.org/T280770 [18:38:46] and, here you go [18:38:50] should be live phuedx :) [18:38:57] Thanks [18:39:10] np [18:39:50] !log urbanecm@deploy1002 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 03s) [18:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:56] (03PS1) 10RobH: adding mgmt subnet to iptable rules [puppet] - 10https://gerrit.wikimedia.org/r/699269 (https://phabricator.wikimedia.org/T283771) [18:41:31] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699270 [18:41:33] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699270 (owner: 10Urbanecm) [18:42:05] (03PS2) 10Urbanecm: Remove sep11 interwiki link from dumpinterwiki.php [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699184 (https://phabricator.wikimedia.org/T284222) [18:42:08] (03CR) 10Urbanecm: [C: 03+2] Remove sep11 interwiki link from dumpinterwiki.php [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699184 (https://phabricator.wikimedia.org/T284222) (owner: 10Urbanecm) [18:42:27] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699270 (owner: 10Urbanecm) [18:44:02] (03PS2) 
10RobH: adding mgmt subnet to iptable rules [puppet] - 10https://gerrit.wikimedia.org/r/699269 (https://phabricator.wikimedia.org/T283771) [18:44:04] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699272 [18:44:06] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699272 (owner: 10Urbanecm) [18:44:39] (03Abandoned) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699272 (owner: 10Urbanecm) [18:45:22] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 23s) [18:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:26] (03PS1) 10Cathal Mooney: Allow subset of eqiad mgmt range to connect to tftp servers. [puppet] - 10https://gerrit.wikimedia.org/r/699273 (https://phabricator.wikimedia.org/T283771) [18:45:52] (03CR) 10Dzahn: [C: 03+1] "looks good to me. this is defined in class network::constants but you have to look for non-capitalized "mgmt_networks"" [puppet] - 10https://gerrit.wikimedia.org/r/699269 (https://phabricator.wikimedia.org/T283771) (owner: 10RobH) [18:46:38] (03Merged) 10jenkins-bot: Remove sep11 interwiki link from dumpinterwiki.php [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699184 (https://phabricator.wikimedia.org/T284222) (owner: 10Urbanecm) [18:49:14] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/WikimediaMaintenance/dumpInterwiki.php: b21904e326e917f5ac6d7129a4d224380c6e4c21: Remove sep11 interwiki link from dumpinterwiki.php (duration: 01m 08s) [18:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:42] (03PS3) 10RobH: Allow mgmt range to connect to tftp servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/699269 (https://phabricator.wikimedia.org/T283771) [18:52:11] (03Abandoned) 10Cathal Mooney: Allow subset of eqiad mgmt range to connect to tftp servers. [puppet] - 10https://gerrit.wikimedia.org/r/699273 (https://phabricator.wikimedia.org/T283771) (owner: 10Cathal Mooney) [18:53:43] (03CR) 10Cathal Mooney: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/699269 (https://phabricator.wikimedia.org/T283771) (owner: 10RobH) [18:54:19] (03CR) 10RobH: [C: 03+2] Allow mgmt range to connect to tftp servers. [puppet] - 10https://gerrit.wikimedia.org/r/699269 (https://phabricator.wikimedia.org/T283771) (owner: 10RobH) [18:55:23] 10SRE, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Cmjohnson) @ayounsi circling back to this, some of these ports have now been filled. Some are still there, how do I delete them manually now? I cannot find some of them in netbox [18:57:24] 10SRE, 10ops-eqiad, 10DC-Ops: Update Documentation for dl360 Motherboard Swap - https://phabricator.wikimedia.org/T254272 (10Cmjohnson) @wiki_willy I am not sure we really need to document this, if we have any HP or Dell servers that need motherboard swaps then we should be utilizing the techs that each comp... [18:57:53] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:11] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:59:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:00:05] longma and twentyafterfour: #bothumor I � Unicode. All rise for MediaWiki train - American Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T1900). [19:03:17] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:33] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:21] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:53] (03PS1) 10Jeena Huneidi: all wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699274 [19:06:55] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699274 (owner: 10Jeena Huneidi) [19:07:37] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.9 refs T281150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699274 (owner: 10Jeena Huneidi) [19:09:21] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.9 refs T281150 [19:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:26] T281150: 1.37.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T281150 [19:27:21] (03CR) 10Ryan Kemper: [C: 03+2] Undeploy mjolnir profile from analytics [puppet] - 10https://gerrit.wikimedia.org/r/698025 (https://phabricator.wikimedia.org/T265547) (owner: 10Ebernhardson) [19:31:27] !log T265547 Cleanup following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/698025: `sudo -E cumin -b 5 'P:analytics::cluster::elasticsearch' 'sudo rm -rfv /etc/mjolnir /srv/deployment/search/mjolnir'` [19:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:32] T265547: Replace mjolnir venv deployment scheme in analytics - 
https://phabricator.wikimedia.org/T265547 [19:44:41] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:44:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:45:51] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:50:05] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:51:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:52:13] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:00:10] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.37.0-wmf.5 (duration: 03m 33s) [20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:23] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:13:09] !log installed tftp client on install1003 for debugging [20:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:21] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:50:12] (03PS10) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 
(https://phabricator.wikimedia.org/T281257) [20:58:25] (03PS11) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) [20:59:51] (03CR) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [21:04:07] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:17:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:33] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 66, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:30:07] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:36] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=testwiki discussiontools # T282699 [21:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:42] T282699: Create topic subscription table - https://phabricator.wikimedia.org/T282699 [21:36:16] !log Start of urbanecm@mwmaint1002:~$ foreachwiki extensions/WikimediaMaintenance/createExtensionTables.php discussiontools # T282699 [21:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:32] !log End of urbanecm@mwmaint1002:~$ foreachwiki extensions/WikimediaMaintenance/createExtensionTables.php discussiontools # T282699 [21:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:37] T282699: Create topic subscription table - 
https://phabricator.wikimedia.org/T282699 [21:56:35] (03PS3) 10Ebernhardson: Add pool counter for automated search requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699257 (https://phabricator.wikimedia.org/T284479) [22:03:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [22:06:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [22:12:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [22:14:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) >>! In T273915#7149157, @RobH wrote: > @cmjohnson, > > Please review and test the following servers, as their mgmt is offline. This can be caused by the cab... [22:19:58] (03CR) 10Bstorm: [C: 03+2] dumps distribution: remove mirrors.freemirror.org [puppet] - 10https://gerrit.wikimedia.org/r/698836 (owner: 10Bstorm) [22:31:13] (03CR) 10Bstorm: "That file in the prometheus module only exists to serve toolforge::grid::base (which is probably not obvious). It isn't used anywhere else" [puppet] - 10https://gerrit.wikimedia.org/r/699216 (https://phabricator.wikimedia.org/T284130) (owner: 10David Caro) [23:00:05] brennen: Your horoscope predicts another unfortunate US Backport and Config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210610T2300). [23:00:22] here - [23:00:29] looks like no patches. [23:05:18] brennen: Patch inbound. [23:05:27] (But waiting for CI.) [23:05:49] are you deploying? [23:05:53] I can, sure. [23:06:04] thanks for taking care of that [23:06:09] AntiComposite: Thanks for the fix! 
[23:07:07] James_F: is it https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Citoid/+/699315 ? [23:07:18] Yes, and it doesn't cleanly cherry-pick. :-( [23:07:26] fun! [23:07:40] would you mind if we deployed it for backport training? [23:08:04] (03PS1) 10Jforrester: CitoidInspector: rename getParameterNames to getOrderedParameterNames [extensions/Citoid] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699288 (https://phabricator.wikimedia.org/T284786) [23:08:15] thcipriani: Go for it! [23:08:21] <3 [23:08:50] It's a nice clean single-line single-file single-extension patch. [23:08:55] The best kind of back-port. [23:09:25] yes! [23:09:46] (03CR) 10Thcipriani: [C: 03+2] CitoidInspector: rename getParameterNames to getOrderedParameterNames [extensions/Citoid] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699288 (https://phabricator.wikimedia.org/T284786) (owner: 10Jforrester) [23:14:32] (03Merged) 10jenkins-bot: CitoidInspector: rename getParameterNames to getOrderedParameterNames [extensions/Citoid] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/699288 (https://phabricator.wikimedia.org/T284786) (owner: 10Jforrester) [23:21:14] AntiComposite, James_F: Patch is on mwdebug1002 [23:21:23] ready for testing [23:24:03] Thanks! [23:25:13] xSavitar: Yup, works. Thank you. [23:25:26] works for me too, thanks [23:25:45] Okay, making it live now James_F, AntiComposite [23:25:51] Excellent. [23:29:25] !log derick@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/Citoid/modules/ve/ve.ui.CitoidInspector.js: Backport: [[gerrit:699288|CitoidInspector: rename getParameterNames to getOrderedParameterNames (T284786)]] (duration: 00m 57s) [23:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:31] T284786: Automatic citation addition in VisualEditor has stopped working - https://phabricator.wikimedia.org/T284786 [23:29:48] James_F, AntiComposite: it's live. Thank you :) [23:29:59] And thanks to you, too. 
:-) [23:32:44] Thanks! [23:46:28] (03PS4) 10Dzahn: static-bugzilla: add config to server gzipped HTML and a test file [container/miscweb] - 10https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538) [23:48:33] (03PS5) 10Dzahn: static-bugzilla: add config to server gzipped HTML and a test file [container/miscweb] - 10https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538) [23:55:37] (03CR) 10Dzahn: [C: 03+2] static-bugzilla: add config to server gzipped HTML and a test file [container/miscweb] - 10https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)