[00:21:45] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [00:24:01] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [00:28:04] (PS1) Jdlrobson: [Vector] Enable table of contents on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/764454 [00:28:06] (PS1) Jdlrobson: [Cleanup] Remove non-existent config wgVectorUseWvuiSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/764455 [00:36:09] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [00:41:05] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [00:48:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: durum6001, doh6001, durum6002, stat1004, an-test-client1001, doh6002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [00:53:19] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [00:58:13] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [01:37:31] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:40:36] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:41:45] (CR) Andrew Bogott: [C: +2] Revert "Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005" [dns] - https://gerrit.wikimedia.org/r/764421 (owner: Andrew Bogott) [01:51:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.wikimedia.org with OS bullseye [01:52:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:52] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.wikimedia.org with reason: host reimage [02:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:06:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:35] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.23 [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/764459 [02:07:37] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.23 [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/764459 (owner: TrainBranchBot) [02:08:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.wikimedia.org with reason: host reimage [02:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:25] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.23 [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/764459 (owner: TrainBranchBot) [02:23:07] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [02:28:03] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [02:29:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:18] (PS1) Andrew Bogott: openstack::trove::service::victoria: create /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764461 [02:32:53] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:33:04] (CR) jerkins-bot: [V: -1] openstack::trove::service::victoria: create /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764461 (owner: Andrew Bogott) [02:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor 
usage over 85% - https://alerts.wikimedia.org [02:36:04] (PS2) Andrew Bogott: trove::service::victoria: create /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764461 [02:36:57] (CR) Andrew Bogott: [C: +2] trove::service::victoria: create /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764461 (owner: Andrew Bogott) [02:37:49] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:37:53] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:38:43] ack, online [02:38:52] not sure what to look at yet? [02:42:49] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [02:45:17] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [02:46:20] (PS1) Andrew Bogott: Try to fix ordering around /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764462 [02:46:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1005.wikimedia.org with OS bullseye [02:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:56] (CR) jerkins-bot: [V: -1] Try to fix ordering around /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764462 (owner: Andrew Bogott) [02:49:44] (PS1) Cwhite: hiera: enable public_clouds_shutdown [puppet] - https://gerrit.wikimedia.org/r/764463 [02:50:29] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: stat1004, doh6001, durum6002, durum6001, an-test-client1001, doh6002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [02:50:40] (CR) CDanis: [C: +1] hiera: enable public_clouds_shutdown [puppet] - https://gerrit.wikimedia.org/r/764463 (owner: Cwhite) [02:50:57] (CR) Cwhite: [C: +2] hiera: enable public_clouds_shutdown [puppet] - https://gerrit.wikimedia.org/r/764463 (owner: Cwhite) [02:50:58] (CR) RLazarus: [C: +1] hiera: enable public_clouds_shutdown [puppet] - https://gerrit.wikimedia.org/r/764463 (owner: Cwhite) [02:53:14] (PS2) Andrew Bogott: Try to fix ordering around /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764462 [02:53:50] (CR) jerkins-bot: [V: -1] Try to fix ordering around /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764462 (owner: Andrew Bogott) [02:57:49] (Primary inbound port utilisation over 80% #page) resolved: Device 
cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:57:53] (Primary outbound port utilisation over 80% #page) resolved: (2) Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [02:59:51] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [03:00:00] (PS3) Andrew Bogott: Try to fix ordering around /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764462 [03:01:56] (CR) Andrew Bogott: [C: +2] Try to fix ordering around /usr/share/trove-common/api-paste.ini [puppet] - https://gerrit.wikimedia.org/r/764462 (owner: Andrew Bogott) [03:02:17] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [03:04:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2081.codfw.wmnet with reason: Maintenance [03:04:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2081.codfw.wmnet with reason: Maintenance [03:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2081 (T302185)', diff saved to https://phabricator.wikimedia.org/P21232 and previous config saved to /var/cache/conftool/dbconfig/20220222-030456-ladsgroup.json [03:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:04] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [03:06:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2081.codfw.wmnet with OS bullseye [03:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:18:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2081.codfw.wmnet with reason: host reimage [03:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2081.codfw.wmnet with reason: host reimage [03:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:41] (PS1) Andrew Bogott: nrpe_local.cfg.erb: increase nrpe timeout to 5 minutes [puppet] - https://gerrit.wikimedia.org/r/764464 [03:34:07] (PS1) 4nn1l2: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) [mediawiki-config] - https://gerrit.wikimedia.org/r/764465 
(https://phabricator.wikimedia.org/T301647) [03:35:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2081.codfw.wmnet with OS bullseye [03:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:19] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [03:40:54] (CR) 4nn1l2: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 4nn1l2) [03:44:17] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [03:52:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2081 (T302185)', diff saved to https://phabricator.wikimedia.org/P21233 and previous config saved to /var/cache/conftool/dbconfig/20220222-035257-ladsgroup.json [03:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:06] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [03:54:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2080.codfw.wmnet with reason: Maintenance [03:54:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2080.codfw.wmnet with reason: Maintenance [03:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2080 
(T302185)', diff saved to https://phabricator.wikimedia.org/P21234 and previous config saved to /var/cache/conftool/dbconfig/20220222-035419-ladsgroup.json [03:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2080.codfw.wmnet with OS bullseye [03:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:23] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (9) node(s) change every puppet run: durum6001, stat1004, cloudcontrol1005, an-test-client1001, doh6001, cloudcontrol1003, cloudcontrol1004, durum6002, doh6002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [03:59:03] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [04:03:59] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [04:05:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [04:05:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [04:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T300992)', diff saved to https://phabricator.wikimedia.org/P21235 and previous config saved to /var/cache/conftool/dbconfig/20220222-040537-ladsgroup.json [04:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:45] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [04:07:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2080.codfw.wmnet with reason: host reimage [04:07:31] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300992)', diff saved to https://phabricator.wikimedia.org/P21236 and previous config saved to /var/cache/conftool/dbconfig/20220222-040957-ladsgroup.json [04:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2080.codfw.wmnet with reason: host reimage [04:10:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:19] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [04:21:13] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [04:24:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2080.codfw.wmnet with OS bullseye [04:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P21237 and previous config saved to /var/cache/conftool/dbconfig/20220222-042502-ladsgroup.json [04:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2080 (T302185)', diff saved to https://phabricator.wikimedia.org/P21238 and previous config saved to /var/cache/conftool/dbconfig/20220222-042940-ladsgroup.json [04:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:46] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [04:40:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P21239 and previous config saved to /var/cache/conftool/dbconfig/20220222-044006-ladsgroup.json [04:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:43] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2086.codfw.wmnet with reason: Maintenance [04:53:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2086.codfw.wmnet with reason: Maintenance [04:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2086:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21240 and previous config saved to /var/cache/conftool/dbconfig/20220222-045349-ladsgroup.json [04:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:57] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [04:54:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2086:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21241 and previous config saved to /var/cache/conftool/dbconfig/20220222-045406-ladsgroup.json [04:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2086.codfw.wmnet with OS bullseye [04:55:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300992)', diff saved to https://phabricator.wikimedia.org/P21242 and previous config saved to /var/cache/conftool/dbconfig/20220222-045511-ladsgroup.json [04:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:17] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [04:58:13] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 
job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:02:57] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:03:09] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:08:55] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:09:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:10:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2086.codfw.wmnet with reason: host reimage [05:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:13:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2086.codfw.wmnet with reason: host reimage [05:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:40] !log dbmaint on s1@codfw (T302185) [05:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:46] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:17:37] PROBLEM - 
Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:19:55] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:27:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2086.codfw.wmnet with OS bullseye [05:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2086:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21243 and previous config saved to /var/cache/conftool/dbconfig/20220222-053102-ladsgroup.json [05:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:08] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:31:59] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:34:25] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:35:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2086:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21244 and previous config saved to /var/cache/conftool/dbconfig/20220222-053525-ladsgroup.json [05:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:53] SRE, LDAP-Access-Requests: Grant Access to releasers-mediawiki for MarkAHershberger and Mglaser - https://phabricator.wikimedia.org/T302160 (Legoktm) releasers-mediawiki isn't an LDAP group, it's a shell group, you need to follow #sre-access-requests, and have one ticket per-person. That said, I don't t... [05:38:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2085.codfw.wmnet with reason: Maintenance [05:38:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2085.codfw.wmnet with reason: Maintenance [05:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2085:3311 (T302185)', diff saved to https://phabricator.wikimedia.org/P21245 and previous config saved to /var/cache/conftool/dbconfig/20220222-053836-ladsgroup.json [05:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:43] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:39:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2085:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21246 and previous config saved to /var/cache/conftool/dbconfig/20220222-053901-ladsgroup.json [05:39:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:13] SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (Legoktm) Are there Gerrit patches that [[https://gerrit.wikimedia.org/r/q/owner:sam%2540theresnotime.co.uk|this search]] doesn't pick up? No issue from me on trust or competence, but I... [05:40:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2085.codfw.wmnet with OS bullseye [05:40:36] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:25] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:49:15] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:49:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:54:09] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [05:55:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2085.codfw.wmnet with reason: host reimage [05:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2085.codfw.wmnet with reason: host reimage [05:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:19] !log dbmain on db2077 s7@codfw T302222 [06:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:25] T302222: Check and fix compressed mismatched tables - https://phabricator.wikimedia.org/T302222 [06:10:38] (PS1) Ladsgroup: auto_schema: Split dry run logs [software] - https://gerrit.wikimedia.org/r/764620 [06:11:17] (PS1) Marostegui: db2077: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/764621 (https://phabricator.wikimedia.org/T302222) [06:12:02] (CR) Marostegui: [C: +2] db2077: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/764621 (https://phabricator.wikimedia.org/T302222) (owner: Marostegui) [06:12:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2085.codfw.wmnet with OS bullseye [06:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [06:12:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [06:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db1175 (T300775)', diff saved to https://phabricator.wikimedia.org/P21247 and previous config saved to /var/cache/conftool/dbconfig/20220222-061235-marostegui.json [06:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:43] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:13:13] (03PS1) 10Marostegui: Revert "db2074,db2094: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/764397 [06:14:12] (03CR) 10Marostegui: [C: 03+2] Revert "db2074,db2094: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/764397 (owner: 10Marostegui) [06:15:05] marostegui: dbmaint not dbmain :P [06:17:47] (03PS1) 10Giuseppe Lavagetto: Revert "hiera: enable public_clouds_shutdown" [puppet] - 10https://gerrit.wikimedia.org/r/764398 [06:20:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2085:3311 (T302185)', diff saved to https://phabricator.wikimedia.org/P21248 and previous config saved to /var/cache/conftool/dbconfig/20220222-062018-ladsgroup.json [06:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:25] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [06:22:21] !log dbmaint on db2077 s7@codfw T302222 [06:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:26] T302222: Check and fix compressed mismatched tables - https://phabricator.wikimedia.org/T302222 [06:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2085:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21249 and previous config saved to /var/cache/conftool/dbconfig/20220222-062443-ladsgroup.json [06:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:03] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2082.codfw.wmnet with reason: Maintenance [06:27:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2082.codfw.wmnet with reason: Maintenance [06:27:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [06:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [06:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2082 (T302185)', diff saved to https://phabricator.wikimedia.org/P21250 and previous config saved to /var/cache/conftool/dbconfig/20220222-062711-ladsgroup.json [06:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:23] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [06:29:52] (03PS1) 10Kevin Bazira: ml-services: add glwiki, hewiki & hiwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764623 (https://phabricator.wikimedia.org/T301415) [06:31:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2082.codfw.wmnet with OS bullseye [06:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:55] (03CR) 10jerkins-bot: [V: 04-1] ml-services: add glwiki, hewiki & hiwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764623 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [06:35:09] Amir1: It wasn't logged :(? 
[06:35:31] marostegui: no because it must be "dbmaint". [06:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [06:35:42] I can make it understand dbmain as well but are you sure? [06:35:44] Amir1: [07:22:20] !log dbmaint on db2077 s7@codfw T302222 [06:35:44] T302222: Check and fix compressed mismatched tables - https://phabricator.wikimedia.org/T302222 [06:35:59] aah [06:36:02] let me see then [06:36:34] I tested it on my log, it worked so let me run the code [06:36:39] sure [06:41:04] marostegui: hmm, the code so far didn't log that dbmaint found but the log was malformed, probably I forgot to push the code to toolforge 🤦‍♂️ [06:41:12] XDDDDDD [06:42:38] (03PS1) 10Marostegui: Revert "db1105: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/764399 [06:42:50] Amir1: ^ [06:42:52] ok to merge? [06:45:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2082.codfw.wmnet with reason: host reimage [06:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:52] marostegui: sure [06:45:59] (03CR) 10Marostegui: [C: 03+2] Revert "db1105: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/764399 (owner: 10Marostegui) [06:48:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2082.codfw.wmnet with reason: host reimage [06:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:17] marostegui: I think I found the bug with your dbmaint logs, retrying and deploying [06:50:27] ok, let me know when you want me to try again [06:51:04] it should not be needed (unless you're adding more sections) [06:51:11] nop, not for now [06:51:45] basically the problem was that autologs overrode the manua(e)l ones [07:01:26] marostegui: vola [07:02:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33909/console" [puppet] - 10https://gerrit.wikimedia.org/r/764398 (owner: 10Giuseppe Lavagetto) [07:03:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2082.codfw.wmnet with OS bullseye [07:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:15] (03CR) 10Ayounsi: [C: 03+1] "+1 but be ready to re-apply it if it saturates our infra again." [puppet] - 10https://gerrit.wikimedia.org/r/764398 (owner: 10Giuseppe Lavagetto) [07:06:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] Revert "hiera: enable public_clouds_shutdown" [puppet] - 10https://gerrit.wikimedia.org/r/764398 (owner: 10Giuseppe Lavagetto) [07:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2082 (T302185)', diff saved to https://phabricator.wikimedia.org/P21251 and previous config saved to /var/cache/conftool/dbconfig/20220222-070759-ladsgroup.json [07:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:05] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [07:09:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:09:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T302185)', diff saved to https://phabricator.wikimedia.org/P21252 and previous config saved to /var/cache/conftool/dbconfig/20220222-071003-ladsgroup.json [07:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:45] !log dbmaint 
on db2104 (and its replicas) s2@codfw T300381 [07:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:51] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:18:59] RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:31] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [07:23:49] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [07:25:57] PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:26:32] (03CR) 10Elukey: ml-services: add glwiki, hewiki & hiwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764623 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [07:31:28] !log dbmaint on non-pooled hosts s2@eqiad T300381 [07:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:35] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:35:49] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [07:37:04] (03PS1) 10Elukey: Add overlayfs settings for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/764706 [07:37:06] (03PS1) 10Elukey: Add overlayfs settings for ml-serve1002 [puppet] - 10https://gerrit.wikimedia.org/r/764707 [07:37:08] (03PS1) 10Elukey: Add overlay settings for ml-serve1003 [puppet] - 10https://gerrit.wikimedia.org/r/764708 [07:37:10] (03PS1) 10Elukey: Add overlayfs settings for ml-serve1004 [puppet] - 10https://gerrit.wikimedia.org/r/764709 [07:37:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:03] (03PS2) 10Kevin Bazira: ml-services: add glwiki, hewiki & hiwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764623 (https://phabricator.wikimedia.org/T301415) [07:38:07] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [07:40:24] (03CR) 10Kevin Bazira: ml-services: add glwiki, hewiki & hiwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764623 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [07:40:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:40:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:40:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300381)', diff saved to https://phabricator.wikimedia.org/P21253 and previous config saved to /var/cache/conftool/dbconfig/20220222-074106-marostegui.json [07:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:17] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:44:21] (03CR) 10Elukey: [C: 03+2] ml-services: add glwiki, hewiki & hiwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764623 
(https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [07:46:33] RECOVERY - Check systemd state on durum6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:37] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [07:49:50] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [07:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300381)', diff saved to https://phabricator.wikimedia.org/P21254 and previous config saved to /var/cache/conftool/dbconfig/20220222-075020-marostegui.json [07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:26] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:51:10] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [07:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:59] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [07:53:31] PROBLEM - Check systemd state on durum6002 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:05] Amir1, awight, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T0800). [08:00:05] MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:21] hello [08:00:31] anyone deploying at this dreadful morning hour? :D [08:01:57] hey [08:01:59] looking [08:02:28] (03PS1) 10Marostegui: db1125: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/764710 (https://phabricator.wikimedia.org/T301879) [08:03:02] (03CR) 10Majavah: [C: 03+2] Don't suppress teardown prompt when pressing escape [extensions/VisualEditor] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764396 (https://phabricator.wikimedia.org/T302096) (owner: 10Bartosz Dziewoński) [08:03:35] MatmaRex: at least CI seems quiet this early :D [08:03:43] hah [08:05:20] (03PS2) 10Marostegui: db1125: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/764710 (https://phabricator.wikimedia.org/T301879) [08:05:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21255 and previous config saved to /var/cache/conftool/dbconfig/20220222-080525-marostegui.json [08:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T302185)', diff saved 
to https://phabricator.wikimedia.org/P21256 and previous config saved to /var/cache/conftool/dbconfig/20220222-081022-ladsgroup.json [08:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:28] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [08:11:01] (03PS3) 10Marostegui: db1125: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/764710 (https://phabricator.wikimedia.org/T301879) [08:13:01] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:13:43] XioNoX topranks I'd like to help heal the icinga config, ok to send a patch to add the new switches? that should do it I think [08:14:09] godog: ah right, I forgot about that [08:14:10] (03CR) 10Marostegui: "PCC looks good https://puppet-compiler.wmflabs.org/pcc-worker1001/33913/" [puppet] - 10https://gerrit.wikimedia.org/r/764710 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:14:22] godog: do you have the list? 
I can take care of it [08:14:34] (the ones being problematic) [08:15:03] XioNoX: at least lsw1-e3-eqiad.mgmt.eqiad.wmnet so I'm guessing all the new rows [08:15:15] (03CR) 10Marostegui: [C: 03+2] db1125: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/764710 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:15:24] godog: we don't have servers in all the racks yet, one sec [08:16:16] ah ok got it [08:16:35] godog: maybe we do https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=49&rack_group_id=50&role=server&mac_address=&has_primary_ip=&local_context_data=&virtual_chassis_member=&console_ports=&console_server_ports=&power_ports=&power_outlets=&interfaces=&pass_through_ports=&cf_purchase_date=&cf_ticket= :) [08:16:51] (03Merged) 10jenkins-bot: Don't suppress teardown prompt when pressing escape [extensions/VisualEditor] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764396 (https://phabricator.wikimedia.org/T302096) (owner: 10Bartosz Dziewoński) [08:16:54] I can see the one triggering the issue [08:17:02] I'll add E3 to monitoring [08:17:08] XioNoX: ok thank you! 
SGTM [08:18:39] MatmaRex: ok, your patch is available for testing on mwdebug1001 [08:18:53] looking [08:19:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:44] taavi: seems good [08:19:49] ok, syncing [08:19:58] (03PS2) 10Ayounsi: Icinga/netops re-organize devices [puppet] - 10https://gerrit.wikimedia.org/r/764367 [08:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21257 and previous config saved to /var/cache/conftool/dbconfig/20220222-082029-marostegui.json [08:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:37] godog: added to https://gerrit.wikimedia.org/r/c/operations/puppet/+/764367 [08:20:52] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.22/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.js: Backport: Revert: [[gerrit:764396|Don't suppress teardown prompt when pressing escape (T302096)]] (duration: 00m 49s) [08:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:58] T302096: Escape key instantly closes VE and discards changes with no confirmation - https://phabricator.wikimedia.org/T302096 [08:21:04] aand it's live [08:21:04] (running PCC) [08:21:05] (03CR) 10jerkins-bot: [V: 04-1] Icinga/netops re-organize devices [puppet] - 10https://gerrit.wikimedia.org/r/764367 (owner: 10Ayounsi) [08:21:12] anything else? 
[08:21:15] PROBLEM - mysql_up reduced availability on alert1001 is CRITICAL: 0.6319 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:21:39] thanks taavi [08:21:41] !log UTC morning deploys done [08:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:22:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:35] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33914/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/764367 (owner: 10Ayounsi) [08:24:45] (03PS3) 10Ayounsi: Icinga/netops re-organize devices [puppet] - 10https://gerrit.wikimedia.org/r/764367 [08:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P21258 and previous config saved to /var/cache/conftool/dbconfig/20220222-082527-ladsgroup.json [08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:38] (03CR) 10Ayounsi: [C: 03+2] Icinga/netops re-organize devices [puppet] - 10https://gerrit.wikimedia.org/r/764367 (owner: 10Ayounsi) [08:26:00] (03PS1) 10Marostegui: sanitarium_multiinstance.my.cnf: Remove innodb_file_format [puppet] - 10https://gerrit.wikimedia.org/r/764711 (https://phabricator.wikimedia.org/T301879) [08:27:30] XioNoX: patch LGTM overall, let me know if it works 
and/or I can help [08:27:44] yep, will do [08:28:03] godog: I forgot what hieradata/common/monitoring.yaml was used for [08:28:12] (03CR) 10Marostegui: "As expected: NOOP: https://puppet-compiler.wmflabs.org/pcc-worker1003/33915/" [puppet] - 10https://gerrit.wikimedia.org/r/764711 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:28:13] aka is it required to fix that one issue? [08:28:18] (03CR) 10Marostegui: [C: 03+2] sanitarium_multiinstance.my.cnf: Remove innodb_file_format [puppet] - 10https://gerrit.wikimedia.org/r/764711 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:28:52] XioNoX: mmhh yeah I think so, that'll create hostgroups in icinga [08:29:01] ok [08:29:42] XioNoX: it should be fine though to add all switches there now [08:30:09] cool, doing it now [08:30:52] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [08:31:51] \o/ [08:31:56] (03PS1) 10Ayounsi: Add new eqiad leaf switches to monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/764713 [08:32:02] godog, XioNoX: thanks for working on this [08:32:03] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/764713 [08:32:13] (03PS1) 10Marostegui: analytics_multiinstance.my.cnf: Remove innodb_file_format [puppet] - 10https://gerrit.wikimedia.org/r/764714 (https://phabricator.wikimedia.org/T301879) [08:33:11] topranks: sure np! [08:33:13] XioNoX: checking [08:33:47] actually confirming that we do need to change monitoring::groups, would be nice if we didn't have to anymore [08:34:06] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/764713 (owner: 10Ayounsi) [08:34:06] ok no we do [08:34:20] (03CR) 10Filippo Giunchedi: [C: 03+1] Add new eqiad leaf switches to monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/764713 (owner: 10Ayounsi) [08:34:24] PROBLEM - OSPF status on lsw1-e3-eqiad.mgmt.eqiad.wmnet is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 0/0 UP : 2 v2 P2P interfaces vs. 0 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:34:41] yeah sorry it was top of my agenda today. [08:35:01] ^^^ these kind of issues I was worried about but we can work through [08:35:02] ah right no ospf v3 there and the current script assumes there is both [08:35:12] topranks: np it happens [08:35:17] yeah exactly, for now we can also remove ospf => true from that device [08:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300381)', diff saved to https://phabricator.wikimedia.org/P21259 and previous config saved to /var/cache/conftool/dbconfig/20220222-083534-marostegui.json [08:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:41] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [08:35:57] topranks: I'm also wondering if we shouldn't use the loopback IP as target, so it actually checks the data plane reachability [08:36:22] Yeah Icinga won’t be able to ping the loopback [08:36:22] PROBLEM - BFD status on asw1-b12-drmrs.wikimedia.org is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:36:32] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt.eqiad.wmnet is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:33] PROBLEM - Juniper alarms on lsw1-e3-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:36:34] PROBLEM - OSPF status on cloudsw1-d5-eqiad.mgmt.eqiad.wmnet is CRITICAL: OSPFv2: 1/1 UP : OSPFv3: 0/0 UP : 1 v2 P2P interfaces vs. 0 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:36:34] PROBLEM - OSPF status on cloudsw1-c8-eqiad.mgmt.eqiad.wmnet is CRITICAL: OSPFv2: 1/1 UP : OSPFv3: 0/0 UP : 1 v2 P2P interfaces vs. 0 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:37:05] Ok em [08:37:24] Drmrs BFD down? Plus cloudsw connection in Eqiad? [08:37:25] (03CR) 10Marostegui: [C: 03+2] analytics_multiinstance.my.cnf: Remove innodb_file_format [puppet] - 10https://gerrit.wikimedia.org/r/764714 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:38:11] ospf is false alarm [08:38:23] bfd I'll have a look [08:38:37] Ok thanks [08:39:27] (03PS2) 10Ayounsi: Add new eqiad leaf switches to monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/764713 [08:40:08] I removed OSPF in https://gerrit.wikimedia.org/r/c/operations/puppet/+/764713 [08:40:10] running PCC [08:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P21260 and previous config saved to /var/cache/conftool/dbconfig/20220222-084031-ladsgroup.json [08:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:59] (03PS1) 10Filippo Giunchedi: karma: don't fetch alert history from icinga [puppet] - 10https://gerrit.wikimedia.org/r/764716 [08:41:37] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/764713 (owner: 10Ayounsi) [08:42:14] (03PS1) 10Bartosz Dziewoński: Add overrides for 2FA disabled notification [extensions/WikimediaMessages] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764400 (https://phabricator.wikimedia.org/T210075) [08:42:35] anyone wants to merge a wmf.23 patch? ^ (it's marked as a release blocker) [08:43:04] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33916/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/764713 (owner: 10Ayounsi) [08:43:22] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Add new eqiad leaf switches to monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/764713 (owner: 10Ayounsi) [08:43:45] (03CR) 10JMeybohm: [C: 03+1] k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [08:44:28] MatmaRex: that changes messages, needs a full scap and i18n rebuild which takes an hour or so [08:44:31] (03CR) 10Filippo Giunchedi: [C: 03+2] karma: don't fetch alert history from icinga [puppet] - 10https://gerrit.wikimedia.org/r/764716 (owner: 10Filippo Giunchedi) [08:45:00] Amir1: wmf.23 is not deployed yet though, so i thought that will just happen with the train rollout? 
[08:45:18] (03CR) 10Ladsgroup: [C: 03+2] Add overrides for 2FA disabled notification [extensions/WikimediaMessages] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764400 (https://phabricator.wikimedia.org/T210075) (owner: 10Bartosz Dziewoński) [08:45:25] MatmaRex: I can give it a try [08:45:45] if the scap-prep is not done, then the +2 should be enough [08:49:50] RECOVERY - mysql_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:50:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ArielGlenn) So... what's happening with these? Do we have some sort of schedule? [08:50:10] thanks [08:54:12] (03PS1) 10JMeybohm: Add a dedicated profile for k8s_wikikube [labs/private] - 10https://gerrit.wikimedia.org/r/764719 (https://phabricator.wikimedia.org/T290966) [08:54:13] (03PS1) 10JMeybohm: Add a dedicated profile for k8s_wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) [08:54:53] (03CR) 10jerkins-bot: [V: 04-1] Add a dedicated profile for k8s_wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:55:16] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@17a70a0]: Add aqs hourly [08:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:24] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@17a70a0]: Add aqs hourly (duration: 00m 08s) [08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T302185)', diff saved to https://phabricator.wikimedia.org/P21261 and previous config saved to 
/var/cache/conftool/dbconfig/20220222-085536-ladsgroup.json [08:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:42] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [08:56:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [08:56:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [08:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T302185)', diff saved to https://phabricator.wikimedia.org/P21262 and previous config saved to /var/cache/conftool/dbconfig/20220222-085653-ladsgroup.json [08:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [08:57:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [08:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300381)', diff saved to https://phabricator.wikimedia.org/P21263 and previous config saved to /var/cache/conftool/dbconfig/20220222-085752-marostegui.json [08:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:00] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [08:58:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 
(T302185)', diff saved to https://phabricator.wikimedia.org/P21264 and previous config saved to /var/cache/conftool/dbconfig/20220222-085835-ladsgroup.json [08:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:16] (03CR) 10Hashar: "I have changed on test so that it now verifies the package name in the changelog and in control files are matching." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) (owner: 10Hashar) [08:59:20] (03PS2) 10Hashar: Introduce lint command [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) [09:00:02] (03Merged) 10jenkins-bot: Add overrides for 2FA disabled notification [extensions/WikimediaMessages] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764400 (https://phabricator.wikimedia.org/T210075) (owner: 10Bartosz Dziewoński) [09:00:11] (03PS2) 10JMeybohm: Add a dedicated profile for k8s_wikikube [labs/private] - 10https://gerrit.wikimedia.org/r/764719 (https://phabricator.wikimedia.org/T290966) [09:00:19] (03PS2) 10JMeybohm: Add a dedicated profile for k8s_wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) [09:00:49] (03CR) 10jerkins-bot: [V: 04-1] Introduce lint command [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) (owner: 10Hashar) [09:01:33] grblblb [09:01:36] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add a dedicated profile for k8s_wikikube [labs/private] - 10https://gerrit.wikimedia.org/r/764719 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300381)', diff saved to https://phabricator.wikimedia.org/P21265 and previous config saved to /var/cache/conftool/dbconfig/20220222-090226-marostegui.json [09:02:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33918/console" [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:04:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:04:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:26] (03PS3) 10Hashar: Introduce lint command [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) [09:05:32] (03PS1) 10Ayounsi: Bird: disable multihop when peer is the default route [puppet] - 10https://gerrit.wikimedia.org/r/764720 [09:07:09] (03PS2) 10Ayounsi: Bird: disable multihop when peer is the default route [puppet] - 10https://gerrit.wikimedia.org/r/764720 [09:07:30] (03PS6) 10Hashar: Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [09:14:03] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33920/" [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [09:14:04] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:15:52] (03PS1) 10JMeybohm: Add credentiald for cfssl-issuter to deployment_server_secrets [labs/private] - 10https://gerrit.wikimedia.org/r/764722 (https://phabricator.wikimedia.org/T290966) [09:16:16] (03CR) 10Jbond: R:varnish:instance: Add general public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [09:17:29] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add credentiald for cfssl-issuter to deployment_server_secrets [labs/private] - 10https://gerrit.wikimedia.org/r/764722 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:17:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21266 and previous config saved to /var/cache/conftool/dbconfig/20220222-091730-marostegui.json [09:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:45] (03PS1) 10JMeybohm: Enable ingress and cert-manager in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764723 (https://phabricator.wikimedia.org/T290966) [09:25:16] (03PS1) 10Ayounsi: Icinga: add drmrs routers mgmt interface [puppet] - 10https://gerrit.wikimedia.org/r/764725 [09:25:58] RECOVERY - traffic_server tls process restarted on cp6014 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=tls [09:27:51] (03PS1) 10Filippo Giunchedi: sre: adjust ProbeDown alert [alerts] - 10https://gerrit.wikimedia.org/r/764726 (https://phabricator.wikimedia.org/T291946) [09:28:14] RECOVERY - traffic_server backend process restarted on cp6014 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server 
https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=backend [09:29:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:29:40] RECOVERY - traffic_server tls process restarted on cp6016 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6016&var-layer=tls [09:29:53] (03CR) 10Majavah: [C: 04-1] Bird: disable multihop when peer is the default route (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [09:29:56] (03CR) 10jerkins-bot: [V: 04-1] sre: adjust ProbeDown alert [alerts] - 10https://gerrit.wikimedia.org/r/764726 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:30:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think this is a good idea at the moment for a series of reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [09:31:16] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt.eqiad.wmnet is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:32:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21267 and previous config saved to /var/cache/conftool/dbconfig/20220222-093235-marostegui.json [09:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:56] RECOVERY - Check systemd state on cp6016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:04] (03CR) 10Ayounsi: [V: 03+1] 
"https://puppet-compiler.wmflabs.org/pcc-worker1002/33921/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/764725 (owner: 10Ayounsi) [09:35:18] RECOVERY - traffic_server tls process restarted on cp6010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=tls [09:36:08] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6010 is OK: HTTP OK: HTTP/1.0 200 OK - 25389 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:36:24] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6010 is OK: HTTP OK: HTTP/1.1 200 Ok - 33730 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:36:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1099.eqiad.wmnet with OS bullseye [09:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:50] RECOVERY - Ensure traffic_server is running for instance backend on cp6010 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:37:48] (03PS2) 10Filippo Giunchedi: sre: adjust ProbeDown alert [alerts] - 10https://gerrit.wikimedia.org/r/764726 (https://phabricator.wikimedia.org/T291946) [09:38:01] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [09:38:02] !log Deploying analytics/refinery on hadoop-test only. 
[09:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:57] (03PS3) 10Ayounsi: Bird: disable multihop when peer is the default route [puppet] - 10https://gerrit.wikimedia.org/r/764720 [09:41:12] (03PS3) 10JMeybohm: Add a dedicated profile for k8s_wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) [09:42:21] (03CR) 10JMeybohm: Add a dedicated profile for k8s_wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:42:35] (03CR) 10JMeybohm: [C: 03+2] Add a dedicated profile for k8s_wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764718 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:43:38] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:05] (03CR) 10Ayounsi: Bird: disable multihop when peer is the default route (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764720 (owner: 10Ayounsi) [09:44:29] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: adjust ProbeDown alert [alerts] - 10https://gerrit.wikimedia.org/r/764726 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:45:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1099.eqiad.wmnet with reason: host reimage [09:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:24] (03PS1) 10JMeybohm: Add k8s-inress-wikikube LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/764728 (https://phabricator.wikimedia.org/T290966) [09:47:28] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Joe) Any update on this? 
This upgrade is blocking serviceops who needs bullseye for the kubernetes python libraries and cookbooks. [09:47:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300381)', diff saved to https://phabricator.wikimedia.org/P21268 and previous config saved to /var/cache/conftool/dbconfig/20220222-094740-marostegui.json [09:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:46] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [09:47:54] !log aqu@deploy1002 Started deploy [analytics/refinery@ed5c9f9] (hadoop-test): Migrate aqs/hourly to Airflow TEST [analytics/refinery@ed5c9f9] [09:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:58] !log aqu@deploy1002 Finished deploy [analytics/refinery@ed5c9f9] (hadoop-test): Migrate aqs/hourly to Airflow TEST [analytics/refinery@ed5c9f9] (duration: 00m 03s) [09:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1099.eqiad.wmnet with reason: host reimage [09:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:24] !log restarting cr2-drmrs for software upgrade [09:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:58] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 82 probes of 660 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:00:42] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:49] (03CR) 10Volans: "LGTM +1, just couple of optional 
nits in user visible messages." [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [10:02:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1099.eqiad.wmnet with OS bullseye [10:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:36] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 57 probes of 660 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:05:34] (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/764706 (owner: 10Elukey) [10:06:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [10:07:20] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bullseye [10:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:22] (03CR) 10Jbond: R:varnish:instance: Add general public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [10:10:03] thanos-rule is me [10:10:40] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10MatthewVernon) @KFrancis can you confirm that @Ammarpad has signed an NDA, please? 
[10:10:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [10:11:20] (03PS1) 10JMeybohm: Add LVS servie k8s-ingress-wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) [10:11:22] (03PS1) 10JMeybohm: Move k8s-ingress-wikikube to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966) [10:11:24] (03PS1) 10JMeybohm: Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966) [10:11:26] (03PS1) 10JMeybohm: Move k8s-ingress-wikikube to state: production [puppet] - 10https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966) [10:11:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [10:11:42] (03PS1) 10Btullis: Absent the eventlogging_to_druid_job job temporarily [puppet] - 10https://gerrit.wikimedia.org/r/764737 (https://phabricator.wikimedia.org/T302263) [10:12:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [10:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:26] (03CR) 10Btullis: "Temporarily absenting the job so that we can re-enable puppet on the host." 
[puppet] - 10https://gerrit.wikimedia.org/r/764737 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [10:13:40] (03PS1) 10JMeybohm: Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) [10:13:43] (03CR) 10Elukey: [C: 03+1] Absent the eventlogging_to_druid_job job temporarily [puppet] - 10https://gerrit.wikimedia.org/r/764737 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [10:13:48] (03CR) 10jerkins-bot: [V: 04-1] Absent the eventlogging_to_druid_job job temporarily [puppet] - 10https://gerrit.wikimedia.org/r/764737 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [10:14:26] (KubernetesCalicoDown) firing: ml-serve1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:14:40] (03CR) 10jerkins-bot: [V: 04-1] Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:15:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [10:16:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T302185)', diff saved to https://phabricator.wikimedia.org/P21269 and previous config saved to /var/cache/conftool/dbconfig/20220222-101604-ladsgroup.json [10:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:10] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [10:16:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [10:16:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [10:16:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [10:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:49] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300774)', diff saved to https://phabricator.wikimedia.org/P21270 and previous config saved to /var/cache/conftool/dbconfig/20220222-101649-kormat.json [10:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:55] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:16:56] (03PS2) 10Btullis: Absent the eventlogging_to_druid_job job temporarily [puppet] - 10https://gerrit.wikimedia.org/r/764737 (https://phabricator.wikimedia.org/T302263) [10:17:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:17:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T300381)', diff saved to https://phabricator.wikimedia.org/P21271 and previous config saved to /var/cache/conftool/dbconfig/20220222-101710-marostegui.json [10:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] T300381: Make 
page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:20:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [10:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:00] (03CR) 10Btullis: [C: 03+2] Absent the eventlogging_to_druid_job job temporarily [puppet] - 10https://gerrit.wikimedia.org/r/764737 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [10:21:22] (03CR) 10Volans: [C: 03+1] "I agree the compiler is basically a noop on the content, lo LGTM, but let's make sure that puppet runs fine on all hosts and that there ar" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [10:21:40] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/764438 (owner: 10Jbond) [10:21:45] (03PS2) 10JMeybohm: Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) [10:24:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [10:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:14] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:26:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300381)', diff saved to https://phabricator.wikimedia.org/P21272 and previous config saved to /var/cache/conftool/dbconfig/20220222-102623-marostegui.json [10:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:29] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:26:32] note that ssh to dumpsdata regularly is just fine, it's only the mgmt 
interface [10:28:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 from inference.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) [10:29:05] (03PS1) 10JMeybohm: Add k8s-ingress-wikikube to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/764739 (https://phabricator.wikimedia.org/T290966) [10:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21273 and previous config saved to /var/cache/conftool/dbconfig/20220222-103109-ladsgroup.json [10:31:12] (03PS2) 10JMeybohm: Add k8s-ingress-wikikube LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/764728 (https://phabricator.wikimedia.org/T290966) [10:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:14] (03PS3) 10JMeybohm: Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) [10:35:06] RECOVERY - Juniper alarms on lsw1-e3-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [10:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [10:36:27] (03PS1) 10Volans: setup.py: upper limit for black [software/spicerack] - 10https://gerrit.wikimedia.org/r/764740 [10:36:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS bullseye [10:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:23] (03CR) 10Volans: "It should fix local CI on bullseye, shamefully stolen from https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/761297/8/setup" [software/spicerack] - 
10https://gerrit.wikimedia.org/r/764740 (owner: 10Volans) [10:37:39] (03PS1) 10Elukey: kserve-inference: simplify storage config for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/764741 [10:39:26] (KubernetesCalicoDown) resolved: ml-serve1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:40:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P21274 and previous config saved to /var/cache/conftool/dbconfig/20220222-104128-marostegui.json [10:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:45] 10SRE, 10DNS, 10Traffic: Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10Sebastian_Berlin-WMSE) [10:43:03] (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for ml-serve1002 [puppet] - 10https://gerrit.wikimedia.org/r/764707 (owner: 10Elukey) [10:43:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1002.eqiad.wmnet with OS bullseye [10:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:07] (03CR) 10Kevin Bazira: [C: 03+1] "this is a very nice idea. it keeps things DRY." [deployment-charts] - 10https://gerrit.wikimedia.org/r/764741 (owner: 10Elukey) [10:45:35] (03PS1) 10Kormat: Remove obsolete otrs.yaml hiera. 
[labs/private] - 10https://gerrit.wikimedia.org/r/764743 (https://phabricator.wikimedia.org/T293942) [10:46:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21275 and previous config saved to /var/cache/conftool/dbconfig/20220222-104613-ladsgroup.json [10:46:16] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:52] (03CR) 10Kormat: "Hey. I saw that the file was removed in the real private repo already." [labs/private] - 10https://gerrit.wikimedia.org/r/764743 (https://phabricator.wikimedia.org/T293942) (owner: 10Kormat) [10:47:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) [10:48:17] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [10:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [10:49:08] (03PS2) 10Elukey: kserve-inference: simplify storage config for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/764741 [10:50:26] (KubernetesCalicoDown) firing: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org 
[10:50:50] (03PS1) 10Kormat: mariadb: Reference the actual OTRS passwords in the m2 grants file. [puppet] - 10https://gerrit.wikimedia.org/r/764744 [10:52:27] jouncebot: nowandnext [10:52:27] No deployments scheduled for the next 3 hour(s) and 7 minute(s) [10:52:27] In 3 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1400) [10:52:27] In 3 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1400) [10:52:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Reference the actual OTRS passwords in the m2 grants file. [puppet] - 10https://gerrit.wikimedia.org/r/764744 (owner: 10Kormat) [10:54:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) p:05Triage→03Medium [10:54:53] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/764740 (owner: 10Volans) [10:56:02] !log Deployed patch for T302192 [10:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:31] (03CR) 10Volans: [C: 03+2] setup.py: upper limit for black [software/spicerack] - 10https://gerrit.wikimedia.org/r/764740 (owner: 10Volans) [10:56:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P21276 and previous config saved to /var/cache/conftool/dbconfig/20220222-105632-marostegui.json [10:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [10:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:53] !log kormat@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1110 (T300774)', diff saved to https://phabricator.wikimedia.org/P21277 and previous config saved to /var/cache/conftool/dbconfig/20220222-105653-kormat.json [10:56:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [10:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:58] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:59:59] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [11:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: host reimage [11:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:41] (KubernetesCalicoDown) resolved: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:00:56] (03CR) 10Elukey: [C: 03+2] kserve-inference: simplify storage config for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/764741 (owner: 10Elukey) [11:01:14] (03CR) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
(037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [11:01:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T302185)', diff saved to https://phabricator.wikimedia.org/P21278 and previous config saved to /var/cache/conftool/dbconfig/20220222-110118-ladsgroup.json [11:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:24] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [11:02:13] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10MatthewVernon) [11:02:15] (03PS2) 10Kormat: mariadb: Reference the actual OTRS passwords in the m2 grants file. [puppet] - 10https://gerrit.wikimedia.org/r/764744 [11:02:34] (03PS1) 10Jbond: P:netbox: use token not tokens [puppet] - 10https://gerrit.wikimedia.org/r/764745 [11:02:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:netbox: use token not tokens [puppet] - 10https://gerrit.wikimedia.org/r/764745 (owner: 10Jbond) [11:03:24] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [11:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:26] (KubernetesCalicoDown) firing: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:06:21] (03PS2) 10Elukey: Add overlay settings for ml-serve1003 [puppet] - 10https://gerrit.wikimedia.org/r/764708 [11:06:22] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1002.eqiad.wmnet with OS bullseye [11:06:23] (03PS2) 10Elukey: Add overlayfs settings for ml-serve1004 [puppet] - 10https://gerrit.wikimedia.org/r/764709 [11:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:20] 10SRE, 
10Observability-Metrics: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10Vgutierrez) [11:07:28] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33924/console" [puppet] - 10https://gerrit.wikimedia.org/r/764744 (owner: 10Kormat) [11:08:01] 10SRE, 10Observability-Metrics, 10Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10Vgutierrez) [11:08:10] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye [11:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed... 
[11:08:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MatthewVernon) [11:08:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice, thank you for taking a look" [software/spicerack] - 10https://gerrit.wikimedia.org/r/764740 (owner: 10Volans) [11:09:03] (03PS3) 10Elukey: Add overlayfs settings for ml-serve1004 [puppet] - 10https://gerrit.wikimedia.org/r/764709 [11:09:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33926/console" [puppet] - 10https://gerrit.wikimedia.org/r/764709 (owner: 10Elukey) [11:09:58] 10SRE, 10Observability-Metrics, 10Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10Vgutierrez) p:05Triage→03Medium [11:10:35] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [11:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [11:11:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300381)', diff saved to https://phabricator.wikimedia.org/P21279 and previous config saved to /var/cache/conftool/dbconfig/20220222-111137-marostegui.json [11:11:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:11:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:43] 
T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:11:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300381)', diff saved to https://phabricator.wikimedia.org/P21280 and previous config saved to /var/cache/conftool/dbconfig/20220222-111144-marostegui.json [11:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:57] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33927/console" [puppet] - 10https://gerrit.wikimedia.org/r/764709 (owner: 10Elukey) [11:11:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21281 and previous config saved to /var/cache/conftool/dbconfig/20220222-111157-kormat.json [11:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:11] (KubernetesCalicoDown) resolved: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:12:26] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [11:12:41] (03Merged) 10jenkins-bot: setup.py: upper limit for black [software/spicerack] - 10https://gerrit.wikimedia.org/r/764740 (owner: 10Volans) [11:12:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21282 and previous config saved to /var/cache/conftool/dbconfig/20220222-111254-ladsgroup.json [11:13:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:00] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [11:14:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:15:26] (KubernetesCalicoDown) firing: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:15:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) [11:16:59] (03PS2) 10Lucas Werkmeister (WMDE): beta: Allow opening the alpha NewLexeme special page on beta-wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: 10Michael Große) [11:17:06] ^ I’ll deploy this beta-only patch now [11:17:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] beta: Allow opening the alpha NewLexeme special page on beta-wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: 10Michael Große) [11:18:06] (03Merged) 10jenkins-bot: beta: Allow opening the alpha NewLexeme special page on beta-wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: 10Michael Große) [11:20:42] !log deploy netbox puppet refactor (should be noop) [11:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:46] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:763703|beta: Allow opening the alpha NewLexeme special page on beta-wikidatawiki (T301234)]] (Beta only) (duration: 
00m 48s) [11:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:52] T301234: Create basic form with text inputs - https://phabricator.wikimedia.org/T301234 [11:20:58] !log deploy netbox puppet refactor gerrit:764330 (should be noop) [11:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:15] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:23:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:51] (03CR) 10Elukey: [C: 03+2] Add overlay settings for ml-serve1003 [puppet] - 10https://gerrit.wikimedia.org/r/764708 (owner: 10Elukey) [11:24:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:29] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bullseye [11:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [11:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:08] (03PS1) 10Jon Harald Søby: [beta] Set up beta incubatorwiki [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/764746 (https://phabricator.wikimedia.org/T210492) [11:27:02] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21283 and previous config saved to /var/cache/conftool/dbconfig/20220222-112702-kormat.json [11:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:56] (03CR) 10Urbanecm: [C: 03+1] [beta] Set up beta incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764746 (https://phabricator.wikimedia.org/T210492) (owner: 10Jon Harald Søby) [11:28:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P21284 and previous config saved to /var/cache/conftool/dbconfig/20220222-112759-ladsgroup.json [11:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:07] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:30:26] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed... 
[11:31:53] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:32:11] (KubernetesCalicoDown) firing: (2) ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:34:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] InitialiseSettings: General cleanup, wgRemoveGroups (A-D) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [11:35:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "(LGTM apart from the one open comment)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [11:36:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300381)', diff saved to https://phabricator.wikimedia.org/P21285 and previous config saved to /var/cache/conftool/dbconfig/20220222-113609-marostegui.json [11:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:16] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:37:28] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [11:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:35] (03CR) 10Volans: "forgot one comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [11:40:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [11:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:07] !log kormat@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1110 (T300774)', diff saved to https://phabricator.wikimedia.org/P21286 and previous config saved to /var/cache/conftool/dbconfig/20220222-114206-kormat.json [11:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:13] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P21287 and previous config saved to /var/cache/conftool/dbconfig/20220222-114304-ladsgroup.json [11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:25] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1003.eqiad.wmnet with OS bullseye [11:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:52] (03PS1) 10JMeybohm: miscweb: Enable ingress for all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) [11:48:11] RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21288 and previous config saved to /var/cache/conftool/dbconfig/20220222-115114-marostegui.json [11:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:01] PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:22] (03PS8) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [11:54:24] (03PS6) 10Giuseppe Lavagetto: varnish/frontend: 
consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 [11:57:26] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [11:58:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21289 and previous config saved to /var/cache/conftool/dbconfig/20220222-115808-ladsgroup.json [11:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:15] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [11:58:25] (03PS1) 10Jbond: P:spicerack: Add back missing data [puppet] - 10https://gerrit.wikimedia.org/r/764751 [11:59:13] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:spicerack: Add back missing data [puppet] - 10https://gerrit.wikimedia.org/r/764751 (owner: 10Jbond) [12:03:04] topranks, elukey: the above patch ^^^ has fixed puppet on the cumin hosts, so if your reimage failed for that reason (unable to run puppet on the cumin host itself) you should be able to retry now. If the debian-installer has completed you can pass the --no-pxe to resume after that and it will be much quicker. [12:04:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) @fgiunchedi Hey, also struggling somewhat with this. That IP is currently... [12:04:43] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is more than tripling the timeout. 
It's also higher than the current global service_check_timeout, which is set to 90 currently, with" [puppet] - 10https://gerrit.wikimedia.org/r/764464 (owner: 10Andrew Bogott) [12:05:26] (KubernetesCalicoDown) firing: (2) ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [12:06:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21290 and previous config saved to /var/cache/conftool/dbconfig/20220222-120619-marostegui.json [12:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:43] (03PS1) 10Jbond: common.yaml: make netbox_api_url a global [puppet] - 10https://gerrit.wikimedia.org/r/764752 [12:07:20] (03PS2) 10Jbond: common.yaml: make netbox_api_url a global [puppet] - 10https://gerrit.wikimedia.org/r/764752 [12:07:26] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [12:07:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] common.yaml: make netbox_api_url a global [puppet] - 10https://gerrit.wikimedia.org/r/764752 (owner: 10Jbond) [12:08:24] volans: thanks, I hadn't worked out what was causing it to fail but ok great I'm trying it again now. [12:12:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) Capture also containing requests from Prometheus1005 (10.64.0.82) which do... 
[12:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300381)', diff saved to https://phabricator.wikimedia.org/P21291 and previous config saved to /var/cache/conftool/dbconfig/20220222-122124-marostegui.json [12:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:31] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:22:11] (KubernetesCalicoDown) resolved: (2) ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [12:22:26] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [12:22:48] goood [12:23:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:23:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21292 and previous config saved to /var/cache/conftool/dbconfig/20220222-122351-marostegui.json [12:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:02] (03CR) 10Jbond: varnish/frontend: consume etcd data for dynamic banning of requests. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [12:29:17] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add overlayfs settings for ml-serve1004 [puppet] - 10https://gerrit.wikimedia.org/r/764709 (owner: 10Elukey) [12:31:47] (03CR) 10Marostegui: mariadb: Reference the actual OTRS passwords in the m2 grants file. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764744 (owner: 10Kormat) [12:32:47] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bullseye [12:32:50] (03CR) 10Volans: "One small nit and ready to go." [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21293 and previous config saved to /var/cache/conftool/dbconfig/20220222-123332-marostegui.json [12:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:38] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:35:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [12:36:53] (03CR) 10Elukey: [C: 03+1] Enable ingress and cert-manager in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/764723 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:38:58] (03PS1) 10Jbond: O:reposync: document the need for KEYHOLDER_SOCK [puppet] - 10https://gerrit.wikimedia.org/r/764755 [12:39:26] (KubernetesCalicoDown) firing: ml-serve1004.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [12:39:34] (03CR) 10Elukey: [C: 03+1] Add k8s-ingress-wikikube LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/764728 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:40:07] (03CR) 10Elukey: [C: 03+1] Add k8s-ingress-wikikube discovery record [dns] - 10https://gerrit.wikimedia.org/r/764738 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:41:14] (03CR) 10Elukey: [C: 03+1] Add LVS servie k8s-ingress-wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:41:27] (03CR) 10Elukey: [C: 03+1] Move k8s-ingress-wikikube to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:41:39] (03CR) 10Elukey: [C: 03+1] Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:41:49] (03CR) 10Elukey: [C: 03+1] Move k8s-ingress-wikikube to state: production [puppet] - 10https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:42:04] (03CR) 10Elukey: [C: 03+1] Add k8s-ingress-wikikube to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/764739 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:44:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:44:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21294 and previous config saved to 
/var/cache/conftool/dbconfig/20220222-124449-kormat.json [12:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:57] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:45:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [12:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:09] (03CR) 10Kormat: [C: 03+1] auto_schema: Split dry run logs [software] - 10https://gerrit.wikimedia.org/r/764620 (owner: 10Ladsgroup) [12:46:55] (03CR) 10Kormat: [V: 03+1] mariadb: Reference the actual OTRS passwords in the m2 grants file. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764744 (owner: 10Kormat) [12:47:17] (03CR) 10Elukey: miscweb: Enable ingress for all clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:47:28] !log bounce prometheus-blackbox-exporter on prometheus1006 - T302265 [12:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:36] T302265: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 [12:48:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [12:48:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21295 and previous config saved to /var/cache/conftool/dbconfig/20220222-124837-marostegui.json [12:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:39] PROBLEM - SSH on mw2258.mgmt is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:50:44] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: T301165; errors expected, not serving any traffic [12:50:46] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: T301165; errors expected, not serving any traffic [12:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:50] T301165: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 [12:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:22] (03CR) 10Jbond: reposync: add new class to manage syncing repositories (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:56:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) Capture of requests directly on primary interface, to get full Ethernet he... 
[12:59:30] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [12:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [13:00:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bullseye [13:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21296 and previous config saved to /var/cache/conftool/dbconfig/20220222-130342-marostegui.json [13:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:26] (KubernetesCalicoDown) resolved: ml-serve1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [13:05:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[13:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) [13:11:12] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [13:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [13:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:25] (03PS1) 10Elukey: ml-services: fix revscoring-editquality's model versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/764758 [13:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21297 and previous config saved to /var/cache/conftool/dbconfig/20220222-131846-marostegui.json [13:18:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:18:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:53] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:18:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21298 and previous config saved to /var/cache/conftool/dbconfig/20220222-131854-marostegui.json [13:18:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:15] (03CR) 10Elukey: [C: 03+2] ml-services: fix revscoring-editquality's model versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/764758 (owner: 10Elukey) [13:21:35] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10MoritzMuehlenhoff) >>! In T276589#7727280, @Joe wrote: > Any update on this? This upgrade is blocking serviceops who needs bullseye for the kubernetes python libraries and cookbooks. You can... [13:23:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [13:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:42] jouncebot: nowandnext [13:23:42] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [13:23:42] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1400) [13:24:10] !log rebalance ganeti eqiad row_D (all nodes reimaged in there) T296721 [13:24:11] * urbanecm goes to create a beta cluster wiki [13:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:16] T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 [13:24:23] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1093.eqiad.wmnet with OS bullseye [13:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By:
TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye complete... [13:24:43] (03CR) 10Urbanecm: [C: 03+2] "let's set this up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764746 (https://phabricator.wikimedia.org/T210492) (owner: 10Jon Harald Søby) [13:25:54] (03Merged) 10jenkins-bot: [beta] Set up beta incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764746 (https://phabricator.wikimedia.org/T210492) (owner: 10Jon Harald Søby) [13:26:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21299 and previous config saved to /var/cache/conftool/dbconfig/20220222-132637-root.json [13:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21300 and previous config saved to /var/cache/conftool/dbconfig/20220222-132824-marostegui.json [13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:31] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:30:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:43] (03PS1) 10Urbanecm: [beta] Add wgServer for incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764780 (https://phabricator.wikimedia.org/T210492) [13:31:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:32:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] (03CR) 10Urbanecm: [C: 03+2] [beta] Add wgServer for incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764780 (https://phabricator.wikimedia.org/T210492) (owner: 10Urbanecm) [13:32:04] (03CR) 10Kormat: [C: 03+1] "On the production side, this will create a small bit of noise in the form of metrics that aren't really usable for us, but i think we can " [puppet] - 10https://gerrit.wikimedia.org/r/763490 (owner: 10Majavah) [13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:31] !log bounce prometheus-blackbox-exporter on prometheus1005 - T302265 [13:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:36] T302265: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 [13:32:41] (03Merged) 10jenkins-bot: [beta] Add wgServer for incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764780 (https://phabricator.wikimedia.org/T210492) (owner: 10Urbanecm) [13:33:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:48] (03PS1) 10Urbanecm: [beta] Add incubator.wikimedia.beta.wmflabs.org to beta sites [puppet] - 10https://gerrit.wikimedia.org/r/764781 (https://phabricator.wikimedia.org/T210492) [13:34:01] (03CR) 10Kormat: [C: 03+2] prometheus: add heartbeat collection on mysqld_exporter [puppet] - 10https://gerrit.wikimedia.org/r/763490 (owner: 10Majavah) [13:36:04] (03CR) 10David Caro: "Added a question, I'm not very familiar with ferm, so might make no sense, feel free to ignore." 
[puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:38:14] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/764755 (owner: 10Jbond) [13:38:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:26] so in theory, incubator.wikimedia.beta.wmflabs.org is running, but of course i forgot it's a special wiki and needs a puppet patch :D [13:39:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:39:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:43] (03CR) 10David Caro: P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:40:11] urbanecm: if you want to test if it works, you can cherry-pick the commit to deployment-puppetmaster04 [13:40:36] taavi: how would i do that? [13:40:51] can you log in to deployment-puppetmaster04.deployment-prep.eqiad1.wikimedia.cloud? 
[13:40:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:40:52] (I only ever made my own commits there, dunno how to fetch from gerrit and cherry-pick) [13:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:55] yes, I'm there [13:41:08] sudo as root and cd to /var/lib/git/operations/puppet [13:41:09] in /var/lib/git/operations/puppet as root [13:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21301 and previous config saved to /var/cache/conftool/dbconfig/20220222-134141-root.json [13:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:02] taavi: and now i copy cherry-pick from the download screen at https://gerrit.wikimedia.org/r/c/operations/puppet/+/764781/? or sth else? [13:42:05] then, on the gerrit interface, click "download" (right side of the bar where you can choose the patch set version), and copy paste the commands under cherry-pick [13:42:22] great [13:42:24] thanks [13:42:24] there is https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit [13:42:41] heh, we've docs! 
[13:43:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21302 and previous config saved to /var/cache/conftool/dbconfig/20220222-134329-marostegui.json [13:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:53] (03PS36) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [13:43:55] (03PS1) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [13:44:40] (03CR) 10Majavah: [V: 03+1] P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:44:43] and `urbanecm@deployment-mediawiki12:~$ curl -i --connect-to "::$HOSTNAME" 'https://incubator.wikimedia.beta.wmflabs.org'` lets me to talk to the wiki [13:45:09] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21303 and previous config saved to /var/cache/conftool/dbconfig/20220222-134509-kormat.json [13:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:15] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:45:18] after HTCP purge, it's up, but with a DB error :/ [13:45:21] * taavi wishes for tab completion when sshing wmcs hosts [13:45:58] i almost always copy it from openstack browser [13:46:15] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:46:41] in theory that should be fairly simple to do in deployment-prep and tools, since those projects have puppetdb meaning we can copy-paste the code config-master.wm.o uses [13:48:45] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:48:58] seems we have beta incubator now [13:49:24] (03PS1) 10Kormat: mariadb: Switch s7 primary db1173 -> db1131 [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) [13:50:02] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:50:06] (03CR) 10jerkins-bot: [V: 04-1] spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond) [13:50:10] taavi: do you mind +2'ing https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/764785? [13:50:16] * taavi looks [13:50:38] (03CR) 10David Caro: P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:50:48] maybe tag T268576? [13:50:49] T268576: Convert Translate to AbstractSchema - https://phabricator.wikimedia.org/T268576 [13:50:56] (03PS1) 10Kormat: wmnet: Update s6-master CNAME. [dns] - 10https://gerrit.wikimedia.org/r/764786 (https://phabricator.wikimedia.org/T300471) [13:51:00] done [13:51:14] (03CR) 10Kormat: [C: 04-2] "-2 until switchover day." [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [13:51:31] (03CR) 10Kormat: [C: 04-2] "-2 until switchover day." 
[dns] - 10https://gerrit.wikimedia.org/r/764786 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [13:52:05] (03CR) 10Marostegui: "Commit message says s7, but it is s6" [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [13:52:14] I wonder if it's intentional that the list does not include all the .sql files in translate [13:52:20] (03CR) 10Marostegui: [C: 03+1] wmnet: Update s6-master CNAME. [dns] - 10https://gerrit.wikimedia.org/r/764786 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [13:52:36] taavi: very likely. AFAIK the extension doesn't use all tables in WM prod [13:52:45] ack [13:52:46] createExtensionTables.php definitely used to work for translate before [13:52:57] +2'd [13:53:00] thanks! [13:53:04] (03PS2) 10Kormat: mariadb: Switch s6 primary db1173 -> db1131 [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) [13:53:19] (03CR) 10Kormat: [C: 04-2] mariadb: Switch s6 primary db1173 -> db1131 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [13:53:54] (03CR) 10Marostegui: [C: 03+1] mariadb: Switch s6 primary db1173 -> db1131 [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [13:54:39] (03PS2) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [13:54:58] (03CR) 10Urbanecm: [V: 03+1] "commit is already cherry-picked to beta (and works)." 
[puppet] - 10https://gerrit.wikimedia.org/r/764781 (https://phabricator.wikimedia.org/T210492) (owner: 10Urbanecm) [13:56:13] (03CR) 10Majavah: [V: 03+1] P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:56:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21304 and previous config saved to /var/cache/conftool/dbconfig/20220222-135644-root.json [13:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21305 and previous config saved to /var/cache/conftool/dbconfig/20220222-135833-marostegui.json [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1400) [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21306 and previous config saved to /var/cache/conftool/dbconfig/20220222-140013-kormat.json [14:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:20] duplicate jouncebot messages? 
[14:00:20] indeed, nothing to do [14:00:22] o_O why the double message [14:00:27] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:00:28] I'll deploy one sec patch then [14:00:41] (03CR) 10jerkins-bot: [V: 04-1] spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond) [14:00:45] taavi: ping me once done, I'll do T112147 too [14:00:46] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [14:01:03] ok [14:01:45] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10Zabe) >>! In T299839#7725662, @MatthewVernon wrote: > Does this still need #WMF-NDA-Requests tagging in it? It means it appears in the Clinic Duty... [14:02:46] (03CR) 10David Caro: [C: 03+1] P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [14:02:56] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: increase task manager mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 (owner: 10DCausse) [14:02:59] (03PS1) 10Elukey: admin_ng: raise resource quotas for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/764787 [14:06:32] (03Merged) 10jenkins-bot: flink-session-cluster: increase task manager mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 (owner: 10DCausse) [14:07:23] (03CR) 10Elukey: [C: 03+2] admin_ng: raise resource quotas for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/764787 (owner: 10Elukey) [14:07:46] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:07:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:33] 10SRE, 10DC-Ops, 10serviceops: setup/install mc20[38-55] - https://phabricator.wikimedia.org/T302218 (10Papaul) @akosiaris hello any reason why this task is assigned to me ? [14:10:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:41] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [14:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21307 and previous config saved to /var/cache/conftool/dbconfig/20220222-141148-root.json [14:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:53] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:12:13] syncing [14:13:21] 10SRE, 10LDAP-Access-Requests: Grant Access to releasers-mediawiki for MarkAHershberger and Mglaser - https://phabricator.wikimedia.org/T302160 (10MarkAHershberger) >>! In T302160#7727009, @Legoktm wrote: > releasers-mediawiki isn't an LDAP group, it's a shell group, you need to follow #sre-access-requests, an... 
[14:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21308 and previous config saved to /var/cache/conftool/dbconfig/20220222-141338-marostegui.json [14:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:44] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:14:01] !log deploy T302248 patch [14:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:06] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [14:15:10] urbanecm: over to you [14:15:19] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21309 and previous config saved to /var/cache/conftool/dbconfig/20220222-141518-kormat.json [14:15:22] thanks [14:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:33] (03PS2) 10Urbanecm: Do not delete the suppress group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752162 (https://phabricator.wikimedia.org/T112147) [14:15:38] (03CR) 10Urbanecm: [C: 03+2] Do not delete the suppress group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752162 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:16:18] (03Merged) 10jenkins-bot: Do not delete the suppress group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752162 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:18:38] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 6859cd28a2dd214b108b589bc8ecfb24dac93f9c: Do not delete the suppress group (T112147) (duration: 00m 50s) [14:18:42] 10SRE, 10DC-Ops, 10serviceops: setup/install mc20[38-55] - https://phabricator.wikimedia.org/T302218 (10akosiaris) 
a:05Papaul→03None >>! In T302218#7728106, @Papaul wrote: > @akosiaris hello any reason why this task is assigned to me ? I created it as a subtask of T294962 and forgot to remove the assign... [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:44] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [14:18:55] (03PS1) 10Urbanecm: Add suppress group to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764789 (https://phabricator.wikimedia.org/T112147) [14:19:08] (03CR) 10Urbanecm: [C: 03+2] Add suppress group to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764789 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:19:32] 10SRE, 10DC-Ops, 10serviceops: setup/install mc20[38-55] - https://phabricator.wikimedia.org/T302218 (10akosiaris) [14:19:43] 10SRE, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10MatthewVernon) [14:19:51] (03Merged) 10jenkins-bot: Add suppress group to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764789 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:20:16] Okay, mwscript migrateUserGroup.php --wiki=testwiki oversight suppress works just fine [14:20:43] (03PS1) 10Cathal Mooney: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) [14:21:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:54] !log 
urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ec07ac00a2676b9c0f6481e752ae91814e3828db: Add suppress group to privileged groups (T112147) (duration: 00m 49s) [14:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:22:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:47] 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger, Mglaser+ - https://phabricator.wikimedia.org/T302287 (10MarkAHershberger) [14:23:07] (03PS1) 10Urbanecm: Update oversight group to suppress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764792 (https://phabricator.wikimedia.org/T112147) [14:23:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:07] !log mwscript migrateUserGroup.php --wiki=metawiki oversight suppress # T112147 [14:24:26] 10SRE, 10LDAP-Access-Requests: Grant Access to releasers-mediawiki for MarkAHershberger and Mglaser - https://phabricator.wikimedia.org/T302160 (10MarkAHershberger) 05Open→03Resolved a:03MarkAHershberger See https://phabricator.wikimedia.org/T302287 [14:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:41] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [14:25:57] 10SRE, 10LDAP-Access-Requests: Grant Access to 
releasers-mediawiki for MarkAHershberger and Mglaser - https://phabricator.wikimedia.org/T302160 (10MarkAHershberger) 05Resolved→03Invalid [14:28:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21311 and previous config saved to /var/cache/conftool/dbconfig/20220222-143023-kormat.json [14:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:29] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:31:08] !log Run `[urbanecm@mwmaint1002 ~]$ foreachwikiindblist oversight-wikis migrateUserGroup.php oversight suppress` in a tmux session (oversight-wikis.dblist is a temporary dblist from P21310; T112147) [14:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:13] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [14:31:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add LVS servie k8s-ingress-wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:31:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:31:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:31:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move k8s-ingress-wikikube to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/764734 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:58] (03CR) 10Urbanecm: [C: 03+2] Update oversight group to suppress [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/764792 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move k8s-ingress-wikikube to state: monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/764735 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:32:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move k8s-ingress-wikikube to state: production [puppet] - 10https://gerrit.wikimedia.org/r/764736 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:32:31] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:32:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add k8s-ingress-wikikube to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/764739 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:43] (03Merged) 10jenkins-bot: Update oversight group to suppress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764792 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:33:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:05] (03PS1) 10Kormat: Revert "prometheus: add heartbeat collection on mysqld_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/764401 [14:35:14] (03PS1) 10Urbanecm: Revert "Update oversight group to suppress" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764402 (https://phabricator.wikimedia.org/T112147) [14:35:24] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Update oversight group to suppress" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764402 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [14:35:36] (Processor usage over 85%) 
firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [14:35:36] (03CR) 10Kormat: "Example error:" [puppet] - 10https://gerrit.wikimedia.org/r/764401 (owner: 10Kormat) [14:35:49] urbanecm: I think you broke something https://integration.wikimedia.org/ci/job/beta-scap-sync-world/40313/console [14:36:10] taavi: yeah, I'm aware. the revert fixes it [14:36:11] (03CR) 10Kormat: [C: 03+2] Revert "prometheus: add heartbeat collection on mysqld_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/764401 (owner: 10Kormat) [14:36:16] i broke the extension function [14:36:32] but i'll remove it once the migration script finishes, so...i'll just try it again after [14:37:31] (03CR) 10Kormat: "Another failure mode:" [puppet] - 10https://gerrit.wikimedia.org/r/764401 (owner: 10Kormat) [14:38:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:39:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] RECOVERY - Check systemd state on durum6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:59] RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:07] (03PS1) 10Ssingh: P:wikidough: add monitoring for IPv6 
endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [14:53:42] (03PS1) 10Elukey: knative-serving: keep only the last two revisions by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/764799 [14:53:54] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on durum[6001-6002].drmrs.wmnet with reason: T301165; errors expected, not serving any traffic [14:53:56] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on durum[6001-6002].drmrs.wmnet with reason: T301165; errors expected, not serving any traffic [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] T301165: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 [14:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:31] okay, migration script all done now [15:07:34] jouncebot: nowandnext [15:07:34] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [15:07:34] In 1 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1700) [15:07:45] !log Finishing deployment of T112147 that started during B&C time [15:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:51] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [15:09:21] (03PS2) 10Urbanecm: Remove the oversight group hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752163 (https://phabricator.wikimedia.org/T112147) [15:09:29] (03CR) 10Urbanecm: [C: 03+2] Remove the oversight group hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752163 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [15:10:16] (03PS1) 10Hnowlan: restbase: add deployment-restbase04 [puppet] - 10https://gerrit.wikimedia.org/r/764801 (https://phabricator.wikimedia.org/T295375) 
[15:11:24] (03Merged) 10jenkins-bot: Remove the oversight group hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752163 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [15:13:12] (03PS1) 10Urbanecm: Revert "Revert "Update oversight group to suppress"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764403 (https://phabricator.wikimedia.org/T112147) [15:13:13] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:13:44] (03PS2) 10Urbanecm: Revert "Revert "Update oversight group to suppress"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764403 (https://phabricator.wikimedia.org/T112147) [15:13:45] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 79cfa4e7c509868bdb0a23841b70614724745a3d: Remove the oversight group hack (T112147) (duration: 00m 48s) [15:13:47] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Update oversight group to suppress"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764403 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:53] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [15:15:35] (03Merged) 10jenkins-bot: Revert "Revert "Update oversight group to suppress"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764403 (https://phabricator.wikimedia.org/T112147) (owner: 10Urbanecm) [15:16:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:17:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:17:21] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4a2a2129a9d1015674868c8539b6cae0e92a4d2a: Update oversight group to suppress (T112147) (duration: 00m 49s) [15:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:41] (03CR) 10Jcrespo: "This has a virtual +1 (haven't looked at the details) from me, probably the only (and best) thing that can be done at the moment. 
Thanks f" [puppet] - 10https://gerrit.wikimedia.org/r/764744 (owner: 10Kormat) [15:23:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:30] !log Run `mwscript purgeExpiredUserrights.php enwikiquote` to purge an expired but not yet removed row with the old oversight group (T112147) [15:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:35] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147 [15:24:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:24:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:30] !log [urbanecm@mwmaint1002 ~]$ mwscript migrateUserGroup.php --wiki=labswiki oversight suppress # T112147 [15:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:58] !log Migration of oversight => suppress is done (T112147) [15:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:26:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling 
db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21312 and previous config saved to /var/cache/conftool/dbconfig/20220222-152658-kormat.json [15:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:05] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:27:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10Tom_Magerlein) L3 form is signed. Thanks! [15:28:42] 10SRE, 10Sustainability (Incident Followup): 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Krinkle) [15:28:50] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar), 10User-CDanis: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [15:34:29] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Split dry run logs [software] - 10https://gerrit.wikimedia.org/r/764620 (owner: 10Ladsgroup) [15:38:47] (03CR) 10Ayounsi: [C: 03+1] Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:39:07] (03CR) 10JMeybohm: Add LVS servie k8s-ingress-wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:40:41] (03Merged) 10jenkins-bot: auto_schema: Split dry run logs [software] - 10https://gerrit.wikimedia.org/r/764620 (owner: 10Ladsgroup) [15:41:59] (03CR) 10JMeybohm: miscweb: Enable ingress for all clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764749 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:43:39] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:43:44] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:48] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger, Mglaser+ - https://phabricator.wikimedia.org/T302287 (10MatthewVernon) I think this is awaiting the resolution of T293323 also? [15:49:03] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Jclark-ctr) @LSobanski we already have 1 per row in A,B,C,D. New cage is online shortly E, F. we will have to rack 2xE and 2xF [15:53:49] (03PS2) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [15:54:04] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:45] (03PS1) 10Vivian Rook: update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 [15:55:47] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/764816 (owner: 10Vivian Rook) [15:56:19] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Vivian Rook - https://phabricator.wikimedia.org/T302310 (10rook) [15:56:38] (03CR) 10jerkins-bot: [V: 04-1] update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 (owner: 10Vivian Rook) [15:57:19] (03PS1) 10Ssingh: hiera: pour some more snake oil (update heira for wikidough) [labs/private] - 10https://gerrit.wikimedia.org/r/764821 [15:57:30] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Vivian Rook - https://phabricator.wikimedia.org/T302310 (10rook) https://gerrit.wikimedia.org/r/c/operations/puppet/+/764816 placating jenkins now... 
[15:58:09] (03PS2) 10Vivian Rook: update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) [15:59:00] (03CR) 10jerkins-bot: [V: 04-1] update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [15:59:21] (03CR) 10Ssingh: [C: 03+2] hiera: pour some more snake oil (update heira for wikidough) [labs/private] - 10https://gerrit.wikimedia.org/r/764821 (owner: 10Ssingh) [15:59:27] (03CR) 10Ssingh: [V: 03+2 C: 03+2] hiera: pour some more snake oil (update heira for wikidough) [labs/private] - 10https://gerrit.wikimedia.org/r/764821 (owner: 10Ssingh) [15:59:37] (03CR) 10Muehlenhoff: update to vivian rook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:00:43] (03PS3) 10Vivian Rook: update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) [16:00:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21313 and previous config saved to /var/cache/conftool/dbconfig/20220222-160049-kormat.json [16:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:56] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:01:07] (03CR) 10Ahmon Dancy: [C: 03+1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [16:02:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, please make sure to also update the LDAP group memberships for cn=wmf and cn=ops accordingly." 
[puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:04:43] (03CR) 10JMeybohm: [C: 03+2] Add k8s-ingress-wikikube LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/764728 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:05:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) I've tried reducing the workload on 1006 to test the theory that someho... [16:07:04] (03CR) 10Vivian Rook: update to vivian rook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:09:38] (03PS2) 10JHathaway: run_ci_locally.sh: add podman support [puppet] - 10https://gerrit.wikimedia.org/r/763807 [16:11:02] (03PS3) 10JHathaway: run_ci_locally.sh: add podman support [puppet] - 10https://gerrit.wikimedia.org/r/763807 [16:13:40] (03CR) 10Muehlenhoff: [C: 03+1] update to vivian rook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:13:58] (03CR) 10JHathaway: run_ci_locally.sh: add podman support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763807 (owner: 10JHathaway) [16:14:00] (03CR) 10JHathaway: [C: 03+2] run_ci_locally.sh: add podman support [puppet] - 10https://gerrit.wikimedia.org/r/763807 (owner: 10JHathaway) [16:14:23] (03Abandoned) 10JHathaway: run_ci_locally.sh: merge duplicate args [puppet] - 10https://gerrit.wikimedia.org/r/763856 (owner: 10JHathaway) [16:14:30] (03CR) 10Alexandros Kosiaris: Add LVS servie k8s-ingress-wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:15:41] !log rebooting scs-oe16-esams to clear librenms alert [16:15:44] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:55] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21314 and previous config saved to /var/cache/conftool/dbconfig/20220222-161554-kormat.json [16:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for Vivian Rook - https://phabricator.wikimedia.org/T302310 (10rook) [16:19:46] (03CR) 10RhinosF1: "shouldn't mdipedtro be set to absent rather than removed?" [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:20:22] moritzm, Rook: ^ [16:20:36] (Processor usage over 85%) resolved: Device scs-oe16-esams.mgmt.esams.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org [16:21:04] (03CR) 10Vivian Rook: update to vivian rook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:21:32] Rook: absent makes sure it's actually removed [16:21:41] In that case, yes [16:21:43] Taking it out the repo just stops puppet tracking it [16:21:45] (03CR) 10Muehlenhoff: [C: 03+1] update to vivian rook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:22:03] moritzm: can you remove the +1? [16:22:21] updating... [16:22:38] Rook: make sure to add him to the absent list at the top too [16:23:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[16:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:47] (03PS1) 10DCausse: rdf-streaming-updater: reduce limits for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/764824 [16:25:52] (03PS4) 10Vivian Rook: update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) [16:27:01] (03CR) 10RhinosF1: [C: 04-1] "old ssh key needs removing" [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:27:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10fgiunchedi) Following the "sth to do with icmp rate limit" lead I have: * temporar... [16:27:47] (Device rebooted) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Device rebooted - https://alerts.wikimedia.org [16:28:09] Rook: you need to take his ssh key out too [16:28:18] Look at the person below him [16:28:27] (03CR) 10JMeybohm: Add LVS servie k8s-ingress-wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:28:34] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10Dzahn) [16:28:37] (03PS5) 10Vivian Rook: update to vivian rook [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) [16:29:07] 👍 [16:29:57] (03CR) 10Btullis: "This change is ready for review." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:30:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add LVS servie k8s-ingress-wikikube [puppet] - 10https://gerrit.wikimedia.org/r/764733 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [16:30:28] (03CR) 10RhinosF1: "otherwise looks good to me but can we make the commit message a bit clearer." [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:30:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) I don't have time to dive too deep but: consider there's also a ping-offloa... [16:30:57] Rook: actual change looks good now, ^ is a bit of a nit [16:30:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21315 and previous config saved to /var/cache/conftool/dbconfig/20220222-163059-kormat.json [16:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:14] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: reduce limits for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/764824 (owner: 10DCausse) [16:31:15] I can update the commit message, how would you like it to read? 
[16:32:11] Rook: i think something like "add rook to ops, replaces mdipetro who has left" [16:32:21] can do [16:32:33] Current message doesn't really say what's been updated to rook [16:32:47] (Device rebooted) resolved: Device scs-oe16-esams.mgmt.esams.wmnet recovered from Device rebooted - https://alerts.wikimedia.org [16:33:05] (03PS6) 10Vivian Rook: add rook to ops, replaces mdipetro who has left [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) [16:33:33] (03CR) 10RhinosF1: [C: 03+1] add rook to ops, replaces mdipetro who has left [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:34:06] moritzm: ^ [16:34:17] Rook: +1'd, thanks for being patient [16:34:47] (03Merged) 10jenkins-bot: rdf-streaming-updater: reduce limits for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/764824 (owner: 10DCausse) [16:34:49] (03PS3) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [16:34:50] RhinosF1, moritzm: I'm guessing there is a chicken egg problem where I won't be able to ghost myself in to promote my user in ldap once this is merged? And that will be taken care of as part of the associated ticket? [16:35:28] Rook: you should be able to login without ldap access to the server but it'll take 30 minutes to apply [16:35:46] I'm not quite so used to the ldap side though [16:35:54] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [16:36:07] I assume you'll be removing mdipetro's access too [16:36:16] If he's in any special groups [16:36:34] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10MatthewVernon) @dr0ptp4kt your name has come up as someone with knowledge of search-related things. 
Do you have any concerns about us setting up a Bing webmaster tools account for W... [16:37:26] Rook: you might want to fix my awful spelling of his name in the commit messages [16:37:30] Just noticed that sorry [16:38:08] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger, Mglaser+ - https://phabricator.wikimedia.org/T302287 (10Dzahn) I don't think it is. That's just a regular access request. One person per ticket though, please. [16:38:21] (03PS7) 10Vivian Rook: add rook to ops, replaces mdipietro who has left [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) [16:38:27] heh, c'est pas grave, I didn't notice either :) [16:39:47] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:57] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:26] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger, Mglaser+ - https://phabricator.wikimedia.org/T302287 (10Dzahn) 05Open→03Stalled [16:42:33] (03CR) 10jerkins-bot: [V: 04-1] spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond) [16:44:09] (03CR) 10Dzahn: "@Muehlenhoff I am confused why no absenting of the previous root user is needed. Don't they have to be offboarded properly?" 
[puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:46:00] (03CR) 10Kormat: [C: 03+2] add rook to ops, replaces mdipietro who has left [puppet] - 10https://gerrit.wikimedia.org/r/764816 (https://phabricator.wikimedia.org/T302310) (owner: 10Vivian Rook) [16:46:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21316 and previous config saved to /var/cache/conftool/dbconfig/20220222-164604-kormat.json [16:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:11] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:46:29] jhathaway: is it ok to merge your change? [16:46:39] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10Dzahn) Legal already went through the NDA process with Zabe in the past. [16:46:47] kormat: which one? [16:46:53] JHathaway: run_ci_locally.sh: add podman support (ecd5036b43) [16:46:56] it's pending on puppetmaster [16:46:57] ah the podman one, yes, sorry [16:46:58] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Urbanecm) >>! In T302231#7727015, @Legoktm wrote: > Are there Gerrit patches that [[https://gerrit.wikimedia.org/r/q/owner:sam%2540theresnotime.co.uk|this search]] doesn't pick up? No i... 
[16:47:50] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) [16:50:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [16:52:23] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for Vivian Rook - https://phabricator.wikimedia.org/T302310 (10nskaggs) +1, thank you! [16:54:02] (03PS15) 10Jbond: R:varnish:instance: Add hiera key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [16:55:16] (03CR) 10Jbond: "thanks for the reviews but moving this change to WIP as i think that the following is a better route to go" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [17:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1700). Please do the needful. [17:00:05] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:27] * urbanecm waves [17:00:31] urbanecm: howdy :) taking a look [17:01:00] (03CR) 10RLazarus: [C: 03+2] [beta] Add incubator.wikimedia.beta.wmflabs.org to beta sites [puppet] - 10https://gerrit.wikimedia.org/r/764781 (https://phabricator.wikimedia.org/T210492) (owner: 10Urbanecm) [17:01:33] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [17:01:35] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:17] urbanecm: done and done -- doesn't seem like there's anything particular to test since it was already cherry-picked, is that right? [17:02:20] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [17:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:34] correct rzl, assuming the cherrypick goes away as expected :) [17:06:58] cool, I'll be around if any follow-up is needed [17:21:39] (03PS1) 10Ssingh: hiera: add some more fake data for wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764853 [17:21:48] (03CR) 10Cwhite: [C: 03+2] k8s: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763821 (https://phabricator.wikimedia.org/T211982) (owner: 10Cwhite) [17:21:55] (03PS2) 10Cwhite: k8s: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763821 (https://phabricator.wikimedia.org/T211982) [17:22:31] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:22:39] (03CR) 10Ssingh: [V: 03+1 C: 03+1] hiera: add some more fake data for wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764853 (owner: 10Ssingh) [17:22:57] (03CR) 10Ssingh: [V: 
03+2 C: 03+2] hiera: add some more fake data for wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764853 (owner: 10Ssingh) [17:23:33] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Ladsgroup) > her profile isn't what I'm generally looking for in new deployers. Well, I think we should be more inclusive in our deployers, not every deployer need to have a deep knowl... [17:23:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33931/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [17:24:39] (03PS1) 10Jbond: P:mail::mx: add max_runtime_seconds to systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/764854 [17:25:12] (03PS3) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [17:25:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33933/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [17:26:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33934/console" [puppet] - 10https://gerrit.wikimedia.org/r/764854 (owner: 10Jbond) [17:26:49] (03PS4) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [17:27:23] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33935/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [17:28:29] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:31] PROBLEM - Check systemd state on wcqs2003 is 
CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:47] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:59] (03PS1) 10Ebernhardson: [DNM] Test prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/764855 [17:29:41] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Test prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/764855 (owner: 10Ebernhardson) [17:30:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mail::mx: add max_runtime_seconds to systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/764854 (owner: 10Jbond) [17:30:09] (03PS2) 10Jbond: P:mail::mx: add max_runtime_seconds to systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/764854 [17:30:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:mail::mx: add max_runtime_seconds to systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/764854 (owner: 10Jbond) [17:32:28] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33936/console" [puppet] - 10https://gerrit.wikimedia.org/r/764855 (owner: 10Ebernhardson) [17:33:03] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:59] (03Abandoned) 10Ebernhardson: [DNM] Test prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/764855 (owner: 10Ebernhardson) [17:37:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:38:00] ^ this is me [17:38:06] lol [17:38:39] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for 
TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Urbanecm) >>! In T302231#7729155, @Ladsgroup wrote: >> her profile isn't what I'm generally looking for in new deployers. > > Well, I think we should be more inclusive in our deployers... [17:40:03] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:21] RECOVERY - Check systemd state on doh6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:37] 10SRE, 10Infrastructure-Foundations, 10netops: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10cmooney) > 2/ Do AS path prepending to anycast prefixes learned directly from the core routers to match the AS path length on the new design infra. >So 10.3.0.1 on cr1-e... [17:52:41] !log depooling WDQS codfw (internal + public) - issues with deployment of new updater version on cdofw [17:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:48] dcausse: ^ [17:53:06] gehel: thanks! [17:56:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) @bblack thanks for the input. We've validated our ping-offload is not inv... [17:56:15] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Vivian Rook - https://phabricator.wikimedia.org/T302310 (10rook) 05Open→03Resolved [17:57:48] 10SRE, 10Machine-Learning-Team, 10Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10colewhite) This caused a significant rise in dead letters on the logging pipeline today which caused most collectors to [[ https://grafana.wikimedia.org/d/... 
[17:59:14] 10SRE, 10Infrastructure-Foundations, 10netops: Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355 (10cmooney) Given it came up as part of an incident report I'll explicitly mention we need to consider our "network only" POPs, like eqord, as part of this. The key balance we ne... [18:01:05] (03PS1) 10Cwhite: logstash: drop knative_dev/key field [puppet] - 10https://gerrit.wikimedia.org/r/764857 (https://phabricator.wikimedia.org/T288549) [18:01:27] (03CR) 10Herron: "Echoing what Alex said about addressing the long running check" [puppet] - 10https://gerrit.wikimedia.org/r/764464 (owner: 10Andrew Bogott) [18:01:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) What I mean is looking at a different layer of the ping-offload part: the c... [18:03:43] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:45] (03PS1) 10Razzi: kerberos: add krb: present for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/764858 (https://phabricator.wikimedia.org/T300450) [18:05:49] (03CR) 10jerkins-bot: [V: 04-1] kerberos: add krb: present for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/764858 (https://phabricator.wikimedia.org/T300450) (owner: 10Razzi) [18:06:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) @bblack ok I understand where you're coming from. We didn't see any of... 
[18:07:32] (03PS2) 10Razzi: kerberos: add krb: present for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/764858 (https://phabricator.wikimedia.org/T300450) [18:08:15] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [18:13:27] (03CR) 10Herron: [C: 03+1] logstash: drop knative_dev/key field [puppet] - 10https://gerrit.wikimedia.org/r/764857 (https://phabricator.wikimedia.org/T288549) (owner: 10Cwhite) [18:13:59] (03PS5) 10Herron: remove references to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) [18:16:43] (03CR) 10Herron: [C: 03+2] remove references to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [18:17:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) The relevant settings on the LVSes are in `modules/lvs/manifests/kernel_con... 
[18:17:50] (03CR) 10Razzi: [C: 03+2] kerberos: add krb: present for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/764858 (https://phabricator.wikimedia.org/T300450) (owner: 10Razzi) [18:20:07] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [18:20:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:10] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10BBlack) Stepping back out to the broader question again though: I get why we normal... [18:22:36] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:23:37] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts centrallog2001.codfw.wmnet [18:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:15] (03PS1) 10Dduvall: testwikis wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764860 [18:26:17] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764860 (owner: 10Dduvall) [18:26:30] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) (owner: 
10David Caro) [18:27:04] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764860 (owner: 10Dduvall) [18:27:04] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.23 refs T300199 [18:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:09] T300199: 1.38.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T300199 [18:28:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:29:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:30] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:49] !log rebalance ganeti eqiad row_B (all nodes reimaged in there) T296721 [18:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:55] T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 [18:32:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:44] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 513 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:33:32] !log herron@cumin1001 
END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts centrallog2001.codfw.wmnet [18:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:20] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:34:51] (03PS1) 10Subramanya Sastry: Bump deployment tag for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/764863 (https://phabricator.wikimedia.org/T300133) [18:37:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:38:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:16] (03CR) 10MSantos: [C: 03+2] Bump deployment tag for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/764863 (https://phabricator.wikimedia.org/T300133) (owner: 10Subramanya Sastry) [18:44:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (10cmooney) > The ratelimit sounds similar, but the difference is that it's per-target... 
[18:45:09] (03PS1) 10Esanders: Enable mobile DT at ht.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764868 (https://phabricator.wikimedia.org/T302259) [18:45:53] 10SRE, 10LDAP-Access-Requests: Logstash Access for Ammarpad - https://phabricator.wikimedia.org/T302250 (10KFrancis) Hi @MatthewVernon would you please let me know @Ammarpad actual name? Thanks! [18:46:10] (03Merged) 10jenkins-bot: Bump deployment tag for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/764863 (https://phabricator.wikimedia.org/T300133) (owner: 10Subramanya Sastry) [18:49:14] !log ssastry@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [18:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:10] !log ssastry@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [18:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:15] !log ssastry@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [18:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:57] !log ssastry@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [18:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:30] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts logstash[1007-1009].eqiad.wmnet [18:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:37] o/ is anybody looking at the high wikidata maxlag? 
[18:56:10] !log ssastry@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [18:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:28] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:38] !log ssastry@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [18:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:58] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:05] dduvall and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T1900). [19:02:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [19:02:15] (03PS1) 10Ebernhardson: mjolnir: Restore prometheus_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/764872 (https://phabricator.wikimedia.org/T301873) [19:04:00] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2066.mgmt.codfw.wmnet with reboot policy FORCED [19:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:42] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:14] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) 
for hosts logstash[1007-1009].eqiad.wmnet [19:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:21] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `logstash[1007-1009].eqiad.wmnet` - logstash1007.eqiad.wmnet (**P... [19:11:55] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts logstash[2004-2006].codfw.wmnet [19:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:50] (03PS1) 10Addshore: Temp remove codfw [puppet] - 10https://gerrit.wikimedia.org/r/764875 (https://phabricator.wikimedia.org/T302330) [19:14:15] (03PS3) 10Ebernhardson: search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) [19:14:19] (03CR) 10Ebernhardson: search-platform: Port alerts from icinga (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [19:15:00] (03PS2) 10Addshore: Temp remove codfw from wikidata updateQueryServiceLag check [puppet] - 10https://gerrit.wikimedia.org/r/764875 (https://phabricator.wikimedia.org/T302330) [19:15:23] o/ anyone around that could deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/764875 Temp remove codfw from wikidata updateQueryServiceLag check ? 
[19:16:20] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.23 refs T300199 (duration: 49m 17s) [19:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:26] T300199: 1.38.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T300199 [19:16:43] (03CR) 10Tarrow: [C: 03+1] Temp remove codfw from wikidata updateQueryServiceLag check [puppet] - 10https://gerrit.wikimedia.org/r/764875 (https://phabricator.wikimedia.org/T302330) (owner: 10Addshore) [19:20:47] !log dduvall@deploy1002 Pruned MediaWiki: 1.38.0-wmf.21 (duration: 03m 50s) [19:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Volans) [19:22:39] (03PS2) 10Jdlrobson: [Vector] Enable table of contents on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764454 [19:23:25] addshore: dduvall is on train duty so he can probably help yout [19:23:29] -t [19:23:38] =] Hi dduvall ! [19:23:53] a puppet patch? cannot do, sorry :) [19:23:53] (03CR) 10Ryan Kemper: [C: 03+2] Temp remove codfw from wikidata updateQueryServiceLag check [puppet] - 10https://gerrit.wikimedia.org/r/764875 (https://phabricator.wikimedia.org/T302330) (owner: 10Addshore) [19:24:06] thanks ryankemper :) [19:24:08] oh, puppet... didn't notice that. 
[19:24:11] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash[2004-2006].codfw.wmnet [19:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:21] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `logstash[2004-2006].codfw.wmnet` - logstash2004.codfw.wmnet (**P... [19:24:29] i wish! or do i? not sure [19:24:32] we should turn these options into mw config options tarrow so we could deploy emergency things if needed [19:24:36] tarrow: ^^ [19:24:43] we should indeed [19:25:14] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) 05Open→03Resolved [19:25:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:32] !log T302330 `ryankemper@cumin1001:~$ sudo -E cumin '*mwmaint*' 'run-puppet-agent'` (getting https://gerrit.wikimedia.org/r/c/operations/puppet/+/764875 out) [19:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:37] T302330: Wikidata MaxLag above 10 for 1hr - https://phabricator.wikimedia.org/T302330 [19:25:56] (03PS2) 104nn1l2: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) [19:26:08] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:14] thanks ryankemper :) [19:27:31] (03CR) 104nn1l2: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [19:27:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [19:29:50] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:23] (03CR) 10Dzahn: [C: 03+1] "thanks, Stevie Beth, you are right about this, let Arnold merge it." [labs/private] - 10https://gerrit.wikimedia.org/r/764743 (https://phabricator.wikimedia.org/T293942) (owner: 10Kormat) [19:32:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:32:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:29] (03PS37) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [19:33:59] (03CR) 10AOkoth: [C: 03+1] Remove obsolete otrs.yaml hiera. 
[labs/private] - 10https://gerrit.wikimedia.org/r/764743 (https://phabricator.wikimedia.org/T293942) (owner: 10Kormat) [19:34:26] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:25] (03PS1) 10Addshore: Revert "Temp remove codfw from wikidata updateQueryServiceLag check" [puppet] - 10https://gerrit.wikimedia.org/r/764830 [19:39:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:22] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:43:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2066.mgmt.codfw.wmnet with reboot policy FORCED [19:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:14] (03PS5) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [19:49:30] (03CR) 10Ssingh: "This is the same weird issue we were running into earlier!" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [19:49:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2067.mgmt.codfw.wmnet with reboot policy FORCED [19:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:34] legoktm: hey, any updates re systemd recount_categories? Is it okay to leave it enabled? 
:-) [19:50:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:50:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:51] I think so [19:51:07] I'll update and reclose the ticket after work [19:52:24] (03CR) 10Ssingh: "Note that fake-private has all the relevant keys:" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [19:55:10] (03CR) 10Dzahn: P:wikidough: add monitoring for IPv6 endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [19:55:11] great :) [19:55:20] out for a while [19:55:39] sukhe: are you using the line that you say in the comment is the one that does NOT work? [19:55:53] mutante: sorry, it was confusing. webserver doesn't work, webserver_config does [19:56:09] the current patchset is failing for example: https://puppet-compiler.wmflabs.org/pcc-worker1003/33937/doh4002.wikimedia.org/change.doh4002.wikimedia.org.err [19:56:17] currently you have: Dnsdist::Webserver_config $webserver = lookup('profile::wikidough::dnsdist::webserver' [19:56:21] yep [19:56:41] Dnsdist::Webserver_config $webserver = lookup('profile::wikidough::dnsdist::webserver', {'merge' => hash}) [19:56:44] Doesn't work, for some reason. [19:56:48] same thing? 
[19:56:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:22] yes, this fails [19:57:34] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f0d05eb] (wcqs): Deploy 0.3.104 to WCQS [19:57:38] Dnsdist::Webserver_config $webserver_config = lookup('profile::wikidough::dnsdist::webserver', {'merge' => hash}) [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:41] works [19:57:52] dduvall: I forgot to sync up about the train, I apologize. It went fine, though I might not have triaged every single log message [19:57:58] oh, so you do NOT expect the latest PS to work. ok then [19:58:08] !log T302340 `scap deploy -v --environment wcqs 'Deploy 0.3.104 to WCQS'` [19:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:13] T302340: codfw wcqs updater failures - https://phabricator.wikimedia.org/T302340 [19:59:22] (03CR) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:00:16] (03PS1) 10JHathaway: Rename system::role to base::set_role_motd [puppet] - 10https://gerrit.wikimedia.org/r/764884 [20:00:34] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f0d05eb] (wcqs): Deploy 0.3.104 to WCQS (duration: 03m 00s) [20:00:37] hashar: no problem. thanks for the quick update [20:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:54] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:06] sukhe: got it now. checking labs/private... that's a very odd error [20:01:15] dduvall: though I did sync up with Jeena at end of last week. 
I guess I mixed up who ran the train previously with who was about to run it :-\ [20:02:19] (03CR) 10jerkins-bot: [V: 04-1] Rename system::role to base::set_role_motd [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [20:02:30] mutante: yep! [20:02:39] we had this one before as well and j.bond spent a lot of time but gave up [20:02:42] it's really weird [20:02:51] if he gives up.. that's not a good sign [20:03:09] haha but also because I was like let's not worry and we will just use webserver_config [20:03:16] almost like "webserver" is a reserved word but I find nothing [20:03:17] which again, I am fine to do now but yeah, it's bugging me :P [20:03:21] yep, nothing [20:03:44] so the reason I added you both was I was like let's start afresh and maybe I was missing something last time [20:04:10] (03PS4) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [20:04:57] sukhe: already aware you have both of these too? [20:04:58] hieradata/role/common/wikidough.yaml:profile::wikidough::dnsdist::webserver: [20:05:01] (03PS4) 10Ebernhardson: search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) [20:05:01] hieradata/role/common/wikidough.yaml:profile::wikidough::dnsdist_webserver: [20:05:23] hieradata/role/common/wikidough.yaml:profile::wikidough::webserver: [20:05:25] (03PS1) 10Dduvall: group0 wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764885 [20:05:27] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764885 (owner: 10Dduvall) [20:05:28] so there is 3 versions [20:05:43] yeah, these are the ones from our last experiment :P [20:05:51] I see [20:06:01] !log T302340 [WCQS] Forgot to fetch & rebase `deploy1002:/srv/deployment/wdqs/wdqs` before deploy, so `0.3.104` did not actually deploy (still on 
`0.3.103`). Re-rolling deploy... [20:06:05] (03PS5) 10Jbond: reposync: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [20:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:07] T302340: codfw wcqs updater failures - https://phabricator.wikimedia.org/T302340 [20:06:09] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.23 refs T300199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764885 (owner: 10Dduvall) [20:06:10] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@5d384a5] (wcqs): Deploy 0.3.104 to WCQS [20:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:44] and you are _not even changing the lookup_.. just the variable you are filling it with.. wtf [20:06:47] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10Gehel) [20:06:58] mutante: sorry, it's one of those ones :D [20:07:25] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.23 refs T300199 [20:07:26] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10Gehel) p:05Triage→03High [20:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:30] T300199: 1.38.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T300199 [20:08:06] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:44] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@5d384a5] (wcqs): Deploy 0.3.104 to WCQS (duration: 02m 33s) [20:08:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2067.mgmt.codfw.wmnet with reboot policy FORCED [20:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:50] !log T302340 [WCQS] Seeing `0.3.104` running on the hosts now [20:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2068.mgmt.codfw.wmnet with reboot policy FORCED [20:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:32] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:49] (03CR) 10jerkins-bot: [V: 04-1] reposync: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 (owner: 10Jbond) [20:11:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:41] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 105): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33938/console" [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [20:13:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:35] sukhe: are you sure you need "merge => hash" instead of default behaviour? 
I wonder if the issue is gone with default "first found wins" merge behavior [20:15:00] mutante: we need that because password and api_key are defined in private [20:15:06] (fake-private and actual private) [20:15:29] and also, this does work if I just set the var to webserver_config (!) [20:15:35] which makes me believe that the issue is there, I think [20:17:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:34] (03PS6) 10Jbond: spicerack: switch to push model [software/spicerack] - 10https://gerrit.wikimedia.org/r/764782 [20:18:19] (03CR) 10Ssingh: "Since I made it more confusing that it needs to be: https://gerrit.wikimedia.org/r/c/operations/puppet/+/764798/4..5" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:19:12] RECOVERY - Check systemd state on wcqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:44] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10skyenet) I have signed the L3 acknowledgment form. 
[20:24:32] (03PS1) 10Ssingh: hiera: add more varieties of snake oil for wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764888 [20:25:20] (03PS6) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [20:25:45] (03CR) 10Ssingh: [V: 03+2 C: 03+2] hiera: add more varieties of snake oil for wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764888 (owner: 10Ssingh) [20:26:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33939/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:26:54] !log begin opensearch upgrade (codfw) T299168 [20:26:57] (03CR) 10Ssingh: [V: 03+1] "So profile::wikidough::dnsdist::webserver2 works but not profile::wikidough::dnsdist::webserver2." [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:00] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [20:27:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2068.mgmt.codfw.wmnet with reboot policy FORCED [20:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2069.mgmt.codfw.wmnet with reboot policy FORCED [20:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:46] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Remove obsolete otrs.yaml hiera. [labs/private] - 10https://gerrit.wikimedia.org/r/764743 (https://phabricator.wikimedia.org/T293942) (owner: 10Kormat) [20:33:00] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger, Mglaser+ - https://phabricator.wikimedia.org/T302287 (10MarkAHershberger) >>! In T302287#7728896, @Dzahn wrote: > I don't think it is. 
That's just a regular access request. One person per ticket though, please. I thought I had r... [20:33:12] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10MarkAHershberger) [20:34:32] (03PS7) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [20:35:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33940/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:36:58] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye [20:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:03] (03PS1) 10Ssingh: hiera: cleanup wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764889 [20:40:20] (03PS2) 10Ssingh: hiera: cleanup wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764889 [20:40:31] (03CR) 10Ssingh: [V: 03+2 C: 03+2] hiera: cleanup wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/764889 (owner: 10Ssingh) [20:42:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:19] (03PS8) 10Ssingh: P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 [20:43:16] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33941/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:46:08] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33942/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:49:06] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:49:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:50:55] (03PS1) 10Ssingh: Revert "hiera: cleanup wikidough.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/764891 [20:51:09] sukhe: omg, that broke it? [20:51:42] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Revert "hiera: cleanup wikidough.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/764891 (owner: 10Ssingh) [20:51:56] mutante: yep :| [20:52:09] getting crazier?:p [20:52:13] ! [20:52:19] this will be one of the great mysteries of the world [20:53:03] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33943/console" [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [20:53:27] ^ ¯\_(ツ)_/¯ [20:54:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:55:06] now I could try to isolate which key in private is causing this to fail but no, I am not going to do that [20:55:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:13] instead I am going to rename it to webserver_config and enjoy my tea [20:57:45] (03PS1) 10JHathaway: Add Puppet Bolt 
hacks to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/764894 [20:57:56] sukhe: yea, in addition to ... fails to lookup "host, port and acl" but breaks when you remove "password and api_key" [20:58:53] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/764895 [21:00:04] mutante: I suspect it could be your theory of the merging the hash strategy but we do need it to fetch the private data [21:00:05] RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T2100). [21:00:05] eigyan and nn1l2: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] hi [21:00:09] also like I said, too lazy to try it out now :P [21:00:13] i can deploy today [21:00:13] Greetings all [21:00:20] hello eigyan [21:00:34] greetings urbanecm [21:00:45] Hey eigyan just saw your slack [21:00:49] (03PS9) 10Urbanecm: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:00:53] urbanecm: i also have 2 patches [21:00:55] not sure why no ping [21:00:55] (03CR) 10Urbanecm: [C: 03+2] [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:01:04] (03CR) 10JHathaway: [C: 03+2] Add Puppet Bolt hacks to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/764894 (owner: 10JHathaway) [21:01:06] hello Jdlrobson, i don't see them in the calendar? 
[21:01:10] i put them in tomorrows by accident [21:01:11] i'll move them [21:01:17] okay [21:01:21] we should have time for them :) [21:01:23] sukhe: yep :) all good [21:01:36] Greetings Jdlrobson glad to see you I hoped I didn't reach out to you too late [21:01:43] (03Merged) 10jenkins-bot: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:01:51] urbanecm: done - they are both beta cluster patches so should be straightforward [21:01:58] eigyan: no problem happy to help [21:02:18] eigyan: your patch is at mwdebug1001. Can you test it please? [21:02:31] Jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/764455 doesn't seem to be beta cluster specific? [21:02:51] urbanecm checking now [21:03:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2069.mgmt.codfw.wmnet with reboot policy FORCED [21:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:28] +1 to taavi's question [21:03:54] urbanecm is that where you run " mwscript shell.php fawiki" [21:04:08] eigyan: I don't understand your question? [21:04:25] you should use https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage and test your patch from the browser [21:04:55] (03CR) 10Ssingh: [V: 03+1] "To summarize, this is still failing like it was last time we tried to call it webserver and not webserver_config." 
[puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [21:05:19] taavi urbanecm it's removing config in production that is unused [21:05:29] so it's a noop [21:05:34] Jdlrobson: it's not beta only then [21:05:37] so it's not beta-only, but no-op [21:05:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:05:45] noop != beta-only [21:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1004.wikimedia.org with OS bullseye [21:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:58] let me rephrase what I was trying to say then: Both patches do not require any testing. [21:07:36] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:36] That makes more sense [21:07:46] urbanecm Jdlrobson just to be clear this is a production change [21:08:04] Yes we know yours is [21:08:06] eigyan: yeah, I'm aware :) [21:08:13] Jon has his own patches too in the queue [21:09:19] eigyan: how's your testing going [21:09:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [21:10:43] RhinosF1 I typically give these changes about 30min to propagate before testing in the browser [21:11:11] eigyan: they shouldn't take any time to propagate [21:11:31] (Also you really should have said that to urbanecm as we deploy patches one by one) [21:11:42] indeed. 
they should be immediately testable via ?quicksurvey= GET parameter [21:12:08] and yes, we can't afford to spend 30+ minutes testing -- the window has 60 minutes, and we have four patches today :) [21:12:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2076.mgmt.codfw.wmnet with reboot policy FORCED [21:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:51] (03CR) 10Herron: "interested in your thoughts and feedback on and approach like this" [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:13:18] urbanecm RhinosF1 thanks for letting me know and I am currently testing [21:13:38] Please keep the channel updated [21:15:36] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2025.codfw.wmnet, logstash2030.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:16:06] RhinosF1 will do, currently formatting a proper fawiki url [21:16:28] Ok [21:16:48] (03PS1) 10Papaul: Add new elastic2073 to elastic2082 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/764897 (https://phabricator.wikimedia.org/T299608) [21:17:04] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: add monitoring for IPv6 endpoints [puppet] - 10https://gerrit.wikimedia.org/r/764798 (owner: 10Ssingh) [21:17:44] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:19:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [21:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:36] !log end opensearch upgrade (codfw) T299168 [21:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:42] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [21:19:51] eigyan: how is your test going please? [21:19:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10Papaul) [21:20:25] cwhite: I assume that's why pybal alerted [21:20:30] (03CR) 10Papaul: [C: 03+2] Add new elastic2073 to elastic2082 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/764897 (https://phabricator.wikimedia.org/T299608) (owner: 10Papaul) [21:20:57] urbanecm: i'm confused.. did you make this available on the debug servers or sync? [21:21:02] (this change > https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/762881/) [21:21:15] Jdlrobson: the survey is at mwdebug1001 [21:21:26] it's not synced yet [21:21:31] ah then the url won't work [21:21:37] eigyan ^^ [21:21:50] (sorry to jump in) [21:21:55] it's working [21:21:56] hey mepps :). /me is confused [21:21:58] it can be synced [21:21:58] why shouldn't it work? 
[21:21:59] https://fa.wikipedia.org/wiki/%D8%AF%D9%86%DB%8C%D8%A7%DA%AF%DB%8C%D8%B1%DB%8C_%DA%A9%D9%88%D9%88%DB%8C%D8%AF-%DB%B1%DB%B9?quicksurvey=true [21:22:05] mepps ah that makes sense [21:22:09] https://usercontent.irccloud-cdn.com/file/uzfAKBa1/Screen%20Shot%202022-02-22%20at%201.22.05%20PM.png [21:22:15] mepps: why wouldn't it work [21:22:20] sorry i might be confused [21:22:23] You need to use the https://wikitech.wikimedia.org/wiki/WikimediaDebug extension [21:22:27] to test these changes [21:22:42] and point the server to mwdebug1001 [21:22:51] (I don't want to assume you've done a backport before here) [21:23:27] Jdlrobson that helps immensely [21:23:32] you won't be able to test the sampling until about 30 mins after the changes have rolled out [21:23:40] which I think was the confusion here. [21:23:52] Yes @jdl [21:24:10] eigyan: so, does it work on your side too? 🙂 [21:24:11] Jdlrobson in the past I have always tested using the 30 min window [21:24:30] We can go ahead and sync this [21:24:34] I can vouch the survey is working as configured [21:24:39] syncing [21:24:41] hmm i don't see it with the debug turned on, but it looks like you do jdlrobson? [21:24:46] thank you so much jdlrobson!! [21:25:01] https://usercontent.irccloud-cdn.com/file/BOC5aoX6/Screen%20Shot%202022-02-22%20at%201.24.56%20PM.png [21:25:06] mepps: I'm testing on the URL https://fa.wikipedia.org/wiki/%D8%AF%D9%86%DB%8C%D8%A7%DA%AF%DB%8C%D8%B1%DB%8C_%DA%A9%D9%88%D9%88%DB%8C%D8%AF-%DB%B1%DB%B9?quicksurvey=true&uselang=fa [21:25:10] eigyan: you can't use over 30 minutes of a window. I'm not sure who's given that idea. 
[21:25:13] https://fa.wikipedia.org/w/index.php?title=%DA%AF%D9%85%DB%8C%D9%86%D8%A7_%D8%A8%D9%84%DB%8C%DA%98%D9%86&quicksurvey=internal-gdi-safety-survey&useskin=vector&useskin=fa works for me too [21:25:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ee7608c7b56b579e2aaa50b504b6c2e28b63058e: Deploy the fawiki test safety survey to production (T297629) (duration: 00m 51s) [21:25:32] (03PS3) 10Urbanecm: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [21:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:34] RhinosF1: eigyan is not asking to use over 30 minutes in a window [21:25:35] T297629: Deploy the fawiki test safety survey to production - https://phabricator.wikimedia.org/T297629 [21:25:36] (03CR) 10Urbanecm: [C: 03+2] InitialiseSettings: General cleanup, wgRemoveGroups (A-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [21:25:48] he's merely stating that to fully test this over 30 mins are needed. [21:25:50] nn1l2: your patch is next! [21:26:02] thanks [21:26:05] (confidently) [21:26:18] (03Merged) 10jenkins-bot: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764465 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [21:26:37] Jdlrobson I'm in here trying to help test this  with eigyan as well. I've got the extension installed and am pointing my browser to https://fa.wikipedia.org/w/index.php?title=%D8%A8%D8%B1%D9%84%DB%8C%D9%86&quicksurvey=internal-gdi-safety-survey but can't seem to get the survey to come up. Clearly you have it working, so maybe I'm misunderstanding [21:26:37] what url I should be using? 
[21:26:49] JSherman: the URL I posted above [21:27:03] but you also need the browser extension installed [21:27:27] JSherman: have you got the browser extension installed? [21:27:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2076.mgmt.codfw.wmnet with reboot policy FORCED [21:27:29] and you need to enable the on/off slider (and ensure mwdebug1001.eqiad.wmnet is selected) [21:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:43] like at the screenshot from https://usercontent.irccloud-cdn.com/file/BOC5aoX6/Screen%20Shot%202022-02-22%20at%201.24.56%20PM.png [21:27:53] JSherman: https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions specifically [21:27:56] now, it won't make any differenced (patch is already deployed), but for future backports, it would [21:28:18] nn1l2: your patch is live at mwdebug1001, can you check it? [21:28:24] ok [21:28:34] (checking special:usergrouprights for few wikis would do, just in case something's really broken) [21:28:46] (03PS3) 10Urbanecm: [Vector] Enable table of contents on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764454 (owner: 10Jdlrobson) [21:29:12] Jdlrobson confirmed that this does work for me too. For whatever reason, it took a reload. I had a working extension and url. 
[21:29:18] (03CR) 10Urbanecm: [C: 03+2] [Vector] Enable table of contents on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764454 (owner: 10Jdlrobson) [21:29:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2077.mgmt.codfw.wmnet with reboot policy FORCED [21:29:39] ditto Jsherman [21:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:42] JSherman: yeah, you need to load the page with the extension enabled. it doesn't work clients-side unfortunately :)) [21:29:59] (03Merged) 10jenkins-bot: [Vector] Enable table of contents on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764454 (owner: 10Jdlrobson) [21:30:10] Thank you everybody for walking us through this! [21:30:18] I checked Arabic Wikipedia as an example, and it looked good to me [21:30:18] my pleasure [21:30:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:30:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:30:33] Yes, thank you all for helping this newbie out :) [21:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:46] If you are still a little clueless about any of this, please feel free to schedule something in my calendar the next time you do a deploy and we can do a screen share [21:30:51] Happy everyone could help [21:31:05] * RhinosF1 does suggest taking up Jon's offer [21:31:30] eigyan: I'm also more than happy to answer any questions about deployment (after i finish with the rest of the patches though :)). 
[21:31:34] eigyan: so now that it's synced, this means your changes are live everywhere in production which means the survey will show up in an incognito window without the extension installed: https://fa.wikipedia.org/wiki/%D8%AF%D9%86%DB%8C%D8%A7%DA%AF%DB%8C%D8%B1%DB%8C_%DA%A9%D9%88%D9%88%DB%8C%D8%AF-%DB%B1%DB%B9?quicksurvey=true&uselang=fa [21:31:36] Jdlrobson I have training scheduled :) [21:31:51] which also means people are seeing the survey [21:31:57] thanks nn1l2, syncing! [21:32:03] thanks! [21:32:34] Jdlrobson the T&S tools team are all on the deploy training calendar for thursday, and I will absolutely take you up on a screenshare the next time I've got something to deploy. Thanks! [21:32:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6d1d9a9ee2d633cf67e81fd2277deb4a61b87891: InitialiseSettings: General cleanup, wgRemoveGroups (A-D) (T301647) (duration: 00m 50s) [21:32:58] nn1l2: should be live! [21:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:00] T301647: Clean up InitialiseSettings - https://phabricator.wikimedia.org/T301647 [21:33:00] JSherman: has anyone filled out a phab task [21:33:20] RhinosF1 i spoke directly to tyler and we're on the calendar :) [21:33:23] Jdlrobson: just to confirm, your two patches can be synced immediately, right? [21:33:29] urbanecm: yep [21:33:33] okay, doing [21:33:35] mepps: fantastic! [21:33:37] Thanks! [21:33:57] RhinosF1 I'm excited :D [21:34:02] mepps eigyan: do you have access to https://hue.wikimedia.org/hue/editor/?type=hive ? 
[21:34:17] mepps let me check that [21:34:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:27] mepps: good [21:34:43] (03PS2) 10Urbanecm: [Cleanup] Remove non-existent config wgVectorUseWvuiSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764455 (owner: 10Jdlrobson) [21:34:47] (03CR) 10Urbanecm: [C: 03+2] [Cleanup] Remove non-existent config wgVectorUseWvuiSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764455 (owner: 10Jdlrobson) [21:34:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 717232793e002ba501a3cbd2be96255760e14ba2: [Vector] Enable table of contents on beta cluster (duration: 00m 50s) [21:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:31] urbanecm: hey, may I sneak in a config patch? [21:35:40] mepps I do not have access to that page actually [21:35:42] zabe: sure. Can you put it in the calendar please? [21:35:42] (03Merged) 10jenkins-bot: [Cleanup] Remove non-existent config wgVectorUseWvuiSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764455 (owner: 10Jdlrobson) [21:35:53] done [21:36:00] mepps: eigyan i'll start a slack thread with the two of you to reduce the noise here [21:36:02] eigyan do you mean the hive link from jdlrobson? [21:36:07] perfect jdlrobson :) [21:36:18] (03CR) 10Cwhite: [C: 03+2] logstash: drop knative_dev/key field [puppet] - 10https://gerrit.wikimedia.org/r/764857 (https://phabricator.wikimedia.org/T288549) (owner: 10Cwhite) [21:36:27] urbanecm I had the extension enabled and set 1001 on the first page load, but for whatever reason I wasn't seeing the survey initially. [21:37:08] interesting [21:37:19] and a refresh fixed it? 
[21:38:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 99f244c9539a5ae2af0bd9dddb8aae45dbc44704: [Cleanup] Remove non-existent config wgVectorUseWvuiSearch (duration: 00m 50s) [21:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:23] Jdlrobson: both patches should be live now [21:38:28] urbanecm: thanks! [21:38:34] happy to help :) [21:38:45] i'll keep an eye on the logs in case there's any beta cluster fatals related to the change. [21:38:54] thanks! [21:38:54] urbanecm yes,  though it sounds like maybe it already rolled out by the time I was able to get it to load? [21:39:00] (03CR) 10Urbanecm: [C: 03+2] filebackend: migrate $wmfSwift* to $wmgSwift* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [21:39:14] maybe? [21:39:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:35] mepps, JSherman: I see it on the google calendar. I might try and pop in on irc to make sure you enjoy it. Thursday is my busy day though. [21:39:41] (03Merged) 10jenkins-bot: filebackend: migrate $wmfSwift* to $wmgSwift* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [21:39:47] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10SecTeam-Processed, 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) >>! In T301428#7716407, @Mstyles wrote: > It seems like the simplest way forward for us would be to use the existing [[ https://wikitech... [21:39:50] thanks RhinosF1! [21:40:01] zabe: should be at mwdebug1001. can you check please? 
[21:40:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:40:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:12] JSherman: if you want to test the extension on your end (or anyone else!), i can put another change for you at mwdebug1001 after i finish the last patch (so you can get more confident with the extension) [21:41:16] let me know if that would be helpful [21:41:28] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10dpifke) 05Open→03Resolved [21:41:31] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dpifke) [21:41:42] urbanecm: lgtm, nothing seems to break and logstash seems to be clear [21:41:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:41:48] zabe: great, syncing [21:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:55] please watch the error logs for a while, just in case [21:42:03] yup [21:42:23] urbanecm that would actually be awesome. I'm not sure how to tell when an issue is because the code isn't working or if there is a problem client side [21:43:05] !log urbanecm@deploy1002 Synchronized wmf-config/filebackend.php: 91b81ac9dc42893c872f09620566379ab6158f12: filebackend: migrate $wmfSwift* to $wmgSwift* (T45956) (duration: 00m 52s) [21:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:11] zabe: live! 
[21:43:11] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [21:43:15] and, that was the last patch [21:43:32] thx :) [21:44:18] JSherman: okay, let's do that then. [21:44:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2077.mgmt.codfw.wmnet with reboot policy FORCED [21:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:52] (03PS1) 10Urbanecm: DNM: Testing patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764901 [21:45:15] I pulled the above patch (https://gerrit.wikimedia.org/r/764901) to mwdebug1001. It creates a new user group at test.wikipedia that's called `testgroup` [21:45:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS stretch [21:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:26] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch [21:45:31] the patch should be testable at https://test.wikipedia.org/wiki/Special:ListGroupRights (a new group should show up) [21:45:38] urbanecm looking now [21:46:15] (03PS2) 10JHathaway: Rename system::role to base::add_motd_role [puppet] - 10https://gerrit.wikimedia.org/r/764884 [21:46:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:59] let me know if you have any questions or anything i can help wiuth [21:47:04] *with [21:47:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host elastic2078.mgmt.codfw.wmnet with reboot policy FORCED [21:47:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:03] urbanecm yes, I can see the testgroup with the extension on, and not see it with the extension off [21:48:34] i see it too (lurking to also practice and learn) [21:48:36] (03CR) 10jerkins-bot: [V: 04-1] Rename system::role to base::add_motd_role [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [21:49:03] urbanecm what was the test you mentioned, I was in a few other conversations and missed your message [21:49:23] JSherman: mepps: great! Is the extension behaving as you would expect? [21:49:38] eigyan if you go to https://test.wikipedia.org/wiki/Special:ListGroupRight [21:49:44] Thanks urbanecm for doing this :-) [21:49:47] and turn on the debug extension [21:49:59] you'll be able to see testgroup show up with the extension on, but not with it off [21:50:16] it is urbanecm, but it took a couple reloads to see it [21:50:34] yes a big thank you urbanecm [21:50:34] eigyan: for context, i offered people to pull https://gerrit.wikimedia.org/r/764901 to a debug server, so we can test how the extension behaves [21:51:06] I'm happy to help :) [21:51:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:51:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:52] urbanecm aha I see thank you for that, that is very helpful [21:52:07] mepps: it's interesting it takes a few reloads. 
For me, it works immediately [21:52:19] * thcipriani lurks after realizing he gets notified by the word "tyler" :D [21:52:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:25] urbanecm yes, this is exactly the behavior I'd expect. For whatever reason our changes weren't loading up initially, but the whole thing is working very consistently for me. FWIW, I have the extension in a browser that I only use occasionally, so it's possible that I had something janky going on my end. [21:52:39] *working very consistently now* [21:52:39] oops sorry thcipriani! [21:53:10] mepps: no problem at all, happy to lurk [21:53:20] i'm just glad it's working urbanecm :) [21:53:32] JSherman: good to know. Maybe QuickSurveys starts to ship some frontend modules (which takes some time)? [21:54:59] urbanecm yeah it includes a client-side vue.js app, so I could see that first payload taking a bit since the client cache for that fqdn is cold right after deploy [21:55:05] makes sense [21:55:14] do you want me to enable a survey for another wiki for testing? [21:55:20] (at mwdebug1001 only, of course) [21:57:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:41] JSherman: ^^, in case you missed my message :) [21:58:17] urbanecm that would be great [21:58:25] okay, let me quickly do that [21:59:47] i'm signing off but thank you JSherman and again a big thank you to urbanecm :) [22:00:14] happy to help and see you later mepps! 
[22:00:26] (03PS2) 10Urbanecm: DNM: Testing patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764901 [22:00:34] JSherman: okay, I updated the test patch and pulled it to mwdebug1001 [22:00:42] it enables the same survey at ar.wikipedia [22:00:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [22:00:56] urbanecm looking now, thanks! [22:02:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye [22:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:36] urbanecm I'm actually having trouble navigating to an article page on ar. Survey's can't deploy to the main page or special pages [22:04:04] i usually go to https://ar.wikipedia.org/wiki/Special:RandomPage manually in similar cases [22:04:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:04:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:13] (by manually, i mean by actually typing the full URL out) [22:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:37] it also preserves GET parameters, so you can do https://ar.wikipedia.org/wiki/Special:RandomPage?quicksurvey=internal-gdi-safety-survey (in this case) and test that way [22:04:52] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dduvall) a:05dduvall→03None [22:05:26] urbanecm I am able to see the survey at the above url, many thanks to you! [22:05:33] urbanecm helpful! thanks! 
[22:05:36] no problem! [22:05:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [22:07:11] so, looks it works fine for both? [22:07:46] urbanecm yes, the random article worked at the first page load, so I'm going to chalk whatever I had going on as a client/network blip [22:07:50] (03PS1) 10DLynch: Avoid undefined index for mobileformat [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764833 (https://phabricator.wikimedia.org/T302344) [22:07:54] urbanecm yes it does [22:07:59] great! [22:08:08] (03PS1) 10DLynch: Avoid undefined index for mobileformat [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764834 (https://phabricator.wikimedia.org/T302344) [22:08:37] eigyan: JSherman: are there any other config changes you'd like to play with now? [22:09:15] urbanecm I will save my appetite for the upcoming training :) [22:09:29] okay :) [22:09:30] urbanecm I really appreciate your extra time and care on this. thank you. I feel much more confident on how to use the extension across projects. [22:09:44] I'm very happy to hear that JSherman. [22:09:45] ^^ Ditto urbanecm [22:10:05] I'm out of here for today, folks. [22:10:08] (03PS1) 10Dwisehaupt: Add dns entries for civi1002 [dns] - 10https://gerrit.wikimedia.org/r/764905 (https://phabricator.wikimedia.org/T296409) [22:10:25] If I am not needed I will sign off as well [22:10:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:10:39] just one last thing :) [22:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:47] thank you for your support JSherman [22:10:56] urbanecm yes [22:12:03] JSherman: eigyan: as a friendly advice for future deployments, do feel free to ask any questions during the window. 
I'm always happy to answer them – it's just very hard for me to explain something when I don't know it needs to be explained :) [22:12:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2078.mgmt.codfw.wmnet with reboot policy FORCED [22:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:25] urbanecm I will definitely keep that in mind as we move forward. I learned a lot here tonight [22:13:34] ^ +1 to urbanecm 's advice, and I'll add: if you have a question, you're probably not the only one, so asking publicly benefits all the lurkers :D [22:13:47] :) [22:14:14] I'm happy to hear that eigyan. See you later! [22:14:44] urbanecm take care and see you next time [22:14:44] i removed the testing patch from mwdebug1001 now [22:14:50] you too! [22:15:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2067.codfw.wmnet with OS stretch [22:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:20] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be20[66-69] - https://phabricator.wikimedia.org/T299468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2067.codfw.wmnet with OS stretch executed with errors: - m... [22:15:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:21] (03CR) 10Jgreen: [C: 03+2] Add dns entries for civi1002 [dns] - 10https://gerrit.wikimedia.org/r/764905 (https://phabricator.wikimedia.org/T296409) (owner: 10Dwisehaupt) [22:16:37] urbanecm ack, wilco. I'd say my biggest challenge is just tracking the discussion while trying to do the steps necessary on my end. 
The channel moves fast and has multiple discussion happening simultaneously, so sometimes I find myself scrolling back to understand where the process is at (or moving on to testing in another window) only to come back [22:16:37] and find that someone has asked something of me that I have inadvertently ghosted them on. [22:17:31] Which is to say, not being an irc native has been my biggest barrier so far (IMO) [22:18:24] JSherman: I understand that. A lot of the bots in this channel can probably be ignored by your IRC client (by /ignore). I'm also sure that B&C deployers (I would at least :)) would be happy to ensure they ping you in every message they send to you, to increase the chance it's not missed [22:19:37] urbanecm that is great advice. I'll bake that into my role call when I've got something to deploy from here on out [22:20:08] sounds like a plan :) [22:20:27] I'm out; thanks 1,000,000! [22:20:33] see you later JSherman! [22:22:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:22:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:10] jouncebot: now [22:32:10] No deployments scheduled for the next 9 hour(s) and 27 minute(s) [22:32:18] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1004.wikimedia.org with OS bullseye [22:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:10] urbanecm: just double checking, are you all done with deployments? 
i'd like to merge/sync a couple of wmf.23 fixes [22:33:20] dduvall: go ahead [22:33:25] thanks! [22:33:54] (03CR) 10Dduvall: [C: 03+2] Avoid undefined index for mobileformat [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764833 (https://phabricator.wikimedia.org/T302344) (owner: 10DLynch) [22:33:54] (03CR) 10Dduvall: [C: 03+2] Avoid undefined index for mobileformat [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764834 (https://phabricator.wikimedia.org/T302344) (owner: 10DLynch) [22:40:23] dduvall: by any chance, do you know whether train will move forward soon? I see this is for the blocker, so...maybe? [22:41:04] we're on group0 still. i never rolled back since this is most probably just logspam [22:41:44] Why did i think it's Wednesday... [22:41:52] we should be on for the usual group1 promotion tomorrow [22:42:01] Good to know [22:45:43] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1004.wikimedia.org with OS bullseye [22:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:45] (03Merged) 10jenkins-bot: Avoid undefined index for mobileformat [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764833 (https://phabricator.wikimedia.org/T302344) (owner: 10DLynch) [22:48:46] (03Merged) 10jenkins-bot: Avoid undefined index for mobileformat [extensions/DiscussionTools] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/764834 (https://phabricator.wikimedia.org/T302344) (owner: 10DLynch) [22:52:21] !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/DiscussionTools/includes/ApiDiscussionToolsEdit.php: Backport: [[gerrit:764834|DiscussionTools: Avoid undefined index for mobileformat ([T302344])]] (duration: 00m 51s) [22:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:27] T302344: PHP Notice: Undefined index: mobileformat - 
https://phabricator.wikimedia.org/T302344 [22:53:49] !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: Backport: [[gerrit:764833|VisualEditor: Avoid undefined index for mobileformat ([T302344])]] (duration: 00m 49s) [22:53:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:58:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage [23:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:29] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Legoktm) >>! In T302231#7729265, @Urbanecm wrote: >>>! In T302231#7729155, @Ladsgroup wrote: >> Well, I think we should be more inclusive in our deployers, not every deployer need to ha... 
[23:20:25] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1004.wikimedia.org with reason: host reimage [23:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:45] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test