[00:01:44] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [00:10:17] (03PS2) 10Cwhite: remove deprecated piechart plugin [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763334 (https://phabricator.wikimedia.org/T282863) [00:10:19] (03PS2) 10Cwhite: update grafana-image-renderer to 3.3.0 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) [00:10:21] (03PS2) 10Cwhite: update grafana-simple-json-datasource to 1.4.2 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763337 (https://phabricator.wikimedia.org/T282863) [00:10:23] (03PS1) 10Cwhite: use grafana api for worldmap plugin artifact [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763618 (https://phabricator.wikimedia.org/T282863) [00:10:25] (03PS1) 10Cwhite: Update changelog and build instructions [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763619 [00:11:38] (03CR) 10Cwhite: [V: 03+2] Update changelog and build instructions [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763619 (owner: 10Cwhite) [00:14:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [00:24:20] razzi: ^ is that ther 1003 host ? [00:37:20] (03CR) 10Krinkle: [C: 03+1] "Yep, no longer needed. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/763561 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [00:47:20] (03PS1) 10Andrew Bogott: cloudcontrols: override profile::openstack::codfw1dev::keystone::wsgi_server: 'keystone' [puppet] - 10https://gerrit.wikimedia.org/r/763629 (https://phabricator.wikimedia.org/T281276) [00:48:00] (03CR) 10jerkins-bot: [V: 04-1] cloudcontrols: override profile::openstack::codfw1dev::keystone::wsgi_server: 'keystone' [puppet] - 10https://gerrit.wikimedia.org/r/763629 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [00:51:28] (03PS2) 10Andrew Bogott: cloudcontrols: override profile::openstack::eqiad1::keystone::wsgi_server: [puppet] - 10https://gerrit.wikimedia.org/r/763629 (https://phabricator.wikimedia.org/T281276) [00:51:58] (03CR) 10Dzahn: [C: 03+2] Remove unused module xvfb [puppet] - 10https://gerrit.wikimedia.org/r/763561 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [00:54:32] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrols: override profile::openstack::eqiad1::keystone::wsgi_server: [puppet] - 10https://gerrit.wikimedia.org/r/763629 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [01:05:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10Papaul) [01:14:08] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:33:12] hey y'all, I need to talk to a sysadmin fairly urgently ref security ticket T302047 - getting botted hard on enwp and the edit filters are throttling [01:37:10] apologies in advance! 
[01:37:32] #page - T302047, need to mitigate a botnet on enwp [01:38:00] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:40:22] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [01:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [01:53:17] @reedy awake? 
[01:56:42] (03PS1) 10BBlack: Emergency reduction of AF EmergencyDisableAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763632 (https://phabricator.wikimedia.org/T302047) [01:57:37] (03PS1) 10CDanis: Disable AbuseFilter throttling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763633 (https://phabricator.wikimedia.org/T302047) [01:57:47] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [01:58:23] (03PS2) 10CDanis: Disable AbuseFilter throttling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763633 (https://phabricator.wikimedia.org/T302047) [01:59:10] (03CR) 10BBlack: [C: 03+1] Disable AbuseFilter throttling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763633 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [01:59:48] (03Abandoned) 10BBlack: Emergency reduction of AF EmergencyDisableAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763632 (https://phabricator.wikimedia.org/T302047) (owner: 10BBlack) [02:01:05] (03CR) 10Dzahn: [C: 03+1] Disable AbuseFilter throttling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763633 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [02:01:13] (03CR) 10CDanis: [C: 03+2] Disable AbuseFilter throttling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763633 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [02:01:55] (03Merged) 10jenkins-bot: Disable AbuseFilter throttling on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763633 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [02:03:36] !log cdanis@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable AbuseFilter throttling on enwiki 6692b4642 T302047 (duration: 00m 49s) [02:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:49] !log pt1979@cumin2002 START - Cookbook 
sre.dns.netbox [02:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:09] @cdanis @bblack: We might need emergency captcha as well. [02:05:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:55] Seddon: okay, for IP edits only? I'll try to figure that out [02:06:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:06:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:51] @cdanis wmgEmergencyCaptcha [02:07:38] Seddon: perfect ty [02:07:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:24] cdanis: I think 'enwiki' => true, [02:08:56] (03PS1) 10CDanis: enable wmgEmergencyCaptcha for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763634 (https://phabricator.wikimedia.org/T302047) [02:09:29] (03CR) 10BBlack: [C: 03+1] enable wmgEmergencyCaptcha for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763634 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [02:09:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:52] (03CR) 10CDanis: [C: 03+2] enable wmgEmergencyCaptcha for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763634 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [02:10:54] (03Merged) 10jenkins-bot: enable wmgEmergencyCaptcha for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763634 
(https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [02:11:46] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:12:12] !log cdanis@deploy1002 Synchronized wmf-config/InitialiseSettings.php: enable wmgEmergencyCaptcha for enwiki ff2f7ef64 T302047 (duration: 00m 49s) [02:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:14:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:03] (03PS1) 10BBlack: block a bad UA [puppet] - 10https://gerrit.wikimedia.org/r/763635 (https://phabricator.wikimedia.org/T302047) [02:22:14] (03CR) 10CDanis: [C: 03+1] block a bad UA [puppet] - 10https://gerrit.wikimedia.org/r/763635 (https://phabricator.wikimedia.org/T302047) (owner: 10BBlack) [02:22:49] (03CR) 10BBlack: [C: 03+2] block a bad UA [puppet] - 10https://gerrit.wikimedia.org/r/763635 (https://phabricator.wikimedia.org/T302047) (owner: 10BBlack) [02:23:56] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [02:31:12] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following 
units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:41:40] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 7.073e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:48:04] (03PS1) 10Andrew Bogott: openstack::designate::dns_floating_ip_updater: remove python-ipaddress dep [puppet] - 10https://gerrit.wikimedia.org/r/763638 [02:49:51] (03CR) 10Andrew Bogott: [C: 03+2] openstack::designate::dns_floating_ip_updater: remove python-ipaddress dep [puppet] - 10https://gerrit.wikimedia.org/r/763638 (owner: 10Andrew Bogott) [02:54:10] (03PS1) 10Andrew Bogott: openstack::designate::dns_floating_ip_updater: remove python-ipaddress dep [puppet] - 10https://gerrit.wikimedia.org/r/763639 [02:54:51] (03CR) 10jerkins-bot: [V: 04-1] openstack::designate::dns_floating_ip_updater: remove python-ipaddress dep [puppet] - 10https://gerrit.wikimedia.org/r/763639 (owner: 10Andrew Bogott) [02:56:58] (03PS2) 10Andrew Bogott: openstack::designate::dns_floating_ip_updater: remove python-ipaddress dep [puppet] - 10https://gerrit.wikimedia.org/r/763639 [02:58:36] (03CR) 10Andrew Bogott: [C: 03+2] openstack::designate::dns_floating_ip_updater: remove python-ipaddress dep [puppet] - 10https://gerrit.wikimedia.org/r/763639 (owner: 10Andrew Bogott) [03:13:04] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:16:52] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:58:54] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:00:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:36] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.14e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:09:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:16] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:25:44] (03PS1) 10Andrew Bogott: OpenStack nova: Increase timeout for check_flavor_properties [puppet] - 10https://gerrit.wikimedia.org/r/763640 [05:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:54:56] (03PS1) 10Legoktm: Use hash page in IRC message [software/klaxon] - 10https://gerrit.wikimedia.org/r/763643 [05:55:48] (03CR) 10jerkins-bot: [V: 04-1] Use hash page in IRC message [software/klaxon] - 10https://gerrit.wikimedia.org/r/763643 (owner: 10Legoktm) [05:57:47] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [06:31:20] (03PS1) 10Kevin Bazira: ml-services: add eswiki & eswikiquote editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/763647 (https://phabricator.wikimedia.org/T301415) 
[06:53:12] (03PS2) 10Legoktm: Use hash page in IRC message [software/klaxon] - 10https://gerrit.wikimedia.org/r/763643 [06:53:14] (03PS1) 10Legoktm: Unbreak tests by pinning itsdangerous [software/klaxon] - 10https://gerrit.wikimedia.org/r/763649 [06:59:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10elukey) [07:00:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10elukey) @papaul we are moving all the kubernetes hosts to Bullseye, I modified the task's description with this info + partitioning. Thanks! [07:01:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10elukey) [07:29:14] (03PS1) 10Elukey: helmfile.d: allow wikibooks/wiktionary in ml-serve's egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/763653 [07:38:44] (03CR) 10Elukey: [C: 03+2] helmfile.d: allow wikibooks/wiktionary in ml-serve's egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/763653 (owner: 10Elukey) [07:40:46] (03PS1) 10Majavah: metricsinfra: Use prometheus-configurator [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) [07:41:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:41] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[07:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [07:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:43] (03CR) 10jerkins-bot: [V: 04-1] metricsinfra: Use prometheus-configurator [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) (owner: 10Majavah) [07:44:37] (03PS2) 10Majavah: metricsinfra: Use prometheus-configurator [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) [07:48:09] (03PS1) 10Elukey: helmfile.d: allow wikiquote.org in ml-serve's egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/763687 [07:54:14] (03PS2) 10Kevin Bazira: ml-services: add eswiki & eswikiquote editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/763647 (https://phabricator.wikimedia.org/T301415) [07:55:33] (03CR) 10Elukey: [C: 03+2] helmfile.d: allow wikiquote.org in ml-serve's egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/763687 (owner: 10Elukey) [07:57:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [07:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:57] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [07:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220218T0800) [08:02:00] (03CR) 10Filippo Giunchedi: "A quick note to mention that grafana-image-renderer 3.4 was released just yesterday" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:02:53] (03CR) 10Filippo Giunchedi: [C: 03+1] use grafana api for worldmap plugin artifact [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763618 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:03:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Update changelog and build instructions [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763619 (owner: 10Cwhite) [08:05:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks for taking a look!" [puppet] - 10https://gerrit.wikimedia.org/r/763541 (https://phabricator.wikimedia.org/T301657) (owner: 10MVernon) [08:11:45] (03CR) 10Elukey: [C: 03+2] ml-services: add eswiki & eswikiquote editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/763647 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [08:19:04] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:46] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[08:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:07] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Peachey88) [08:24:43] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Unrack wmf3570 & wmf4579 - https://phabricator.wikimedia.org/T302034 (10Peachey88) [08:31:13] (03PS2) 10KartikMistry: Update cxserver to 2022-02-15-050044-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/762575 (https://phabricator.wikimedia.org/T301443) [08:32:15] Updating cxserver: Time bound deployment without any major changes. [08:35:01] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-02-15-050044-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/762575 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [08:38:32] (03Merged) 10jenkins-bot: Update cxserver to 2022-02-15-050044-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/762575 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [08:39:18] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:55] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:06] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1022.eqiad.wmnet with OS buster [08:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:40] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - 
https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1022.eqiad.wmnet with OS buster [08:47:08] (03CR) 10MVernon: [C: 03+2] swift: use rsyslog-rotate to get rsyslog to close old files [puppet] - 10https://gerrit.wikimedia.org/r/763541 (https://phabricator.wikimedia.org/T301657) (owner: 10MVernon) [08:47:34] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:47] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:11] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:33] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [08:52:54] 10SRE-swift-storage, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [08:53:52] !log Updated cxserver to 2022-02-15-050044-production (T301443) [08:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:57] T301443: Enable Flores for Occitan and Luganda - https://phabricator.wikimedia.org/T301443 [08:54:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2002.codfw.wmnet [08:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2002.codfw.wmnet [08:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:40] !log 
jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1022.eqiad.wmnet with reason: host reimage [08:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:12] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) a:05ayounsi→03RobH What's left to do on the network side: ` 1/ cr1-drmrs:et-0/0/2 currently connected to: asw1-b12-drmrs:et... [09:01:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2001.codfw.wmnet [09:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1022.eqiad.wmnet with reason: host reimage [09:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:29] (03CR) 10David Caro: [C: 03+2] parsoid: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:10:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [09:13:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) The current leftover modules are two patches waiting calmly for @mpopov to review (whenever back on the job), will retake then. [09:18:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MMandere) Thank you @JBennett for the approval. @Michael.Hay please sign the [[ https://phabricator.wikimedia.org/L3 | L3 ]] acknowledgement form for us to proceed gran... 
[09:18:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MMandere) [09:21:08] (03PS2) 10Filippo Giunchedi: am: remove Icinga/ prefix and add 'source' label [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763459 (https://phabricator.wikimedia.org/T300951) [09:21:10] (03PS1) 10Ayounsi: drmrs: Anycast tuning for Tata [homer/public] - 10https://gerrit.wikimedia.org/r/763696 [09:25:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10MMandere) Thank you @Milimetric for the approval. @Damiendf, perhaps you could go ahead and sign the [[ https://phabricator.wikimedia.org/L3 | L3 ]] acknowledgment form as... [09:25:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10MMandere) [09:28:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MMandere) [09:30:01] (03PS2) 10Ayounsi: drmrs: Anycast tuning for Tata [homer/public] - 10https://gerrit.wikimedia.org/r/763696 [09:30:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MMandere) Thank you @JBennett for the approval. @skyenet please sign the [[ https://phabricator.wikimedia.org/L3 | L3 ]] acknowledgment from for us to proceed processing you... 
[09:32:07] (03PS1) 10Filippo Giunchedi: thanos: add relabels to rule [puppet] - 10https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) [09:32:09] (03PS1) 10Filippo Giunchedi: prometheus: inject 'source' label to alerts [puppet] - 10https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) [09:33:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1022.eqiad.wmnet with OS buster [09:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:58] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1022.eqiad.wmnet with OS buster completed: - ganeti1022 (**PASS**)... [09:34:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10MMandere) @Tom_Magerlein, perhaps you could go ahead and sign the [[ https://phabricator.wikimedia.org/L3 | L3 ]] acknowledgment form as we await for @JBennett and @Milimetr... [09:35:04] !log draining instances off ganeti1009 [09:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10MMandere) @Klebeck-tmlt , perhaps you could go ahead and sign the [[ https://phabricator.wikimedia.org/L3 | L3 ]] acknowledgment form as we await for @JBennet... 
[09:36:35] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33853/console" [puppet] - 10https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [09:36:50] (03PS2) 10Filippo Giunchedi: thanos: add relabels to rule [puppet] - 10https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) [09:36:52] (03PS2) 10Filippo Giunchedi: prometheus: inject 'source' label to alerts [puppet] - 10https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) [09:38:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33854/console" [puppet] - 10https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [09:40:24] (03PS1) 10JMeybohm: Revert "Enable nodePort 30021 for ingressgateway status" [deployment-charts] - 10https://gerrit.wikimedia.org/r/763700 (https://phabricator.wikimedia.org/T290966) [09:40:26] (03PS1) 10JMeybohm: Increase istiod replicas to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/763701 (https://phabricator.wikimedia.org/T290966) [09:42:44] (03CR) 10Ayounsi: "NOOP in eqsin, pushed to drmrs (even though we don't advertise anycast prefixes from there yet)." 
[homer/public] - 10https://gerrit.wikimedia.org/r/763696 (owner: 10Ayounsi) [09:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:44:04] (03CR) 10jerkins-bot: [V: 04-1] Increase istiod replicas to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/763701 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:49:08] (03PS2) 10JMeybohm: Increase istiod replicas to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/763701 (https://phabricator.wikimedia.org/T290966) [09:56:27] (03CR) 10JMeybohm: [C: 03+2] Revert "Enable nodePort 30021 for ingressgateway status" [deployment-charts] - 10https://gerrit.wikimedia.org/r/763700 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:56:30] (03CR) 10JMeybohm: [C: 03+2] Increase istiod replicas to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/763701 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:57:47] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [10:00:45] (03Merged) 10jenkins-bot: Revert "Enable nodePort 30021 for ingressgateway status" [deployment-charts] - 10https://gerrit.wikimedia.org/r/763700 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:00:52] (03Merged) 10jenkins-bot: Increase istiod replicas to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/763701 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:00:58] !log deploying schema change to s2 T300774 [10:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:04] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:01:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1105.eqiad.wmnet with reason: Maintenance [10:01:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21011 and previous config saved to /var/cache/conftool/dbconfig/20220218-100135-kormat.json [10:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:17:52] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: 10Michael Große) [10:18:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Should be good to deploy on Monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: 10Michael Große) [10:19:24] (03CR) 10TAndic: "Hi Jon, Tanja here from GDI -" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [10:19:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21012 and previous config saved to /var/cache/conftool/dbconfig/20220218-101945-kormat.json [10:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:52] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:20:32] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [10:20:35] !log truncate /var/log/swift/server.log.1 to 30G due to full root fs - T301657 [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:40] T301657: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 [10:20:42] Emperor: ^ [10:22:18] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:29] (03PS1) 10JMeybohm: Drop unused ports from istio-ingressgateway service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/763705 (https://phabricator.wikimedia.org/T290966) [10:30:45] (03CR) 10David Caro: [C: 03+1] "LGTM, there's some leftover hiera stuff on cloud-instance-puppet though:" [puppet] - 10https://gerrit.wikimedia.org/r/762829 (https://phabricator.wikimedia.org/T298191) (owner: 10Majavah) [10:31:43] (03CR) 10Madalina: [wmf-config]: Deploy the fawiki test safety survey to 
production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [10:32:25] godog: Do you want me to just rsyslog-rotate on all the backends that still have server.log.1 open? Fix rolled out today should take when tomorrow's logrotate runs, but we might want to sort it now... [10:32:57] 36 nodes [10:33:05] (36) ms-be[2028-2030,2032,2037-2038,2040,2046-2047,2050-2051,2053-2054,2057,2060,2063,2065].codfw.wmnet,ms-be[1028-1031,1035-1038,1042,1046,1048-1049,1054,1058-1060,1065,1067].eqiad.wmnet,thanos-be2001.codfw.wmnet [10:33:44] (03PS1) 10Jbond: wikimedia.org: update 0365 txt validation [dns] - 10https://gerrit.wikimedia.org/r/763706 (https://phabricator.wikimedia.org/T300076) [10:34:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21013 and previous config saved to /var/cache/conftool/dbconfig/20220218-103449-kormat.json [10:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:27] Emperor: mmhh yeah I think sorting it out now is a good idea [10:35:57] will do; looks like only thanos-be2001 is getting fullish on / (74%) [10:36:13] (03CR) 10Jbond: [C: 03+2] wikimedia.org: update 0365 txt validation [dns] - 10https://gerrit.wikimedia.org/r/763706 (https://phabricator.wikimedia.org/T300076) (owner: 10Jbond) [10:37:28] !log rsyslog-rotate to clear held-open server.log.1 (ms-be[2028-2030,2032,2037-2038,2040,2046-2047,2050-2051,2053-2054,2057,2060,2063,2065].codfw.wmnet,ms-be[1028-1031,1035-1038,1042,1046,1048-1049,1054,1058-1060,1065,1067].eqiad.wmnet,thanos-be2001.codfw.wmnet) T301657 [10:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:33] T301657: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 [10:37:45] (03PS1) 10Cathal Mooney: Add new Eqiad row E-F elastic servers 
to puppet with basic role [puppet] - 10https://gerrit.wikimedia.org/r/763708 (https://phabricator.wikimedia.org/T299609) [10:38:15] Emperor: ack, sounds good, yeah we can trim server.log.1 more if needed [10:38:42] done, and confirmed nothing holding it open now. [10:39:15] neat, yeah that should be the nail in the coffin [10:39:48] (03CR) 10David Caro: role::mariadb: remove unused role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:40:15] rsyslog should compress server.log.1 on thanos-be2001 overnight, so I think we're OK there? 14G available in / [10:40:24] Hm, though server.log.1 is 31G [10:41:02] yeah I think it is fine to truncate it even more [10:41:39] 10G? that's a bit brutal, but I'd rather not have it fill up over the weekend [10:42:39] sure why not, yeah I'd rather not have that either [10:42:44] (03CR) 10David Caro: backy2: on Bullseye, hack around a silly package name mismatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [10:43:48] !log truncate swift/server.log.1 to 10G on thanos-be2001 T301657 [10:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:54] T301657: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 [10:45:01] I go, and 'tis done [10:46:26] (03CR) 10JMeybohm: [C: 03+2] Drop unused ports from istio-ingressgateway service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/763705 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:49:37] neato [10:49:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P21014 and previous config saved to /var/cache/conftool/dbconfig/20220218-104954-kormat.json [10:49:56] (03Merged) 10jenkins-bot: Drop unused ports from 
istio-ingressgateway service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/763705 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [10:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:20] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10jbond) This is now in place, please re-open if you still see issues ` lang=shell $ dig txt wikimedia.org @ns0.wikimedia.org... [10:50:47] !log installing zsh security updates on stretch [10:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21015 and previous config saved to /var/cache/conftool/dbconfig/20220218-110459-kormat.json [11:05:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:05:02] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:05] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21016 and previous config saved to /var/cache/conftool/dbconfig/20220218-110506-kormat.json [11:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33855/console" [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [11:07:29] (03CR) 10Jbond: [C: 03+2] Rakefile: Add sperate rake jobs for static/unit tests [puppet] - 10https://gerrit.wikimedia.org/r/763597 (owner: 10Jbond) [11:07:42] (03CR) 10Majavah: remove clush modules, profiles and roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762829 (https://phabricator.wikimedia.org/T298191) (owner: 10Majavah) [11:25:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21017 and previous config saved to /var/cache/conftool/dbconfig/20220218-112558-kormat.json [11:26:00] (03CR) 10Jbond: [V: 03+1] "TBH I'm not a fan of including external/third party modules in the main mono repo. we already have some examples of this e.g. concat, std" [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [11:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:05] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:27:52] (03CR) 10Vgutierrez: [C: 04-1] R:varnish:instance: Add genral public cloud rate limiting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [11:27:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1017.eqiad.wmnet with OS buster [11:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:02] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1017.eqiad.wmnet with OS buster 
[11:31:55] (03PS7) 10Jbond: R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [11:32:15] (03PS8) 10Jbond: R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [11:32:33] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [11:36:09] (03CR) 10Vgutierrez: "CR is currently missing (text|upload)_envoy roles" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [11:37:42] (03CR) 10Ayounsi: "Discussed it over IRC, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/763708 (https://phabricator.wikimedia.org/T299609) (owner: 10Cathal Mooney) [11:37:46] (03CR) 10Ayounsi: [C: 03+1] Add new Eqiad row E-F elastic servers to puppet with basic role [puppet] - 10https://gerrit.wikimedia.org/r/763708 (https://phabricator.wikimedia.org/T299609) (owner: 10Cathal Mooney) [11:41:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P21018 and previous config saved to /var/cache/conftool/dbconfig/20220218-114103-kormat.json [11:41:05] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1017.eqiad.wmnet with reason: host reimage [11:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:42:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform 
Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10hnowlan) >>! In T294372#7719367, @Cmjohnson wrote: > @hnowlan I am having issues with p... [11:43:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1017.eqiad.wmnet with reason: host reimage [11:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:15] (03CR) 10Cathal Mooney: [C: 03+2] Add new Eqiad row E-F elastic servers to puppet with basic role [puppet] - 10https://gerrit.wikimedia.org/r/763708 (https://phabricator.wikimedia.org/T299609) (owner: 10Cathal Mooney) [11:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:48:03] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10hnowlan) [11:54:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1017.eqiad.wmnet with OS buster [11:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1017.eqiad.wmnet with OS buster completed: - ganeti1017 (**PASS**)... 
[11:56:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P21019 and previous config saved to /var/cache/conftool/dbconfig/20220218-115608-kormat.json [11:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:29] (03CR) 10Vgutierrez: [C: 03+1] "LGTM but I could use a double check by @bblack :)" [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [11:58:43] (03PS33) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [11:58:54] 10SRE, 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10MMandere) 05Open→03Resolved a:03MMandere Thank you @ArielGlenn for assisting with this task to the very end. I will go ahead and mark the task as resol... 
[12:05:11] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:59] (03PS1) 10JMeybohm: Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) [12:08:07] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:08:10] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:32] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:11:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:13] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21020 and previous config saved to /var/cache/conftool/dbconfig/20220218-121113-kormat.json [12:11:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:11:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:11:18] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:22] !log 
kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:26] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300774)', diff saved to https://phabricator.wikimedia.org/P21021 and previous config saved to /var/cache/conftool/dbconfig/20220218-121126-kormat.json [12:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:05] 10SRE, 10Traffic, 10WMF-Legal, 10Performance-Team (Radar), 10Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (10Peter) This is also interesting, the coming Chrome prefetch proxy on Android is roll... 
[12:12:41] (03PS13) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [12:17:22] (03CR) 10Jbond: "updated fresh pcc running" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [12:20:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 54): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33857/console" [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [12:20:35] (03PS9) 10Jbond: R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [12:20:37] (03PS14) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [12:22:10] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [12:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [12:25:08] (03CR) 10Muehlenhoff: Add nagios_core & mailalias_core modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [12:27:53] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300774)', diff saved to https://phabricator.wikimedia.org/P21022 and previous config saved to /var/cache/conftool/dbconfig/20220218-122753-kormat.json [12:27:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:01] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:31:16] (03CR) 10CDanis: [C: 03+2] Use hash page in IRC message [software/klaxon] - 10https://gerrit.wikimedia.org/r/763643 (owner: 10Legoktm) [12:37:30] !log aborrero@apt1001:~$ sudo -i reprepro -C main includedeb buster-wikimedia /home/aborrero/prometheus-openstack-exporter_0.1.4-2_all.deb (T302050) [12:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:36] T302050: prometheus-openstack-exporter in Bullseye - https://phabricator.wikimedia.org/T302050 [12:37:51] !log aborrero@apt1001:~$ sudo -i reprepro -C main includedeb bullseye-wikimedia /home/aborrero/prometheus-openstack-exporter_0.1.4-2_all.deb (T302050) [12:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21023 and previous config saved to /var/cache/conftool/dbconfig/20220218-124258-kormat.json [12:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:43] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:53:40] (03PS34) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [12:58:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P21024 and previous config saved to /var/cache/conftool/dbconfig/20220218-125802-kormat.json [12:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:16] 10SRE, 10SRE-Access-Requests: Requesting access to the production cluster 
as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10ArielGlenn) Thanks for closing, and Hannah tested access today and it works like a charm :-) [13:02:13] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye [13:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed... [13:03:11] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:12:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1012.eqiad.wmnet with OS buster [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1012.eqiad.wmnet with OS buster [13:13:07] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300774)', diff saved to https://phabricator.wikimedia.org/P21025 and previous config saved to /var/cache/conftool/dbconfig/20220218-131307-kormat.json [13:13:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:13:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:13:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:13] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:13:15] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21026 and previous config saved to /var/cache/conftool/dbconfig/20220218-131315-kormat.json [13:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:45] (03PS1) 10Ayounsi: Add drmrs routers to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/763748 [13:26:05] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1012.eqiad.wmnet with reason: host reimage [13:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1012.eqiad.wmnet with reason: host reimage [13:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21027 and previous config saved to /var/cache/conftool/dbconfig/20220218-133003-kormat.json [13:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:31:44] !log restarting blazegraph on wdqs1012 (jvm stuck for 8hours) [13:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:49] (03CR) 10Cathal Mooney: [C: 03+1] "Nice work!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/763696 (owner: 10Ayounsi) [13:41:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1012.eqiad.wmnet with OS buster [13:41:07] (03PS35) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [13:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:17] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1012.eqiad.wmnet with OS buster completed: - ganeti1012 (**PASS**)... [13:42:32] (03PS1) 10Ayounsi: Disable Junos alarms check by default [puppet] - 10https://gerrit.wikimedia.org/r/763750 [13:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:45:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21028 and previous config saved to /var/cache/conftool/dbconfig/20220218-134508-kormat.json [13:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [13:51:25] (03CR) 10Ayounsi: [C: 03+2] Fix network upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/759549 (owner: 10Ayounsi) [13:54:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [13:54:55] (03PS2) 10Ayounsi: Fix network upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/759549 [13:55:48] (03CR) 10Jbond: "All green" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:56:14] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:57:48] (Processor usage over 85%) resolved: Device scs-oe16-esams.mgmt.esams.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org [13:58:34] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 41, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:59:45] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [13:59:45] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [13:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:13] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P21029 and previous config saved to /var/cache/conftool/dbconfig/20220218-140012-kormat.json [14:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:58] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [14:00:58] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [14:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:06] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [14:01:06] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [14:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:21] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [14:02:21] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [14:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [14:03:58] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [14:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:47] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [14:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] (03CR) 10Bking: [V: 03+1] elasticsearch: upgrade deployment-prep to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763478 (https://phabricator.wikimedia.org/T301954) (owner: 10Gehel) [14:11:47] (03CR) 10Bking: [V: 03+1 C: 03+2] elasticsearch: upgrade deployment-prep to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763478 (https://phabricator.wikimedia.org/T301954) (owner: 10Gehel) [14:15:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300774)', diff saved to https://phabricator.wikimedia.org/P21030 and previous config saved to /var/cache/conftool/dbconfig/20220218-141517-kormat.json [14:15:22] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:24] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:15:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:15:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [14:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [14:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:15:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Jclark-ctr) [14:17:17] (03PS1) 10DCausse: flink-session-cluster: increase task manager mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 [14:17:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Jclark-ctr) ml-cache1001 E1 U23 ml-cache1002 E2 U23 
ml-cache1003 F1 U23 [14:18:38] (03CR) 10Bking: [C: 03+2] elasticsearch: allow using elasticsearch v6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763477 (https://phabricator.wikimedia.org/T295666) (owner: 10Gehel) [14:19:01] (03CR) 10Bking: [C: 03+2] elasticsearch: upgrade cloudelastic to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763481 (https://phabricator.wikimedia.org/T301956) (owner: 10Gehel) [14:19:13] (03PS1) 10CDanis: Revert "Disable AbuseFilter throttling on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763728 (https://phabricator.wikimedia.org/T302047) [14:19:20] (03PS2) 10CDanis: Revert "Disable AbuseFilter throttling on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763728 (https://phabricator.wikimedia.org/T302047) [14:20:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10Jclark-ctr) @MoritzMuehlenhoff @wiki_willy Racking instructions have are row A we are almost out of power in old cage. Can these be racked in the new cage Row E & F? [14:28:16] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10MoritzMuehlenhoff) >>! In T299459#7721219, @Jclark-ctr wrote: > @MoritzMuehlenhoff @wiki_willy Racking instructions have are row A we are almost out of power in old cage. Can... 
[14:29:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:29:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [14:39:38] Is there a problem with the WMF oauth providers? [14:40:00] My tools which use oauth have stopped working, getting failures in the oauth flow. [14:40:31] hey roy649, which kind of failures? [14:40:48] roy649: your tools don't happen to send a user-agent of `python-requests/2.25.1`, do they? [14:41:03] ftr, my tool, https://massmailer.toolforge.org/, works fine, with an user-agent following the policy [14:41:04] Um...... [14:41:48] did they stop working around 12 hours ago? [14:42:10] I'm not calling requests directly, but I wouldn't be surprised if the python-social-auth module does, and doesn't set the UA. [14:42:28] I don't know exactly when they started failing [14:42:34] I know about the python-requests thing [14:42:36] roy649: can you try setting `requests.utils.default_user_agent = lambda: "YourOwnAgent"` in your bootstrapping code? [14:42:39] that's not impossible [14:43:10] unless python-social-auth passes an explicit user-agent header, it should force it to use YourOwnAgent as your UA [14:43:35] (03CR) 10DCausse: [C: 03+1] "lgtm!" 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/763485 (https://phabricator.wikimedia.org/T299226) (owner: 10EJoseph) [14:43:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [14:43:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [14:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] ugh, looks like that's it [14:47:00] https://github.com/python-social-auth/social-core/blob/36f25c4218f17c1c74fbb22881716d3c4bdbd379/social_core/backends/mediawiki.py#L87 [14:49:22] roy649: okay, so what we did last night probably breaks a lot of tools. I'll roll that back soon too [14:49:46] yeah, I gather there was a bit of a fire drill. [14:50:10] and I'll also probably file a bug against python-social-auth to make it easier or ideally even required to set a user-agent other than the default [14:50:18] yeah, that too. [14:50:29] thanks for the report :) [14:50:40] except, sadness, python-social-auth doesn't seem to have much in the way of maintenance [14:50:43] these days [14:52:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Jclark-ctr) @elukey Could these be racked in 10g racks? 
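Note on the user-agent exchange above: the failures came from a client sending the library default `python-requests/2.25.1` user-agent. A minimal sketch of sending a descriptive one instead, using only the standard library so it runs without extra packages. The tool name, version, contact details and the `build_user_agent` / `make_request` helpers are hypothetical, and the `name/version (contact) library` layout is an assumption about what a policy-compliant string looks like, not something stated in the log.

```python
import urllib.request


def build_user_agent(tool, version, contact, library=None):
    """Build a descriptive User-Agent: 'tool/version (contact) [library]'."""
    ua = f"{tool}/{version} ({contact})"
    if library:
        ua = f"{ua} {library}"
    return ua


# Hypothetical values; replace with the real tool name and a reachable contact.
UA = build_user_agent(
    "example-tool",
    "0.1",
    "https://example.toolforge.org/; tool-owner@example.org",
    library="python-urllib",
)


def make_request(url, user_agent=UA):
    """Return a Request object carrying an explicit User-Agent header.

    Nothing is sent here; pass the result to urllib.request.urlopen() to
    actually perform the call.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = make_request("https://meta.wikimedia.org/w/api.php")
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header("User-agent"))
```

With the requests library, the same string could be set once on a session via `session.headers["User-Agent"] = UA`, which avoids patching `requests.utils.default_user_agent` when the calling code controls the session. When a third-party module creates its own requests calls, the override suggested in the log may still be needed.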
[14:53:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1009.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [14:53:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1009.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [14:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10elukey) >>! In T294949#7721267, @Jclark-ctr wrote: > @elukey Could these be racked in 10g racks? Hi John! These hosts don't need 10g, so they can be... [14:57:27] (03PS1) 10JMeybohm: install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) [14:57:29] (03PS1) 10JMeybohm: Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) [14:57:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Jclark-ctr) @elukey we are at our limit for power in our old cage and these have 10g cards in them and our new cage will be live any day now so it cou... 
[14:58:14] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:58:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:20] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300774)', diff saved to https://phabricator.wikimedia.org/P21031 and previous config saved to /var/cache/conftool/dbconfig/20220218-145820-kormat.json [14:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:58:46] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [14:58:58] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1009. 
This is the last one :-) [15:03:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:55] (03PS1) 10CDanis: Revert "block a bad UA" [puppet] - 10https://gerrit.wikimedia.org/r/763729 (https://phabricator.wikimedia.org/T302047) [15:06:11] (03PS2) 10CDanis: Revert "block a bad UA" [puppet] - 10https://gerrit.wikimedia.org/r/763729 (https://phabricator.wikimedia.org/T302047) [15:09:06] (03CR) 10CDanis: [C: 03+2] Revert "block a bad UA" [puppet] - 10https://gerrit.wikimedia.org/r/763729 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [15:11:36] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300774)', diff saved to https://phabricator.wikimedia.org/P21032 and previous config saved to /var/cache/conftool/dbconfig/20220218-151136-kormat.json [15:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:42] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:11:57] (03CR) 10CDanis: [C: 03+2] Revert "Disable AbuseFilter throttling on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763728 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [15:13:06] (03Merged) 10jenkins-bot: Revert "Disable AbuseFilter throttling on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763728 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [15:13:53] roy649: I think your tool should be working now [15:13:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) [15:14:22] yup, thanks! 
[15:14:27] !log cdanis@deploy1002 Synchronized wmf-config/InitialiseSettings.php: re-enable AbuseFilter throttling on enwiki 808d82dcd T302047 (duration: 00m 49s) [15:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-cache2001.mgmt.codfw.wmnet with reboot policy FORCED [15:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:38] cdanis: thanks for the quick response, and I totally understand about putting out fires. [15:16:39] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [15:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:46] cdanis: ref T302047 I'm actively watching [15:17:08] TheresNoTime: ah perfect ty <3 [15:17:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [15:18:05] (03PS1) 10CDanis: Revert "enable wmgEmergencyCaptcha for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763730 (https://phabricator.wikimedia.org/T302047) [15:18:17] (03PS2) 10CDanis: Revert "enable wmgEmergencyCaptcha for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763730 (https://phabricator.wikimedia.org/T302047) [15:18:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:17] (03CR) 10CDanis: [C: 03+2] Revert "enable wmgEmergencyCaptcha for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763730 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [15:19:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/mwdebug: apply [15:19:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:01] (03Merged) 10jenkins-bot: Revert "enable wmgEmergencyCaptcha for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763730 (https://phabricator.wikimedia.org/T302047) (owner: 10CDanis) [15:20:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:21] !log cdanis@deploy1002 Synchronized wmf-config/InitialiseSettings.php: disable wmgEmergencyCaptcha for enwiki 286f99886 T302047 (duration: 00m 49s) [15:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:02] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:22:06] (03PS8) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [15:24:45] cdanis: mind a quick PM? 
[15:25:08] TheresNoTime: please [15:25:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:43] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21033 and previous config saved to /var/cache/conftool/dbconfig/20220218-152641-kormat.json [15:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:27:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-cache2001.mgmt.codfw.wmnet with reboot policy FORCED [15:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-cache2002.mgmt.codfw.wmnet with reboot policy FORCED [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] (03CR) 10Herron: [C: 03+1] am: remove Icinga/ prefix and add 'source' label [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763459 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [15:38:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [15:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:04] (03CR) 
10CDanis: [C: 03+2] Unbreak tests by pinning itsdangerous [software/klaxon] - 10https://gerrit.wikimedia.org/r/763649 (owner: 10Legoktm) [15:41:34] (03Merged) 10jenkins-bot: Unbreak tests by pinning itsdangerous [software/klaxon] - 10https://gerrit.wikimedia.org/r/763649 (owner: 10Legoktm) [15:41:36] (03Merged) 10jenkins-bot: Use hash page in IRC message [software/klaxon] - 10https://gerrit.wikimedia.org/r/763643 (owner: 10Legoktm) [15:41:47] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P21034 and previous config saved to /var/cache/conftool/dbconfig/20220218-154147-kormat.json [15:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:07] (03CR) 10Herron: [C: 03+1] thanos: add relabels to rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [15:49:34] (03PS1) 10Cmjohnson: Updating netboot.cfg for new restbase hosts to use different recipe [puppet] - 10https://gerrit.wikimedia.org/r/763764 (https://phabricator.wikimedia.org/T294372) [15:50:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-cache2002.mgmt.codfw.wmnet with reboot policy FORCED [15:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:40] (03PS1) 10Cathal Mooney: Add new Eqiad private and analytics subnets to dhcp.conf [puppet] - 10https://gerrit.wikimedia.org/r/763766 (https://phabricator.wikimedia.org/T299758) [15:52:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-cache2003.mgmt.codfw.wmnet with reboot policy FORCED [15:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:24] (03CR) 10Cmjohnson: [C: 03+2] Updating netboot.cfg for new restbase hosts to use different recipe [puppet] - 10https://gerrit.wikimedia.org/r/763764 
(https://phabricator.wikimedia.org/T294372) (owner: 10Cmjohnson) [15:55:09] (03CR) 10RhinosF1: [C: 04-1] [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [15:55:24] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:56:50] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye [15:56:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300774)', diff saved to https://phabricator.wikimedia.org/P21035 and previous config saved to /var/cache/conftool/dbconfig/20220218-155652-kormat.json [15:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [15:56:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed... 
[15:56:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [15:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:59] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:57:00] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300774)', diff saved to https://phabricator.wikimedia.org/P21036 and previous config saved to /var/cache/conftool/dbconfig/20220218-155659-kormat.json [15:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:19] (03CR) 10Herron: [C: 03+1] prometheus: inject 'source' label to alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [15:59:07] (03CR) 10Jbond: [C: 03+1] Add new Eqiad private and analytics subnets to dhcp.conf [puppet] - 10https://gerrit.wikimedia.org/r/763766 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:59:13] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300774)', diff saved to https://phabricator.wikimedia.org/P21037 and previous config saved to /var/cache/conftool/dbconfig/20220218-155912-kormat.json [15:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:03:42] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10Gehel) 
05Resolved→03Open This same problem is happening again on `deployment-elastic07`. This needs to be further investigated. [16:05:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:07:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-cache2003.mgmt.codfw.wmnet with reboot policy FORCED [16:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-cache2001.mgmt.codfw.wmnet with reboot policy FORCED [16:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:32] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) 05Open→03Resolved @MoritzMuehlenhoff All completed, resolving this task [16:11:41] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [16:13:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:13:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-cache2001.mgmt.codfw.wmnet with reboot policy FORCED [16:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21038 and previous config saved to 
/var/cache/conftool/dbconfig/20220218-161417-kormat.json [16:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:49] (03PS1) 10Cmjohnson: fixing netboot.cfg for restbase servers, included existing servers in recipe [puppet] - 10https://gerrit.wikimedia.org/r/763770 (https://phabricator.wikimedia.org/T294372) [16:15:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:16:37] (03CR) 10Cmjohnson: [C: 03+2] fixing netboot.cfg for restbase servers, included existing servers in recipe [puppet] - 10https://gerrit.wikimedia.org/r/763770 (https://phabricator.wikimedia.org/T294372) (owner: 10Cmjohnson) [16:18:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:22:53] (03PS1) 10Accraze: ml-services: add etwiki and fawiki editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/763773 (https://phabricator.wikimedia.org/T301415) [16:23:00] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:23:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2019.mgmt.codfw.wmnet with reboot policy FORCED [16:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:22] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:28:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) [16:29:22] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P21039 and previous config saved to /var/cache/conftool/dbconfig/20220218-162922-kormat.json [16:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:51] (03CR) 10Cathal Mooney: [C: 03+2] Add new Eqiad private and analytics subnets to dhcp.conf [puppet] - 10https://gerrit.wikimedia.org/r/763766 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:30:22] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:34:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2019.mgmt.codfw.wmnet with reboot policy FORCED [16:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:14] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [16:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [16:34:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2020.mgmt.codfw.wmnet with reboot policy FORCED [16:34:38] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:54] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:38:19] (03PS1) 10C. Scott Ananian: Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) [16:40:34] (03CR) 10Subramanya Sastry: [C: 03+1] Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) (owner: 10C. Scott Ananian) [16:42:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2020.mgmt.codfw.wmnet with reboot policy FORCED [16:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300774)', diff saved to https://phabricator.wikimedia.org/P21040 and previous config saved to /var/cache/conftool/dbconfig/20220218-164427-kormat.json [16:44:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:44:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T300774)', diff saved to https://phabricator.wikimedia.org/P21041 and previous config saved to /var/cache/conftool/dbconfig/20220218-164434-kormat.json [16:44:45] (03PS3) 10Cwhite: update grafana-image-renderer to 3.4.0 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) [16:44:47] (03PS3) 10Cwhite: update grafana-simple-json-datasource to 1.4.2 [debs/grafana-plugins] 
- 10https://gerrit.wikimedia.org/r/763337 (https://phabricator.wikimedia.org/T282863) [16:44:49] (03PS2) 10Cwhite: use grafana api for worldmap plugin artifact [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763618 (https://phabricator.wikimedia.org/T282863) [16:44:51] (03PS2) 10Cwhite: Update changelog and build instructions [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763619 [16:45:33] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2021.mgmt.codfw.wmnet with reboot policy FORCED [16:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:53:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2021.mgmt.codfw.wmnet with reboot policy FORCED [16:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2022.mgmt.codfw.wmnet with reboot policy FORCED [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:55:40] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [16:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:04] (03CR) 10Jforrester: [C: 03+1] Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) (owner: 10C. Scott Ananian) [16:56:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300774)', diff saved to https://phabricator.wikimedia.org/P21042 and previous config saved to /var/cache/conftool/dbconfig/20220218-170125-kormat.json [17:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:02:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2022.mgmt.codfw.wmnet with reboot policy FORCED [17:02:33] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:52] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10dancy) [17:03:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:03:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:17] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:12] (03CR) 10Bearloga: [C: 03+1] "Yup, makes sense. Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:12:41] (03CR) 10Bearloga: [C: 03+1] "Original goal for these was to be able to spin up instances on Cloud VPS for one-off computing tasks/projects (hence the various roles whi" [puppet] - 10https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:16:31] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P21043 and previous config saved to /var/cache/conftool/dbconfig/20220218-171630-kormat.json [17:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:50] (03Abandoned) 10Bearloga: Continue decommissioning legacy Discovery dashboards [puppet] - 10https://gerrit.wikimedia.org/r/739564 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga) [17:22:25] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: Increase timeout for check_flavor_properties [puppet] - 10https://gerrit.wikimedia.org/r/763640 (owner: 10Andrew Bogott) [17:23:24] (03CR) 10ArielGlenn: [C: 03+2] do flow dumps in multiple pieces and concat them together [dumps] - 10https://gerrit.wikimedia.org/r/762127 (https://phabricator.wikimedia.org/T300760) (owner: 10ArielGlenn) [17:24:40] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:24:42] (03Merged) 10jenkins-bot: do flow dumps in multiple pieces and concat them together [dumps] - 10https://gerrit.wikimedia.org/r/762127 (https://phabricator.wikimedia.org/T300760) (owner: 10ArielGlenn) [17:25:07] (03CR) 10Bearloga: [C: 
04-1] "Actually, modules/r_lang/files/biocLite.R should also be deleted as part of this" [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:26:18] !log ariel@deploy1002 Started deploy [dumps/dumps@f7c16d4]: noop script, dup jobname check for api jobs, do flow dumps in pieces like stubs [17:26:21] !log ariel@deploy1002 Finished deploy [dumps/dumps@f7c16d4]: noop script, dup jobname check for api jobs, do flow dumps in pieces like stubs (duration: 00m 03s) [17:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:04] (03CR) 10Jdlrobson: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [17:31:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P21044 and previous config saved to /var/cache/conftool/dbconfig/20220218-173135-kormat.json [17:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] (03PS1) 10Andrew Bogott: cloudcontrol1004: override profile::openstack::eqiad1::keystone::wsgi_server: [puppet] - 10https://gerrit.wikimedia.org/r/763791 (https://phabricator.wikimedia.org/T281276) [17:34:39] (03PS1) 10Bearloga: discovery_dashboards: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) [17:36:06] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:37:09] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrol1004: override profile::openstack::eqiad1::keystone::wsgi_server: [puppet] - 10https://gerrit.wikimedia.org/r/763791 
(https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [17:38:26] (03PS2) 10Bearloga: r_lang::bioc: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:39:15] (03CR) 10Bearloga: [C: 03+1] "Okay, now we're good to go" [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:40:28] (03CR) 10RhinosF1: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [17:46:21] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10mpopov) @dcaro Thanks for waiting for me and for following up on this! Left comments and +1s on the patches, also uploaded a related clean-up patch for your... [17:46:40] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300774)', diff saved to https://phabricator.wikimedia.org/P21045 and previous config saved to /var/cache/conftool/dbconfig/20220218-174640-kormat.json [17:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:46] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:48:06] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:51:08] (03CR) 10Jdlrobson: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [17:56:41] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 
(protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:06:06] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye [18:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed... [18:29:47] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Slaporte) > does it matter to you that the version changed to 3.0? [This change](https://en.wikipedia.org/w/i... [18:29:54] 10SRE, 10Observability-Metrics, 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10colewhite) Updated wikitech links where updates seemed most appropriate. 
[18:29:59] 10SRE, 10Observability-Metrics, 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10colewhite) [18:31:00] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [18:32:56] (03CR) 10Krinkle: [C: 04-1] "These are referenced in private settings and would break." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:39:21] 10SRE, 10MediaWiki-extensions-CentralAuth, 10Platform Engineering, 10TimedMediaHandler, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10brion) 05Open→03Resolved a:03brion Marking this resolved... [18:40:25] 10SRE, 10Observability-Metrics, 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10colewhite) [18:46:17] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) CS0451899 > Support, > > We recently filed CS0334187 for the installation of two of our routers. During that installation, it... 
[18:48:21] (03CR) 10Zabe: filebackend: migrate $wmfSwift* to $wmgSwift* (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:48:48] (03PS1) 10Andrew Bogott: Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/763805 (https://phabricator.wikimedia.org/T281276) [18:50:10] (03PS1) 10JHathaway: run_ci_locally.sh: add podman support [puppet] - 10https://gerrit.wikimedia.org/r/763807 [18:50:49] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763807 (owner: 10JHathaway) [18:53:14] (03CR) 10Razzi: [C: 03+2] opensearch: make curator version bullseye compatible [puppet] - 10https://gerrit.wikimedia.org/r/763587 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [19:03:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:32] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Majavah) [19:04:42] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Majavah) [19:07:14] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) a:03Dzahn [19:13:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:49] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) 
a:05dcausse→03bking [19:18:35] 10SRE, 10Icinga, 10WMF-Legal, 10observability: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) 05Open→03In progress [19:25:03] (03PS1) 10Dzahn: icinga: fix "legal check" monitoring footer HTML on en.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/763810 (https://phabricator.wikimedia.org/T302045) [19:27:08] (03CR) 10Dzahn: [C: 03+2] "fyi: I asked legal if this is still used etc and the answers are here https://phabricator.wikimedia.org/T302045#7721945 it is" [puppet] - 10https://gerrit.wikimedia.org/r/763810 (https://phabricator.wikimedia.org/T302045) (owner: 10Dzahn) [19:33:49] RECOVERY - Ensure legal html en.wp on en.wikipedia.org is OK: all html is present. https://phabricator.wikimedia.org/project/members/28/ [19:46:48] 10SRE, 10Icinga, 10WMF-Legal, 10observability, 10Patch-For-Review: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) 05In progress→03Resolved >>! In T302045#7721945, @Slaporte wrote: >> does it... [19:46:58] 10SRE, 10Icinga, 10WMF-Legal, 10observability, 10Patch-For-Review: monitoring alert: legal footer change on en.wikipedia - due to creative commons license version change - https://phabricator.wikimedia.org/T302045 (10Dzahn) {F34957333} [19:52:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:57:22] (03CR) 10Krinkle: [C: 03+1] "No, only wmfSwiftConfig. Which you already skipped. I was confused by the subject line. My bad. This is good to go. 
We'll need to do more " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:59:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:15] (03CR) 10Zabe: filebackend: migrate $wmfSwift* to $wmgSwift* (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761443 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:25:55] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10bking) After the upgrade from 6.5 to 6.8, the service wouldn't restart. Initial error (when ES's rundir is missing): `Feb 17 10:34:07 deployment-elastic07 elasticsearch[1260... [20:33:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:34:13] (03PS1) 10Razzi: analytics_cluster::datahub::opensearch: Enable syslog transport [puppet] - 10https://gerrit.wikimedia.org/r/763815 (https://phabricator.wikimedia.org/T301382) [20:48:47] (03CR) 10Dzahn: role::mariadb: remove unused role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [20:50:14] (03PS1) 10Papaul: Add ml-cache200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/763818 (https://phabricator.wikimedia.org/T299433) [20:50:23] (03PS1) 10Cwhite: profile: update graphite mediawiki grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763819 (https://phabricator.wikimedia.org/T211982) [20:50:25] (03PS1) 10Cwhite: profile: update host monitoring grafana dashboard links [puppet] - 
10https://gerrit.wikimedia.org/r/763820 (https://phabricator.wikimedia.org/T211982) [20:50:27] (03PS1) 10Cwhite: k8s: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763821 (https://phabricator.wikimedia.org/T211982) [20:50:29] (03PS1) 10Cwhite: maps: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763822 (https://phabricator.wikimedia.org/T211982) [20:50:31] (03PS1) 10Cwhite: search: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763823 (https://phabricator.wikimedia.org/T211982) [20:50:33] (03PS1) 10Cwhite: zuul: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763824 (https://phabricator.wikimedia.org/T211982) [20:50:35] (03PS1) 10Cwhite: zookeeper: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763825 (https://phabricator.wikimedia.org/T211982) [20:50:37] (03PS1) 10Cwhite: kafka: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763826 (https://phabricator.wikimedia.org/T211982) [20:50:39] (03PS1) 10Cwhite: eventlogging: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763827 (https://phabricator.wikimedia.org/T211982) [20:50:41] (03PS1) 10Cwhite: labstore: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763828 (https://phabricator.wikimedia.org/T211982) [20:50:43] (03PS1) 10Cwhite: pybal: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763829 (https://phabricator.wikimedia.org/T211982) [20:50:45] (03PS1) 10Cwhite: caches: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763830 (https://phabricator.wikimedia.org/T211982) [20:50:47] (03PS1) 10Cwhite: hadoop: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763831 (https://phabricator.wikimedia.org/T211982) [20:50:49] (03PS1) 10Cwhite: graphite: update grafana dashboards links [puppet] - 
10https://gerrit.wikimedia.org/r/763832 (https://phabricator.wikimedia.org/T211982) [20:51:21] (03CR) 10Dzahn: [C: 03+1] Add ml-cache200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/763818 (https://phabricator.wikimedia.org/T299433) (owner: 10Papaul) [20:51:58] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763815 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [20:52:16] (03CR) 10Cwhite: [C: 03+1] am: remove Icinga/ prefix and add 'source' label [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763459 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [20:52:32] (03CR) 10Papaul: [C: 03+2] Add ml-cache200[1-3] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/763818 (https://phabricator.wikimedia.org/T299433) (owner: 10Papaul) [20:53:38] (03CR) 10Cwhite: [C: 03+1] thanos: add relabels to rule [puppet] - 10https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [20:54:27] (03CR) 10Cwhite: [C: 03+1] prometheus: inject 'source' label to alerts [puppet] - 10https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [20:56:29] (03CR) 10Cwhite: [V: 03+2 C: 03+2] remove deprecated piechart plugin [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763334 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [20:56:49] (03CR) 10Cwhite: [V: 03+2 C: 03+2] update grafana-image-renderer to 3.4.0 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [20:56:54] (03CR) 10Cwhite: [V: 03+2 C: 03+2] update grafana-simple-json-datasource to 1.4.2 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763337 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [20:57:08] (03CR) 10Cwhite: [V: 03+2 C: 03+2] use grafana api for worldmap plugin artifact [debs/grafana-plugins] - 
10https://gerrit.wikimedia.org/r/763618 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [20:57:22] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Update changelog and build instructions [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763619 (owner: 10Cwhite) [20:57:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2002.codfw.wmnet with OS bullseye [20:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with O... [20:59:26] (03PS4) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [21:00:20] (03CR) 10Razzi: [C: 03+2] analytics_cluster::datahub::opensearch: Enable syslog transport [puppet] - 10https://gerrit.wikimedia.org/r/763815 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [21:04:27] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:05:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) @elukey can you please double check again the partman i am getting the error below ` Failed to retrieve the preconfig... 
[21:08:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Dzahn) @Papaul It's just missing the ".cfg" file ending. https://apt.wikimedia.org/autoinstall/partman/ [21:09:55] (03PS1) 10Dzahn: partman: add missing .cfg file extension to recipe used by ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/763837 (https://phabricator.wikimedia.org/T299433) [21:10:32] (03CR) 10Dzahn: [C: 03+2] partman: add missing .cfg file extension to recipe used by ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/763837 (https://phabricator.wikimedia.org/T299433) (owner: 10Dzahn) [21:11:25] (03CR) 10jerkins-bot: [V: 04-1] Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [21:12:08] (03CR) 10Dzahn: [V: 03+2 C: 03+2] partman: add missing .cfg file extension to recipe used by ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/763837 (https://phabricator.wikimedia.org/T299433) (owner: 10Dzahn) [21:14:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Dzahn) @Papaul deployed fix and ran puppet on apt1001. 
try again now [21:14:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) thanks [21:21:27] (03PS1) 10Razzi: opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) [21:22:14] (03CR) 10jerkins-bot: [V: 04-1] opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [21:22:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache2002.codfw.wmnet with OS bullseye [21:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bu... [21:24:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2002.codfw.wmnet with OS bullseye [21:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with O... 
[21:24:47] (03PS2) 10Razzi: opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) [21:35:28] (03CR) 10Cwhite: [C: 03+1] opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [21:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [21:43:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage [21:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage [21:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [21:55:05] (03CR) 10TAndic: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:56:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2002.codfw.wmnet with OS bullseye [21:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bu... [21:59:01] (03CR) 10Jdlrobson: [C: 03+1] "Eigyan: Removing myself from reviewers here. Feel free to re-add me if you are having problems verifying the survey is running next time r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:59:08] (03PS3) 10Razzi: opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) [21:59:33] Jdlrobson: thanks for your help [21:59:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2003.codfw.wmnet with OS bullseye [21:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2003.codfw.wmnet with O... 
[22:04:11] (03CR) 10Cwhite: opensearch: change log4j appender.ship_to_logstash.layout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:04:51] (03CR) 10Razzi: opensearch: change log4j appender.ship_to_logstash.layout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:06:06] (03PS4) 10Razzi: opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) [22:07:07] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33858/console" [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:10:37] (03CR) 10Cwhite: opensearch: change log4j appender.ship_to_logstash.layout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:17:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage [22:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:40] (03PS5) 10Razzi: opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) [22:18:25] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33859/console" [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:20:03] (03CR) 10Cwhite: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:20:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-cache2001.mgmt.codfw.wmnet with reboot policy FORCED [22:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage [22:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:03] (03CR) 10Razzi: [V: 03+1 C: 03+2] opensearch: change log4j appender.ship_to_logstash.layout [puppet] - 10https://gerrit.wikimedia.org/r/763844 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:30:06] (03PS5) 10JHathaway: Add nagios_core & mailalias_core modules [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) [22:30:40] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [22:36:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2003.codfw.wmnet with OS bullseye [22:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2003.codfw.wmnet with OS bu... 
[22:42:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [22:43:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-cache2001.mgmt.codfw.wmnet with reboot policy FORCED [22:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:46:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-cache2001.codfw.wmnet with OS bullseye [22:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2001.codfw.wmnet with O... [22:54:18] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn I heard back from Advancement and they'd like to hold off on adjusting their Ze... 
[23:04:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage [23:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:03] (03PS1) 10JHathaway: run_ci_locally.sh: merge duplicate args [puppet] - 10https://gerrit.wikimedia.org/r/763856 [23:06:27] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/763856 (owner: 10JHathaway) [23:07:09] urbanecm (or any other sysadmins) can we get editfilter throttling disabled on enwiki again? [23:07:16] botnet spam has restarted [23:07:30] filter 1184 appears to be throttled [23:08:06] have raised [23:08:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage [23:08:15] no ack yet [23:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:17] going to again annoy people, apologies [23:09:41] #page enwp botnet is back, see T302047, need to potentially re-enable the mitigations [23:10:40] hashar, James_F: ^ [23:12:57] TheresNoTime: you should be able to really page via klaxon.wikimedia.org using the 'samtar' ldap account [23:13:00] done [23:13:10] if you click wake up an sre [23:13:16] already done :) [23:13:19] yep [23:13:22] i see it now [23:13:40] 10SRE, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) @akosiaris and @jijiki how can we move forward with this? For context: - [[ https://gerrit.wikimedia.org/r/plugins/gitiles/op... 
[23:14:31] ack, online [23:14:48] jhathaway: thank you, T302047 [23:15:08] looking now, though I will probably need other folks help, as I am fairly new here [23:15:17] I think it's the first time I saw someone actually performing a manual page [23:15:57] zabe: I escalate as nicely as possible when it comes to waking people up :) [23:17:03] heh :) [23:18:27] jhathaway: you can use deploy-commands.toolforge.org to help [23:18:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2001.codfw.wmnet with OS bullseye [23:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2001.codfw.wmnet with OS bu... [23:19:10] ok, looks like cdanis rolled back the commit possibly, but I am not sure where that commit lives... [23:20:31] here 👋 [23:20:46] thank you :) [23:22:07] * thcipriani catching up [23:23:06] TheresNoTime: to confirm, we want to both re-disable AF throttling and re-enable captchas for IP edits, is that right?
[23:23:23] rzl: agreed [23:23:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) [23:23:54] as we already had the filters ready, we've managed to flatten out the attack fairly quickly https://redwarn.toolforge.org/tools/rpm/ [23:24:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) [23:25:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (10Papaul) 05Open→03Resolved @elukey complete [23:25:34] 10SRE, 10ops-codfw, 10decommission-hardware: decommission prometheus2004.codfw.wmnet - https://phabricator.wikimedia.org/T301852 (10Papaul) a:03Papaul [23:27:33] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:11] (03PS1) 10Thcipriani: Revert "Revert "Disable AbuseFilter throttling on enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763736 [23:28:58] (03PS1) 10Thcipriani: Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 [23:29:00] (03CR) 10RLazarus: [C: 03+1] Revert "Revert "Disable AbuseFilter throttling on enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763736 (owner: 10Thcipriani) [23:29:13] (03CR) 10RLazarus: [C: 03+1] Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 (owner: 10Thcipriani) [23:29:18] TheresNoTime: we are working on putting back in the CAPTCHA and Disable AbuseFilter throttling on enwiki [23:29:20] (03PS2) 10Thcipriani: Revert "Revert "Disable AbuseFilter throttling on enwiki"" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/763736 [23:29:27] (03CR) 10Thcipriani: [C: 03+2] Revert "Revert "Disable AbuseFilter throttling on enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763736 (owner: 10Thcipriani) [23:29:40] (03CR) 10Thcipriani: [C: 03+2] Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 (owner: 10Thcipriani) [23:29:41] jhathaway: thank you [23:30:17] (03Merged) 10jenkins-bot: Revert "Revert "Disable AbuseFilter throttling on enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763736 (owner: 10Thcipriani) [23:31:02] (03PS2) 10Thcipriani: Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 [23:31:05] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:31:05] (03CR) 10Thcipriani: Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 (owner: 10Thcipriani) [23:31:08] (03CR) 10Thcipriani: [C: 03+2] Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 (owner: 10Thcipriani) [23:32:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:56] 10SRE, 10ops-codfw, 10decommission-hardware: decommission prometheus2004.codfw.wmnet - https://phabricator.wikimedia.org/T301852 (10Papaul) [23:33:04] (03Merged) 10jenkins-bot: Revert "Revert "enable wmgEmergencyCaptcha for enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763737 (owner: 10Thcipriani) [23:33:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:36] 10SRE, 10ops-codfw, 
10decommission-hardware: decommission prometheus2004.codfw.wmnet - https://phabricator.wikimedia.org/T301852 (10Papaul) 05Open→03Resolved complete [23:34:30] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:763737|Revert "Revert "enable wmgEmergencyCaptcha for enwiki""]] (duration: 00m 50s) [23:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:34:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:06] TheresNoTime: both mitigations should be live, let us know if everything doesn't look right :) [23:35:36] rzl: the attack seems to have now stopped :) [23:35:47] 👍 [23:35:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10Papaul) Thanks [23:37:47] TheresNoTime: we're debating whether to leave these emergency mitigations in place over the weekend -- any opinion? [23:38:44] captchas on all logged-out edits is kind of clearly suboptimal, but our response time might be a little slower if this recurs outside of working hours -- and now this seems like an established "not just once" kind of event [23:38:57] rzl: the CAPTCHA does cause some fairly serious accessibility concerns (and people do complain!) 
- I'm not sure leaving it in place this weekend is the best idea, but appreciate y'all won't be necessarily be around [23:39:07] yeah [23:39:28] we'll definitely be *here* :) it just might not be quite as snappy, hm [23:39:50] well yesterday we held off paging for a few hours so.. I'm sure we can manage :) [23:40:08] (this time I wasn't going to wait as long, sorry!) [23:40:40] no you did that correctly, and thank you! [23:40:49] :) [23:41:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:33] TheresNoTime: how would you feel if we turned the captchas back off, in a little while, but left abusefilter unthrottled? [23:41:45] (03PS1) 10Papaul: Add kubernetes2019, kubernetes202[0-2] to site.pp and netboox [puppet] - 10https://gerrit.wikimedia.org/r/763863 (https://phabricator.wikimedia.org/T299470) [23:41:46] we're discussing what levels of emergency mitigation we're comfortable leaving in place [23:41:55] leaving wmgEmergencyCaptcha enabled for the weekend is quite heavy, but I can't see the task so I guess I don't know all the infos [23:42:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:42:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:42:15] rzl: to be honest, the abuse filters worked very effectively this time [23:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:18] leaving AF unthrottled is probably safe enough [23:42:36] and also buys time to wait for other mitigations [23:42:38] zabe: yeah it would definitely be a big hammer and we're trying not to use it, just assessing what our options are :) [23:42:46] (03CR) 10Papaul: [C: 03+2] Add kubernetes2019, kubernetes202[0-2] to site.pp and netboox [puppet] - 10https://gerrit.wikimedia.org/r/763863 (https://phabricator.wikimedia.org/T299470) (owner: 
10Papaul) [23:43:00] TheresNoTime, AntiComposite: nod [23:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:38] TheresNoTime: and just to be clear, do please press that button again 24/7 if you see this situation again [23:43:53] rzl: will do :) [23:44:06] hopefully we'll be able to work out a long-term fix that doesn't require so much hands-on reaction, but in the meantime [23:44:53] I'm not quite sure what that would be :) the UA blocking caused issues [23:45:05] but yes, thank you all for the prompt response! [23:45:07] yeah, that's a longer term design problem [23:45:32] UA? [23:45:37] user agent [23:45:39] ah [23:45:51] we can block by UA? Huh [23:46:04] Varnish can do many amazing things [23:46:11] also many not-amazing things [23:46:12] we have a bunch of options at the CDN layer, yeah :) [23:46:15] ahh [23:46:19] e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/763635 [23:47:06] For some reason I was thinking that was possible on the MW level; that makes more sense. [23:47:27] rzl: may be worth someone (else) making a note at https://phabricator.wikimedia.org/T302047 ? [23:47:33] TheresNoTime: our tentative plan is to leave things as is for the next hour, then switch wmgEmergencyCaptcha back off if conditions permit -- would you like us to check in with you before we do? 
[23:47:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2019.codfw.wmnet with OS bullseye [23:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:50] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, and 2 others: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2019.codfw.wmnet with OS bullseye [23:48:03] yeah we'll make sure that's up to date for sure [23:48:15] rzl: sounds good to me :) [23:50:06] rzl: for redundancy, could you ping GenNotability and I? [23:50:17] sounds good, will do [23:50:32] o/ [23:50:50] * AntiComposite wonders aloud why https://grafana.wikimedia.org/d/000000370/captcha-failure-rates?orgId=1&from=now-24h&to=now doesn't show more of a change with wmgEmergencyCaptcha [23:51:46] I see captchas so it's probably fine [23:51:48] ah, it only considers login and account creation [23:52:02] ah [23:53:12] I wasn't aware that the emergency captcha doesn't offer any accessible alternative... at all. That seems rather bad. [23:53:26] But I'm sure that's a discussion for elsewhere :) [23:53:33] perryprog: yeah, it really is a "break glass" thing :) [23:54:12] I'll make sure we track that as an action item for this incident -- if we had a more accessible captcha we could use it more freely in this kind of situation [23:54:44] obviously a longer-term project but one worth doing [23:54:48] rzl: want to peruse the shouty history of the CAPTCHA ticket? :P [23:54:57] (in my opinion as not the person who'd be doing the work) [23:55:02] ha [23:55:06] (T6845) [23:55:06] T6845: CAPTCHA doesn't work for people with visual impairments - https://phabricator.wikimedia.org/T6845 [23:55:11] best kind of decision-making! 
[23:55:39] too bad we don't have the NotACaptcha ready to test here :) [23:55:40] GenNotability: in my most diplomatic voice: thanks for the link! I'll make sure it's included in the tracking item, for full context [23:56:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:56:16] rzl: https://phabricator.wikimedia.org/T302138 related? [23:56:21] T302138 [23:56:22] T302138: Massive proxy-vandalism attack on frwiki - https://phabricator.wikimedia.org/T302138 [23:56:30] ehm [23:57:05] wuh oh, looks plausible [23:57:24] the edits seem very similar [23:57:35] I don't super want to wmgEmergencyCaptcha everywhere but it looks like that's a likely next move [23:57:39] it was bound to happen [23:57:52] compare https://fr.wikipedia.org/w/index.php?title=Ram%C3%A9e_(bi%C3%A8re)&curid=7827370&diff=190981990&oldid=106158343 and https://en.wikipedia.org/wiki/Special:AbuseLog/31970375 [23:58:23] jhathaway: ^ fyi [23:58:32] i just escalated the task to security given the enwp one is [23:58:39] rzl: thanks [23:59:33] rzl: I don't have permission to see the second link, https://en.wikipedia.org/wiki/Special:AbuseLog/31970375 [23:59:41] yup, that looks like them...if they're going to start jumping wikis, I could slap together a global filter for tracking? to [23:59:52] Do that for sure [23:59:52] GenNotability: please do, yes [23:59:56] jhathaway: adding "Russia has invaded Ukraine." to the end of an article