[00:03:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:09:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:13:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [00:14:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:15:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:19:20] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:25:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate 
of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:30:06] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:38:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/993467 [00:38:34] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/993467 (owner: TrainBranchBot) [00:43:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:48:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:53:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - 
kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:53:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [00:55:38] (ProbeDown) firing: (8) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:58:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:02:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/993467 (owner: TrainBranchBot) [01:04:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:09:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:09:46] PROBLEM - Check systemd state on mwmaint2002 
is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_generatecaptcha.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:16:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:18:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:21:06] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:21:30] (PS1) Superpes15: [enwiktionary] Remove the Concordance namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993457 (https://phabricator.wikimedia.org/T354813) [01:28:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:29:15] (MediaWikiHighErrorRate) firing: Elevated rate 
of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:33:31] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:38:31] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:38:54] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:41:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:46:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:47:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[01:49:57] (PS1) Superpes15: [enwikiquote] Add a draft namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) [01:52:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:54:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:04:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:09:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:10:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:15:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:31:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:36:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:37:05] (PuppetFailure) firing: Puppet has failed on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:39:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:55:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:00:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:01:20] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:23] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:33:22] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 13055 MB (5% inode=65%): /tmp 13055 MB (5% inode=65%): /var/tmp 13055 MB (5% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [03:39:40] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:55:38] (ProbeDown) firing: (8) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:21:06] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:38:54] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [05:47:43] (PS1) Marostegui: mariadb: Decommission db1134 [puppet] - https://gerrit.wikimedia.org/r/993506 (https://phabricator.wikimedia.org/T355740) [05:49:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1134.eqiad.wmnet [05:53:05] (CR) Marostegui: [C: +2] mariadb: Decommission db1134 [puppet] - https://gerrit.wikimedia.org/r/993506 (https://phabricator.wikimedia.org/T355740) (owner: Marostegui) [05:54:38] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [05:56:37] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1134.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [05:57:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1134.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [05:57:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:57:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission 
(exit_code=0) for hosts db1134.eqiad.wmnet [05:58:21] ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission db1134.eqiad.wmnet - https://phabricator.wikimedia.org/T355740 (Marostegui) a:Marostegui→None [05:58:30] ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission db1134.eqiad.wmnet - https://phabricator.wikimedia.org/T355740 (Marostegui) Ready for #dc-ops [06:03:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:03:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:03:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:03:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:04:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T355609)', diff saved to https://phabricator.wikimedia.org/P55745 and previous config saved to /var/cache/conftool/dbconfig/20240129-060400-marostegui.json [06:04:06] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:09:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T355609)', diff saved to https://phabricator.wikimedia.org/P55746 and previous config saved to /var/cache/conftool/dbconfig/20240129-060907-marostegui.json [06:09:13] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:24:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P55747 and previous config saved to /var/cache/conftool/dbconfig/20240129-062414-marostegui.json [06:33:03] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Depool db2129', diff saved to https://phabricator.wikimedia.org/P55750 and previous config saved to /var/cache/conftool/dbconfig/20240129-063302-marostegui.json [06:37:05] (PuppetFailure) firing: Puppet has failed on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:38:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P55751 and previous config saved to /var/cache/conftool/dbconfig/20240129-063836-root.json [06:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P55752 and previous config saved to /var/cache/conftool/dbconfig/20240129-063920-marostegui.json [06:53:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P55754 and previous config saved to /var/cache/conftool/dbconfig/20240129-065341-root.json [06:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T355609)', diff saved to https://phabricator.wikimedia.org/P55755 and previous config saved to /var/cache/conftool/dbconfig/20240129-065427-marostegui.json [06:54:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [06:54:35] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [06:54:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [06:54:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T355609)', diff saved to https://phabricator.wikimedia.org/P55756 and previous config saved to 
/var/cache/conftool/dbconfig/20240129-065450-marostegui.json [07:00:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T355609)', diff saved to https://phabricator.wikimedia.org/P55757 and previous config saved to /var/cache/conftool/dbconfig/20240129-065959-marostegui.json [07:00:12] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:08:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P55758 and previous config saved to /var/cache/conftool/dbconfig/20240129-070847-root.json [07:15:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P55760 and previous config saved to /var/cache/conftool/dbconfig/20240129-071506-marostegui.json [07:23:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P55761 and previous config saved to /var/cache/conftool/dbconfig/20240129-072352-root.json [07:25:46] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: AOkoth) [07:28:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:29:11] (CR) Muehlenhoff: [C: +1] "@Arnold: The current access would be fine for Superset access, but they likely need more. I've pinged the task to ask for more context." 
[puppet] - https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: AOkoth) [07:29:31] SRE, SRE-Access-Requests, Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (MoritzMuehlenhoff) @amastilovic @Ahoelzl Can you clarify what access you need specifically: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#... [07:30:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P55762 and previous config saved to /var/cache/conftool/dbconfig/20240129-073012-marostegui.json [07:33:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:38:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:38:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P55763 and previous config saved to /var/cache/conftool/dbconfig/20240129-073857-root.json [07:41:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:45:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T355609)', diff saved to https://phabricator.wikimedia.org/P55764 and previous config saved to 
/var/cache/conftool/dbconfig/20240129-074519-marostegui.json [07:45:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:45:25] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:45:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:45:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T355609)', diff saved to https://phabricator.wikimedia.org/P55765 and previous config saved to /var/cache/conftool/dbconfig/20240129-074541-marostegui.json [07:46:04] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/993489 [07:46:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:47:24] SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (SLyngshede-WMF) @Arinaigu Your account should be fixed now. Please try to login to https://wikitech.wikimedia.org/ using "Arinaigum" as your username. 
[07:48:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:50:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T355609)', diff saved to https://phabricator.wikimedia.org/P55766 and previous config saved to /var/cache/conftool/dbconfig/20240129-075044-marostegui.json [07:50:50] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [07:58:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:59:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:59:49] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/993191 (owner: JHathaway) [08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:04:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:05:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:05:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P55767 and previous config saved to /var/cache/conftool/dbconfig/20240129-080550-marostegui.json [08:07:00] (CR) Muehlenhoff: [C: +1] "Looks good, thanks. You can just go ahead and merge, the change will land in the deb package soon with the next release." 
[debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/993183 (owner: Scott French) [08:10:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:13:41] (CR) Muehlenhoff: Puppet: Routed Ganeti support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [08:15:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:17:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:20:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P55768 and previous config saved to /var/cache/conftool/dbconfig/20240129-082057-marostegui.json [08:22:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:27:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:27:18] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/993489 
(owner: Marostegui) [08:28:00] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/993489 (owner: Marostegui) [08:29:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:29:16] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:993489|Revert "ProductionServices.php: Promote pc2014"]] [08:34:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:34:18] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [08:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T355609)', diff saved to https://phabricator.wikimedia.org/P55769 and previous config saved to /var/cache/conftool/dbconfig/20240129-083603-marostegui.json [08:36:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:36:09] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:36:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T355609)', diff saved to https://phabricator.wikimedia.org/P55770 and previous config saved to /var/cache/conftool/dbconfig/20240129-083627-marostegui.json [08:38:03] (PS1) Marostegui: Revert 
"pc2014: Move it to pc2" [puppet] - https://gerrit.wikimedia.org/r/993490 [08:39:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:39:33] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:993489|Revert "ProductionServices.php: Promote pc2014"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:39:37] !log marostegui@deploy2002 marostegui: Continuing with sync [08:40:39] (CR) Marostegui: [C: +2] Revert "pc2014: Move it to pc2" [puppet] - https://gerrit.wikimedia.org/r/993490 (owner: Marostegui) [08:41:21] (PS1) Marostegui: Revert "pc2: Enable notifications on the master" [puppet] - https://gerrit.wikimedia.org/r/993491 [08:41:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T355609)', diff saved to https://phabricator.wikimedia.org/P55771 and previous config saved to /var/cache/conftool/dbconfig/20240129-084143-marostegui.json [08:41:48] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:42:42] (CR) Marostegui: [C: +2] Revert "pc2: Enable notifications on the master" [puppet] - https://gerrit.wikimedia.org/r/993491 (owner: Marostegui) [08:44:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:46:30] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:993489|Revert "ProductionServices.php: Promote pc2014"]] (duration: 17m 13s) [08:49:18] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui) [08:55:39] (ProbeDown) 
firing: (8) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P55772 and previous config saved to /var/cache/conftool/dbconfig/20240129-085649-marostegui.json [08:57:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:59:36] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Error 503, Backend fetch failed while uploading file from Internet Archive - https://phabricator.wikimedia.org/T352215 (10MatthewVernon) That was due to an incident - T356022 [09:05:37] (03CR) 10Ayounsi: "<3" [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:07:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:10:53] RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 36.34 ms [09:11:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 241, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:15] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:11:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P55773 and previous config saved to /var/cache/conftool/dbconfig/20240129-091156-marostegui.json [09:11:57] (03PS1) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [09:12:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:12:41] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:13:07] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [09:13:14] !log disable Puppet on all the ganeti servers for CR990968 deployment - T300152 [09:13:17] !log upgrading python-pymysql in S7 DB hosts to 1.0.2-2~wmf11u1 T355531 [09:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] T300152: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 [09:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:29] T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531 [09:15:11] RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:13] RECOVERY - Check whether ferm is active by checking the default input chain on 
ml-serve2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:15:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:15:50] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:17:57] !log mark for deletion and cleanup replicated thanos blocks for prometheus=ops, older than 3 months, all resolutions - T351927 [09:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:02] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [09:20:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:22:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:27:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T355609)', diff saved to https://phabricator.wikimedia.org/P55775 and previous config saved to 
/var/cache/conftool/dbconfig/20240129-092702-marostegui.json [09:27:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [09:27:09] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:27:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:27:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [09:27:24] 10SRE, 10Wikimedia-Incident: 2024-01-28 (UTC) - Error 503: Our servers are currently under maintenance or experiencing a technical problem - https://phabricator.wikimedia.org/T356022 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving as services have been stable since the last update. This outage wa... [09:27:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T355609)', diff saved to https://phabricator.wikimedia.org/P55776 and previous config saved to /var/cache/conftool/dbconfig/20240129-092724-marostegui.json [09:29:20] (03PS3) 10Slyngshede: D:service::docker Run Docker prune on pull. [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) [09:30:33] (03PS4) 10Slyngshede: D:service::docker Run Docker prune on pull. [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) [09:31:55] (03CR) 10Slyngshede: D:service::docker Run Docker prune on pull. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/991353 (https://phabricator.wikimedia.org/T321851) (owner: 10Slyngshede) [09:32:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:32:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T355609)', diff saved to https://phabricator.wikimedia.org/P55777 and previous config saved to /var/cache/conftool/dbconfig/20240129-093216-marostegui.json [09:32:22] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:33:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:37:00] (03CR) 10Ayounsi: [C: 03+2] Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:38:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:38:54] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:40:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:41:23] (03PS1) 10Filippo Giunchedi: sre: move MediaWikiEditFailures alert to global [alerts] - 10https://gerrit.wikimedia.org/r/993661 (https://phabricator.wikimedia.org/T350597) [09:45:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:45:50] PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:20] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P55778 and previous config saved to /var/cache/conftool/dbconfig/20240129-094722-marostegui.json [09:51:02] RECOVERY - Disk space on ms-be1068 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1068&var-datasource=eqiad+prometheus/ops [09:52:55] 10Puppet, 10Wikidata, 10wmde-wikidata-tech, 10Technical-Debt, 10Wikidata Analytics (Kanban): Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072 (10Manuel) [09:53:39] (03PS1) 10Muehlenhoff: ganeti: Create /var/lib/ganeti/rapi in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/993662 (https://phabricator.wikimedia.org/T300152) [09:54:07] (03PS2) 10ArielGlenn: sql/xml dumps: add role for 
helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [09:54:49] (03CR) 10CI reject: [V: 04-1] ganeti: Create /var/lib/ganeti/rapi in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/993662 (https://phabricator.wikimedia.org/T300152) (owner: 10Muehlenhoff) [09:54:56] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Disk (sdl) failed in ms-be1068 - https://phabricator.wikimedia.org/T356033 (10MatthewVernon) [09:55:16] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [09:55:41] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Disk (sdl) failed in ms-be1068 - https://phabricator.wikimedia.org/T356033 (10MatthewVernon) p:05Triage→03High [09:56:17] (03PS2) 10Muehlenhoff: ganeti: Create /var/lib/ganeti/rapi in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/993662 (https://phabricator.wikimedia.org/T300152) [09:56:28] !log enable Puppet on all the ganeti servers for CR990968 deployment - T300152 [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:34] T300152: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 [09:57:42] (03CR) 10Ayounsi: [C: 03+1] ganeti: Create /var/lib/ganeti/rapi in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/993662 (https://phabricator.wikimedia.org/T300152) (owner: 10Muehlenhoff) [10:00:58] !log upload prometheus-ganeti-exporter 0.3+deb12u1 to apt.wikimedia.org T300152 [10:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P55779 and previous config saved to /var/cache/conftool/dbconfig/20240129-100229-marostegui.json [10:04:54] PROBLEM - 
Check systemd state on ganeti2033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:05:24] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:07:14] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Create /var/lib/ganeti/rapi in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/993662 (https://phabricator.wikimedia.org/T300152) (owner: 10Muehlenhoff) [10:08:41] (03PS1) 10MVernon: swift: remove drained ms-be20[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/993664 (https://phabricator.wikimedia.org/T353149) [10:09:50] (03CR) 10CI reject: [V: 04-1] swift: remove drained ms-be20[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/993664 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [10:10:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:12:20] (03PS2) 10MVernon: swift: remove drained ms-be20[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/993664 (https://phabricator.wikimedia.org/T353149) [10:12:24] (03PS1) 10Muehlenhoff: ganeti/rapi: Relax permissions for rapi directory [puppet] - 10https://gerrit.wikimedia.org/r/993665 [10:13:53] (03CR) 10Ayounsi: [C: 03+1] ganeti/rapi: Relax permissions for rapi directory [puppet] - 10https://gerrit.wikimedia.org/r/993665 (owner: 10Muehlenhoff) 
[10:15:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:17:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T355609)', diff saved to https://phabricator.wikimedia.org/P55780 and previous config saved to /var/cache/conftool/dbconfig/20240129-101735-marostegui.json [10:17:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:17:41] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:17:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1213.eqiad.wmnet with reason: Maintenance [10:17:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213:3316 (T355609)', diff saved to https://phabricator.wikimedia.org/P55781 and previous config saved to /var/cache/conftool/dbconfig/20240129-101757-marostegui.json [10:18:14] (03CR) 10Muehlenhoff: [C: 03+2] ganeti/rapi: Relax permissions for rapi directory [puppet] - 10https://gerrit.wikimedia.org/r/993665 (owner: 10Muehlenhoff) [10:19:26] RECOVERY - Check systemd state on ms-be1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:57] (03PS3) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [10:20:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:21:07] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps 
[puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [10:23:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:24:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T355609)', diff saved to https://phabricator.wikimedia.org/P55782 and previous config saved to /var/cache/conftool/dbconfig/20240129-102414-marostegui.json [10:24:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:25:27] (03CR) 10Marostegui: [C: 03+1] swift: remove drained ms-be20[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/993664 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [10:25:50] (03CR) 10MVernon: [C: 03+2] swift: remove drained ms-be20[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/993664 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [10:28:00] (03PS1) 10Btullis: Add the wmde instance to cumin A:analytics-airflow alias [puppet] - 10https://gerrit.wikimedia.org/r/993667 (https://phabricator.wikimedia.org/T340648) [10:28:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:29:46] (03CR) 10Ayounsi: [C: 03+2] Spicerack: Add support for routed Ganeti [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:31:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:32:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/993667 (https://phabricator.wikimedia.org/T340648) (owner: 10Btullis) [10:32:29] PROBLEM - Check systemd state on ganeti2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:11] (03PS2) 10Muehlenhoff: Remove obsolete Hiera entries for Ganeti PKI support [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) [10:34:55] (03PS2) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [10:35:41] (03PS3) 10Slyngshede: P:debmonitor::server_package install Debmonitor from package. [puppet] - 10https://gerrit.wikimedia.org/r/993086 [10:35:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [10:36:05] (03CR) 10CI reject: [V: 04-1] mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:36:44] (03Merged) 10jenkins-bot: Spicerack: Add support for routed Ganeti [software/spicerack] - 10https://gerrit.wikimedia.org/r/991325 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:36:52] (03CR) 10Muehlenhoff: "I don't think we should rename the role, this is already covered by a Hiera option?" 
[puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [10:37:03] (03CR) 10Btullis: [C: 03+2] Add the wmde instance to cumin A:analytics-airflow alias [puppet] - 10https://gerrit.wikimedia.org/r/993667 (https://phabricator.wikimedia.org/T340648) (owner: 10Btullis) [10:37:05] (PuppetFailure) firing: Puppet has failed on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:37:37] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:38:24] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [10:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P55783 and previous config saved to /var/cache/conftool/dbconfig/20240129-103920-marostegui.json [10:39:43] (03PS3) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [10:43:48] (03CR) 10Slyngshede: "Okay, seemed a lot cleaner to just do a new role and remove the old one later, but I'm good either way." [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [10:44:14] (03PS4) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [10:44:28] (03CR) 10Clément Goubert: [C: 03+1] sre: move MediaWikiEditFailures alert to global [alerts] - 10https://gerrit.wikimedia.org/r/993661 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [10:44:51] (03CR) 10Ayounsi: [C: 03+1] "lgtm, nothing seems to use that hiera key anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [10:45:23] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [10:45:44] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries for Ganeti PKI support [puppet] - 10https://gerrit.wikimedia.org/r/993099 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [10:46:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:47:06] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-airflow1007.eqiad.wmnet with OS bullseye [10:47:07] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [10:47:12] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [10:49:02] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move MediaWikiEditFailures alert to global [alerts] - 10https://gerrit.wikimedia.org/r/993661 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [10:49:49] (03CR) 10Muehlenhoff: "Yeah, let's stick with the role as-is and re-use the existing OS bookworm conditional, if we rename the role this is quite disruptive in g" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [10:50:14] (03PS1) 10Ayounsi: Add routed ganeti VIP A record [dns] - 10https://gerrit.wikimedia.org/r/993669 (https://phabricator.wikimedia.org/T300152) [10:51:26] (03PS2) 10Ayounsi: Add routed ganeti VIP A record [dns] - 10https://gerrit.wikimedia.org/r/993669 (https://phabricator.wikimedia.org/T300152) [10:52:14] (03CR) 10Muehlenhoff: [C: 03+2] puppet::agent: Remove path condition for 
/run/puppet/disabled [puppet] - 10https://gerrit.wikimedia.org/r/993063 (owner: 10Muehlenhoff) [10:53:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [10:53:19] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [10:54:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P55784 and previous config saved to /var/cache/conftool/dbconfig/20240129-105427-marostegui.json [10:56:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/993669 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:56:39] (03CR) 10Ayounsi: [C: 03+2] Add routed ganeti VIP A record [dns] - 10https://gerrit.wikimedia.org/r/993669 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:58:39] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T1100) [11:00:51] (03PS4) 10Slyngshede: P:debmonitor::server install Debmonitor from package. 
[puppet] - 10https://gerrit.wikimedia.org/r/993086 [11:01:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:01:20] (03CR) 10Effie Mouzeli: "(CC @JMeybohm) I see your points in terms of naming, security, and general readability, albeit there is very little chance we will need an" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:01:38] (03PS5) 10Slyngshede: P:debmonitor::server install Debmonitor from package. [puppet] - 10https://gerrit.wikimedia.org/r/993086 [11:01:52] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1007.eqiad.wmnet with reason: host reimage [11:02:33] RECOVERY - Check systemd state on ganeti2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:51] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:24] (03Abandoned) 10Effie Mouzeli: (DNM) Switch Mediawiki main memcache clusters to puppet 7: all hosts [puppet] - 10https://gerrit.wikimedia.org/r/990661 (https://phabricator.wikimedia.org/T349619) (owner: 10Effie Mouzeli) [11:04:50] (03CR) 10Effie Mouzeli: [C: 03+1] "I will merge after we are done rebooting all mc hosts" [puppet] - 10https://gerrit.wikimedia.org/r/992738 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:05:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1007.eqiad.wmnet with reason: host reimage [11:06:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:08:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [11:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T355609)', diff saved to https://phabricator.wikimedia.org/P55785 and previous config saved to /var/cache/conftool/dbconfig/20240129-110933-marostegui.json [11:09:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:09:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:09:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:09:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T355609)', diff saved to https://phabricator.wikimedia.org/P55786 and previous config saved to /var/cache/conftool/dbconfig/20240129-110955-marostegui.json [11:10:43] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1221/console" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [11:10:52] (03PS3) 10Effie Mouzeli: deployment_server: add mw-mcrouter service 1 [puppet] - 10https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) [11:11:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:11:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1222/console" [puppet] - 
10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [11:12:15] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [11:12:51] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [11:13:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1223/co" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [11:13:27] (03PS3) 10Effie Mouzeli: Add namespace for mw-mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) [11:13:33] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore1006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [11:14:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:14:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T355609)', diff saved to https://phabricator.wikimedia.org/P55787 and previous config saved to /var/cache/conftool/dbconfig/20240129-111434-marostegui.json [11:14:36] (03PS9) 10Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - 10https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [11:14:42] (03PS6) 10Slyngshede: P:debmonitor::server install Debmonitor from package. 
[puppet] - 10https://gerrit.wikimedia.org/r/993086 [11:14:46] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:16:03] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1224/co" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [11:19:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:19:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:02] PROBLEM - Check systemd state on ganeti2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:25:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:26:32] (03CR) 10Muehlenhoff: P:debmonitor::server install Debmonitor from package. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [11:27:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:28:07] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-airflow1007.eqiad.wmnet with OS bullseye [11:29:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P55788 and previous config saved to /var/cache/conftool/dbconfig/20240129-112940-marostegui.json [11:30:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:32:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:32:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:32:40] RECOVERY - Check systemd state on ganeti2034 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:44] (03Abandoned) 10Fabfur: Add missing netmapper for abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/991409 (https://phabricator.wikimedia.org/T355158) (owner: 10Fabfur) [11:37:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:38:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:38:27] !log upload ganeti 3.0.2-3+wmf1 (bookworm package of Ganeti plus backport for SSL chain handling in RAPI) to apt.wikimedia.org T300152 [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] T300152: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 [11:39:49] !log T354700 - Ran mwscript maintenance/sql.php --wiki=testwiki ~/T354700-create-table.sql [11:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] T354700: Draft: Add columns user_autocreate_serial.uas_year and global_user_autocreate_serial.uas_year - https://phabricator.wikimedia.org/T354700 [11:41:06] !log T354700 - Running `foreachwiki maintenance/sql.php ~/T354700-create-table.sql` [11:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:10] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via 
Kubernetes - https://phabricator.wikimedia.org/T290536 (10Lucas_Werkmeister_WMDE) [11:42:29] (03PS1) 10Majavah: hieradata: cloudweb: enable envoy services_proxy on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/993673 (https://phabricator.wikimedia.org/T255568) [11:43:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:44:40] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P55789 and previous config saved to /var/cache/conftool/dbconfig/20240129-114446-marostegui.json [11:44:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:45:08] !log sql.php finished for T354700 [11:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:13] T354700: Draft: Add columns user_autocreate_serial.uas_year and global_user_autocreate_serial.uas_year - https://phabricator.wikimedia.org/T354700 [11:45:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1226/co" [puppet] - 10https://gerrit.wikimedia.org/r/993673 (https://phabricator.wikimedia.org/T255568) (owner: 10Majavah) [11:48:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:49:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki 
errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:49:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:53:04] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:18] !log Running mwscript maintenance/sql.php --wiki=testwiki --wikidb=centralauth ~/T354700-create-table-global.sql for T354700 [11:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:23] T354700: Draft: Add columns user_autocreate_serial.uas_year and global_user_autocreate_serial.uas_year - https://phabricator.wikimedia.org/T354700 [11:53:24] RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:54:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:55:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:58:03] (03PS1) 10Stevemunene: Add dummy keytabs for new an-worker1157-1175 [labs/private] - 10https://gerrit.wikimedia.org/r/993675 (https://phabricator.wikimedia.org/T353776) [11:59:04] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-phabricator-repos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T355609)', diff saved to https://phabricator.wikimedia.org/P55790 and previous config saved to /var/cache/conftool/dbconfig/20240129-115953-marostegui.json [11:59:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:59:59] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:00:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:00:17] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet 
with reason: Maintenance [12:00:21] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::airflow::wmde [12:01:29] (03PS1) 10Muehlenhoff: Switch airflow/wmde to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993676 (https://phabricator.wikimedia.org/T349619) [12:05:22] (03CR) 10Slyngshede: P:debmonitor::server install Debmonitor from package. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [12:06:07] (03PS7) 10Slyngshede: P:debmonitor::server install Debmonitor from package. [puppet] - 10https://gerrit.wikimedia.org/r/993086 [12:06:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance [12:06:09] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:06:15] (03CR) 10Slyngshede: P:debmonitor::server install Debmonitor from package. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [12:06:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance [12:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T355609)', diff saved to https://phabricator.wikimedia.org/P55791 and previous config saved to /var/cache/conftool/dbconfig/20240129-120628-marostegui.json [12:06:33] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:09:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch airflow/wmde to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993676 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:09:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:10:54] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [12:12:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T355609)', diff saved to https://phabricator.wikimedia.org/P55792 and previous config saved to /var/cache/conftool/dbconfig/20240129-121205-marostegui.json [12:12:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:13:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[12:14:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::airflow::wmde [12:17:53] (03CR) 10Slyngshede: [C: 03+2] P:debmonitor::server install Debmonitor from package. [puppet] - 10https://gerrit.wikimedia.org/r/993086 (owner: 10Slyngshede) [12:18:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:19:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:20:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:20:39] RECOVERY - Check systemd state on ganeti2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:21:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-airflow1007.eqiad.wmnet [12:25:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:25:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-airflow1007.eqiad.wmnet [12:26:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:27:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P55793 and previous config saved to /var/cache/conftool/dbconfig/20240129-122713-marostegui.json [12:27:21] 10SRE, 10LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (10jnuche) [12:31:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:33:47] !log installing openssh security updates [12:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:09] (03CR) 10Brouberol: [C: 03+1] Update the spark-operator image name and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273) (owner: 10Btullis) [12:37:21] (03CR) 10FNegri: "This looks good, but I'm confused by the difference with the "-standalone" images. I left a comment in the task." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) (owner: 10Majavah) [12:41:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:42:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P55794 and previous config saved to /var/cache/conftool/dbconfig/20240129-124220-marostegui.json [12:42:24] (03CR) 10Hnowlan: [C: 03+2] kubernetes: make 5 jobrunners kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/992973 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:50:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:51:50] (PuppetFailure) resolved: Puppet has failed on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:55:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:56:11] !log hnowlan@cumin2002 START - 
Cookbook sre.hosts.reimage for host mw2260.codfw.wmnet with OS bullseye [12:56:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:56:25] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2260.codfw.wmnet with OS bullseye [12:57:23] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2355.codfw.wmnet with OS bullseye [12:57:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T355609)', diff saved to https://phabricator.wikimedia.org/P55795 and previous config saved to /var/cache/conftool/dbconfig/20240129-125726-marostegui.json [12:57:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:57:32] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:57:38] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2355.codfw.wmnet with OS bullseye [12:57:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:57:56] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:58:44] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2381.codfw.wmnet with OS bullseye [12:58:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:58:57] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2381.codfw.wmnet with OS bullseye [12:59:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:59:21] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2429.codfw.wmnet with OS bullseye [12:59:21] (ProbeDown) firing: (8) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:34] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2429.codfw.wmnet with OS bullseye [13:00:34] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2445.codfw.wmnet with OS bullseye [13:00:46] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2445.codfw.wmnet with OS bullseye [13:01:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors 
- kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:02:33] PROBLEM - Check systemd state on kubernetes2055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:06:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:07:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:07:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:07:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2117 (T355609)', diff saved to https://phabricator.wikimedia.org/P55796 and previous config saved to /var/cache/conftool/dbconfig/20240129-130724-marostegui.json [13:07:26] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host an-tool1009.eqiad.wmnet with OS bullseye [13:07:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:09:05] (03PS1) 10Volans: setup.py: add missing classifier for Python 3.11 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993687 [13:09:07] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993688 [13:10:23] (03PS2) 10Volans: CHANGELOG: add changelogs for 
release v8.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993688 [13:10:32] (03PS2) 10Slyngshede: Add dependencies for Jquery and debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 [13:10:38] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Jeff_G) >>! In T355433#9485168, @MikhasikRV wrote: >>>! In T355433#9484879, @Jeff_G wrote: >> >> I was able to download the file as F 1-74-0217.PDF. In case one of us gets it to upload... [13:11:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:12:21] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2260.codfw.wmnet with reason: host reimage [13:13:25] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2355.codfw.wmnet with reason: host reimage [13:13:28] (03PS3) 10Slyngshede: Add dependencies for Jquery and debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 [13:13:55] (03CR) 10Slyngshede: Add dependencies for Jquery and debmonitor-client (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 (owner: 10Slyngshede) [13:14:48] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2381.codfw.wmnet with reason: host reimage [13:15:31] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2260.codfw.wmnet with reason: host reimage [13:16:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T355609)', diff saved to https://phabricator.wikimedia.org/P55797 and previous config saved to /var/cache/conftool/dbconfig/20240129-131623-marostegui.json [13:16:32] T355609: Make cuc_id a bigint - 
https://phabricator.wikimedia.org/T355609 [13:16:35] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2429.codfw.wmnet with reason: host reimage [13:16:43] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-tool1009.eqiad.wmnet with reason: host reimage [13:17:25] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2445.codfw.wmnet with reason: host reimage [13:18:11] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2381.codfw.wmnet with reason: host reimage [13:20:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:20:56] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2429.codfw.wmnet with reason: host reimage [13:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:22:05] (03CR) 10Volans: [C: 03+2] setup.py: add missing classifier for Python 3.11 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993687 (owner: 10Volans) [13:23:06] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993688 (owner: 10Volans) [13:23:10] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-tool1009.eqiad.wmnet with reason: host reimage [13:23:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2055 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been 
started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:23:21] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-airflow1006.eqiad.wmnet with OS bullseye [13:25:56] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2445.codfw.wmnet with reason: host reimage [13:26:33] !log Restarting ferm.service on k8s node kubernetes2055 - T354855 [13:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:37] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [13:27:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:27:34] RECOVERY - Check systemd state on kubernetes2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:12] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2355.codfw.wmnet with reason: host reimage [13:29:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 (owner: 10Slyngshede) [13:30:26] (03Merged) 10jenkins-bot: setup.py: add missing classifier for Python 3.11 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993687 (owner: 10Volans) [13:30:28] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/993688 (owner: 10Volans) [13:30:36] (03CR) 10Slyngshede: [C: 03+2] Add dependencies for Jquery and debmonitor-client [software/debmonitor] 
(debian) - 10https://gerrit.wikimedia.org/r/993083 (owner: 10Slyngshede) [13:31:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P55798 and previous config saved to /var/cache/conftool/dbconfig/20240129-133129-marostegui.json [13:32:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:33:22] (03Merged) 10jenkins-bot: Add dependencies for Jquery and debmonitor-client [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/993083 (owner: 10Slyngshede) [13:33:58] (03PS1) 10Hashar: wm-checks-api: direct link to build when only one failed [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/993689 (https://phabricator.wikimedia.org/T355774) [13:35:00] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2260.codfw.wmnet with OS bullseye [13:35:08] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2260.codfw.wmnet with OS bullseye completed: - mw2260 (**PASS**) - Downtimed on Icinga/Alertma... 
[13:36:05] (03PS1) 10Volans: Upstream release v8.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/993690 [13:36:46] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1006.eqiad.wmnet with reason: host reimage [13:37:14] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2381.codfw.wmnet with OS bullseye [13:37:23] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2381.codfw.wmnet with OS bullseye completed: - mw2381 (**PASS**) - Downtimed on Icinga/Alertma... [13:38:55] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:39:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:40:02] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1006.eqiad.wmnet with reason: host reimage [13:40:09] (03PS1) 10Jcrespo: dbbackups: Productionize the grants needed to backup ipoid database [puppet] - 10https://gerrit.wikimedia.org/r/993691 (https://phabricator.wikimedia.org/T355884) [13:40:10] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2429.codfw.wmnet with OS bullseye [13:40:19] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware 
for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2429.codfw.wmnet with OS bullseye completed: - mw2429 (**WARN**) - Downtimed on Icinga/Alertma... [13:42:17] (03CR) 10Arnaudb: [C: 03+1] dbbackups: Productionize the grants needed to backup ipoid database [puppet] - 10https://gerrit.wikimedia.org/r/993691 (https://phabricator.wikimedia.org/T355884) (owner: 10Jcrespo) [13:43:41] (03CR) 10Volans: [C: 03+2] Upstream release v8.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/993690 (owner: 10Volans) [13:44:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:44:21] (ProbeDown) firing: (8) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:46:07] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2445.codfw.wmnet with OS bullseye [13:46:18] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2445.codfw.wmnet with OS bullseye completed: - mw2445 (**PASS**) - Downtimed on 
Icinga/Alertma... [13:46:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P55799 and previous config saved to /var/cache/conftool/dbconfig/20240129-134636-marostegui.json [13:48:05] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2355.codfw.wmnet with OS bullseye [13:48:13] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2355.codfw.wmnet with OS bullseye completed: - mw2355 (**PASS**) - Downtimed on Icinga/Alertma... [13:49:09] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Productionize the grants needed to backup ipoid database [puppet] - 10https://gerrit.wikimedia.org/r/993691 (https://phabricator.wikimedia.org/T355884) (owner: 10Jcrespo) [13:50:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:50:23] (03PS2) 10Anzx: hewikinews: remove wgExtraGenderNamespaces and add wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993494 (https://phabricator.wikimedia.org/T349581) [13:50:32] (03Merged) 10jenkins-bot: Upstream release v8.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/993690 (owner: 10Volans) [13:51:46] (03PS4) 10Anzx: knwiki: add portal namespace and fix talkpagenames of draft and module namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 (https://phabricator.wikimedia.org/T355662) [13:52:20] (03PS14) 10Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [13:53:34] RECOVERY - Check whether ferm is active by checking the 
default input chain on kubernetes2055 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:53:48] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 680 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [13:54:14] !log uploaded spicerack_8.3.0 to apt.wikimedia.org bullseye-wikimedia [13:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:36] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:18] (CR) Gehel: [C: +1] Update the spark-operator image name and version [deployment-charts] - https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273) (owner: Btullis) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T1400). [14:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:11] o/ [14:01:12] (03CR) 10Marostegui: [C: 03+1] "Let's start some manual runs to see how this goes before scheduling it for a daily run" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [14:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T355609)', diff saved to https://phabricator.wikimedia.org/P55801 and previous config saved to /var/cache/conftool/dbconfig/20240129-140142-marostegui.json [14:01:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [14:01:52] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:01:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [14:02:06] o/ [14:02:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T355609)', diff saved to https://phabricator.wikimedia.org/P55802 and previous config saved to /var/cache/conftool/dbconfig/20240129-140205-marostegui.json [14:02:23] (03CR) 10Hashar: [C: 03+2] wm-checks-api: direct link to build when only one failed [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/993689 (https://phabricator.wikimedia.org/T355774) (owner: 10Hashar) [14:02:55] (03Merged) 10jenkins-bot: wm-checks-api: direct link to build when only one failed [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/993689 (https://phabricator.wikimedia.org/T355774) (owner: 10Hashar) [14:03:01] I would like to do some backports in this window which I can self serve [14:03:13] I will do my Gerrit plugin update once the backport window has completed [14:03:21] I guess I’ll start with anzx’ changes then [14:03:27] 👍 [14:03:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:03:55] (03PS15) 10Lucas Werkmeister (WMDE): uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [14:04:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [14:04:26] wow that sure is a lot of T356024 in logspam-watch [14:04:26] T356024: TypeError: Argument 4 passed to Wikimedia\Parsoid\Utils\Title::__construct() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/vendor/wikimedia/parsoid/src/Utils/Title.php on line 392 - https://phabricator.wikimedia.org/T356024 [14:04:47] (03Merged) 10jenkins-bot: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [14:04:59] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992371|uzwiki: revert temporary logo for the 20th anniversary (T353723)]] [14:05:20] T353723: Requesting temporary logo change for uz.wikipedia.org - https://phabricator.wikimedia.org/T353723 [14:07:23] 10SRE, 10Infrastructure-Foundations: Updated java security policy in OpenJDK 8 u265 - https://phabricator.wikimedia.org/T261196 (10MoritzMuehlenhoff) p:05Triage→03Low [14:07:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Backport for [[gerrit:992371|uzwiki: revert temporary logo for the 20th anniversary (T353723)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:26] Lucas_WMDE: checking [14:07:29] ok [14:07:33] looking at the knwiki change at the moment [14:08:30] Lucas_WMDE: looks good [14:09:11] !log lucaswerkmeister-wmde@deploy2002 
lucaswerkmeister-wmde and anzx: Continuing with sync [14:09:37] (03PS1) 10Brouberol: hue: rename python-snappy apt dependency [puppet] - 10https://gerrit.wikimedia.org/r/993692 (https://phabricator.wikimedia.org/T349400) [14:10:22] (03CR) 10Lucas Werkmeister (WMDE): knwiki: add portal namespace and fix talkpagenames of draft and module namespace (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 (https://phabricator.wikimedia.org/T355662) (owner: 10Anzx) [14:10:37] (03CR) 10Gehel: [C: 03+1] hue: rename python-snappy apt dependency [puppet] - 10https://gerrit.wikimedia.org/r/993692 (https://phabricator.wikimedia.org/T349400) (owner: 10Brouberol) [14:10:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-airflow1006.eqiad.wmnet with OS bullseye [14:11:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T355609)', diff saved to https://phabricator.wikimedia.org/P55803 and previous config saved to /var/cache/conftool/dbconfig/20240129-141111-marostegui.json [14:11:12] (03PS1) 10Majavah: P:toolforge: mailrelay: workaround Exim 4.94 taints [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) [14:11:16] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:12:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1227/co" [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [14:13:52] (03CR) 10Brouberol: [C: 03+2] hue: rename python-snappy apt dependency [puppet] - 10https://gerrit.wikimedia.org/r/993692 (https://phabricator.wikimedia.org/T349400) (owner: 10Brouberol) [14:14:51] (03PS5) 10Anzx: knwiki: add portal namespace and fix talkpagenames of draft and module namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 
(https://phabricator.wikimedia.org/T355662) [14:14:59] (03CR) 10Anzx: knwiki: add portal namespace and fix talkpagenames of draft and module namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 (https://phabricator.wikimedia.org/T355662) (owner: 10Anzx) [14:15:20] (03CR) 10CI reject: [V: 04-1] P:toolforge: mailrelay: workaround Exim 4.94 taints [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [14:15:22] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:01] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992371|uzwiki: revert temporary logo for the 20th anniversary (T353723)]] (duration: 11m 01s) [14:16:06] T353723: Requesting temporary logo change for uz.wikipedia.org - https://phabricator.wikimedia.org/T353723 [14:16:27] (03PS1) 10Hashar: gerrit: move soy templates files to unique namespaces [puppet] - 10https://gerrit.wikimedia.org/r/993694 [14:17:01] (03PS1) 10Dreamy Jazz: Send email if file is uploaded that is already a match [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993499 (https://phabricator.wikimedia.org/T355357) [14:17:08] (03PS2) 10Majavah: P:toolforge: mailrelay: workaround Exim 4.94 taints [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) [14:17:12] (03PS1) 10Dreamy Jazz: Make the email subject unique for positive match emails [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993500 (https://phabricator.wikimedia.org/T355752) [14:17:33] !log upgraded spicerack to 8.3.0 on cumin2002 [14:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:06] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1228/co" [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [14:18:58] (03PS6) 10Lucas Werkmeister (WMDE): knwiki: add portal namespace and fix talkpagenames of draft and module namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 (https://phabricator.wikimedia.org/T355662) (owner: 10Anzx) [14:19:43] 10SRE, 10Infrastructure-Foundations, 10serviceops: httpbb needs to be setup on cumin1002 and removed from cumin1001 - https://phabricator.wikimedia.org/T356054 (10Volans) [14:20:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 (https://phabricator.wikimedia.org/T355662) (owner: 10Anzx) [14:20:57] (03Merged) 10jenkins-bot: knwiki: add portal namespace and fix talkpagenames of draft and module namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992783 (https://phabricator.wikimedia.org/T355662) (owner: 10Anzx) [14:21:12] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992783|knwiki: add portal namespace and fix talkpagenames of draft and module namespace (T355662 T346583)]] [14:21:18] T355662: Create portal namespace on kannada wikipedia - https://phabricator.wikimedia.org/T355662 [14:21:19] T346583: Change namespace names for Kannada Language - https://phabricator.wikimedia.org/T346583 [14:21:28] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:22:31] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:992783|knwiki: add portal namespace and fix talkpagenames of draft and module namespace (T355662 T346583)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:36] (03PS1) 10Hashar: gerrit: sync 
soy email template with version 3.7 [puppet] - https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) [14:22:42] Checking [14:23:36] Lucas_WMDE: looks good [14:23:40] ok! [14:23:44] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Continuing with sync [14:23:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ceph2001.codfw.wmnet with OS bullseye [14:24:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:24:42] anzx: for hewikinews – if I understand correctly, they just wanted those extra words (מש etc.) to be additional aliases for the namespace, but instead they accidentally overrode the default gendered namespace from MediaWiki? [14:25:34] Lucas_WMDE: yes, they asked for aliases only [14:26:14] (PS2) Dreamy Jazz: Send email if file is uploaded that is already a match [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/993499 (https://phabricator.wikimedia.org/T355357) [14:26:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P55804 and previous config saved to /var/cache/conftool/dbconfig/20240129-142617-marostegui.json [14:26:29] and $wgExtraGenderNamespaces overrides https://gerrit.wikimedia.org/g/mediawiki/core/+/4637824f68/languages/messages/MessagesHe.php#31 ?
[14:26:41] oops, https://gerrit.wikimedia.org/g/mediawiki/core/+/4637824f68/languages/messages/MessagesHe.php#35 (wrong line number) [14:27:18] hm, but https://he.wikinews.org/wiki/משתמש:Lucas_Werkmeister_(WMDE) still shows משתמש, not מש [14:28:18] (PS1) Majavah: P:toolforge::mailrelay: add Authentication-Results header [puppet] - https://gerrit.wikimedia.org/r/993697 (https://phabricator.wikimedia.org/T354112) [14:30:11] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992783|knwiki: add portal namespace and fix talkpagenames of draft and module namespace (T355662 T346583)]] (duration: 08m 58s) [14:30:17] T355662: Create portal namespace on kannada wikipedia - https://phabricator.wikimedia.org/T355662 [14:30:18] T346583: Change namespace names for Kannada Language - https://phabricator.wikimedia.org/T346583 [14:30:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::airflow::analytics_product [14:31:32] Lucas_WMDE: user reported that viewing short word alias on recent changes https://phabricator.wikimedia.org/T349581#9490150 [14:31:42] (PS1) Muehlenhoff: Switch analytics_cluster::airflow::analytics_product to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/993699 (https://phabricator.wikimedia.org/T349619) [14:32:58] I don’t really understand it but let’s try it anyways I guess [14:33:05] (PS3) Lucas Werkmeister (WMDE): hewikinews: remove wgExtraGenderNamespaces and add wgNamespaceAliases [mediawiki-config] - https://gerrit.wikimedia.org/r/993494 (https://phabricator.wikimedia.org/T349581) (owner: Anzx) [14:33:11] SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan) [14:33:15] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/993494 (https://phabricator.wikimedia.org/T349581) (owner: Anzx) [14:33:57] 
(03Merged) 10jenkins-bot: hewikinews: remove wgExtraGenderNamespaces and add wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993494 (https://phabricator.wikimedia.org/T349581) (owner: 10Anzx) [14:34:12] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:993494|hewikinews: remove wgExtraGenderNamespaces and add wgNamespaceAliases (T349581)]] [14:34:27] T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581 [14:35:08] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::airflow::analytics_product to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993699 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:36:03] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:993494|hewikinews: remove wgExtraGenderNamespaces and add wgNamespaceAliases (T349581)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:17] Lucas_WMDE: checking [14:37:11] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [14:37:25] !log brouberol@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-tool1009.eqiad.wmnet with OS bullseye [14:38:40] Lucas_WMDE: all aliases are working [14:38:41] (03PS1) 10Brouberol: Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 [14:39:09] 10SRE, 10ops-codfw, 10Data-Persistence, 10Data-Persistence-Backup, and 2 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10Marostegui) [14:39:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:23] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) [14:39:38] (03CR) 10Btullis: [C: 03+1] Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [14:39:44] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10Marostegui) [14:40:10] !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Continuing with sync [14:40:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::airflow::analytics_product [14:41:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P55806 and previous config saved to /var/cache/conftool/dbconfig/20240129-144124-marostegui.json [14:41:30] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10Marostegui) [14:41:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:41:48] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10Marostegui) [14:42:05] !log ayounsi@cumin2002 START - Cookbook sre.ganeti.makevm for new host sretest1005.eqiad.wmnet [14:42:07] !log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [14:42:47] (03CR) 10CI reject: [V: 04-1] Revert "hue: rename python-snappy apt dependency" [puppet] - 
10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [14:42:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-airflow1006.eqiad.wmnet [14:43:10] (03CR) 10Joal: cassandra: create template for aqsloader role & grants (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans) [14:43:55] Going to +2 both my backports to get them through CI [14:44:24] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (10Marostegui) [14:44:34] I probably have a third too, but gerrit doesn't let me cherry-pick it until the others are done first due to merge conflicts. [14:44:36] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (10Marostegui) [14:44:44] (03CR) 10Dreamy Jazz: [C: 03+2] Make the email subject unique for positive match emails [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993500 (https://phabricator.wikimedia.org/T355752) (owner: 10Dreamy Jazz) [14:44:46] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 (10Marostegui) [14:44:47] ack [14:44:56] (03CR) 10Dreamy Jazz: [C: 03+2] Send email if file is uploaded that is already a match [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993499 (https://phabricator.wikimedia.org/T355357) (owner: 10Dreamy Jazz) [14:44:57] (03PS1) 10Hnowlan: tegola-vector-tiles: add maps primaries to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/993700 (https://phabricator.wikimedia.org/T355892) [14:44:59] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw 
rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 (Marostegui) [14:45:15] (CR) Muehlenhoff: Revert "hue: rename python-snappy apt dependency" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/993501 (owner: Brouberol) [14:45:18] re “This means that the inbox of the email addresses displays each report as a reply to previous reports”, I’m tempted to say that the solution is to stop using email clients that hallucinate In-Reply-To headers :P [14:45:21] but who am I kidding [14:45:33] google does whatever google wants [14:45:43] and everyone else just has to live with it [14:45:50] Yup :D [14:46:14] * Lucas_WMDE is not at all mad about T355712 either [14:46:42] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:993494|hewikinews: remove wgExtraGenderNamespaces and add wgNamespaceAliases (T349581)]] (duration: 12m 29s) [14:46:46] Dreamy_Jazz: over to you [14:46:48] T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581 [14:46:49] Lucas_WMDE: thank you [14:46:53] (fyi hashar) [14:46:58] Lucas_WMDE: Thanks. [14:46:59] anzx: np :) [14:47:06] thank you!
[14:47:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993500 (https://phabricator.wikimedia.org/T355752) (owner: 10Dreamy Jazz) [14:47:23] (03Merged) 10jenkins-bot: Make the email subject unique for positive match emails [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993500 (https://phabricator.wikimedia.org/T355752) (owner: 10Dreamy Jazz) [14:47:26] (03Merged) 10jenkins-bot: Send email if file is uploaded that is already a match [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993499 (https://phabricator.wikimedia.org/T355357) (owner: 10Dreamy Jazz) [14:47:32] (03PS2) 10Brouberol: Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 [14:47:36] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:993500|Make the email subject unique for positive match emails (T355752)]] [14:47:41] T355752: Make the email subject unique for MediaModeration emails - https://phabricator.wikimedia.org/T355752 [14:47:42] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) db2142 - x2 master db2103 - s1 master es2020 - es4 master [14:48:19] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (10Marostegui) db2146 - slave db2106 - slave [14:48:38] (03PS1) 10Hnowlan: conftool: restore maps primary servers to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/993702 (https://phabricator.wikimedia.org/T355892) [14:48:41] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - 
https://phabricator.wikimedia.org/T355870 (10Marostegui) db2108 - slave db2123 - slave es2021 - es4 master [14:48:56] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM sretest1005.eqiad.wmnet - ayounsi@cumin2002" [14:49:30] (03PS1) 10Dreamy Jazz: Follow-up changes for MediaModerationEmailer service [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993502 (https://phabricator.wikimedia.org/T351407) [14:49:49] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM sretest1005.eqiad.wmnet - ayounsi@cumin2002" [14:49:49] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:50] !log ayounsi@cumin2002 START - Cookbook sre.dns.wipe-cache sretest1005.eqiad.wmnet on all recursors [14:49:53] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1005.eqiad.wmnet on all recursors [14:50:17] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10Marostegui) db2183 - codfw backup master @jcrespo [14:50:19] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM sretest1005.eqiad.wmnet - ayounsi@cumin2002" [14:50:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet [14:51:05] scap backport is waiting a while on `K8s images build/push output redirected to /home/dreamyjazz/scap-image-build-and-push-log` [14:51:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM sretest1005.eqiad.wmnet 
- ayounsi@cumin2002" [14:51:50] !log dreamyjazz@deploy2002 sync-world aborted: Backport for [[gerrit:993500|Make the email subject unique for positive match emails (T355752)]] (duration: 04m 13s) [14:51:51] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10Marostegui) db2121 - slave db2132 m1 master (not used) db2145 - slave db2104 - m2 master db2153 - slave db2154 - slave db2... [14:52:00] ohhhh, it touched i18n/ [14:52:05] that might make for a larger sync than usual [14:52:14] though I think the worst offender of this was fixed recently-ish [14:52:16] not sure [14:52:17] Oh I see. [14:52:21] (03PS3) 10Brouberol: Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 [14:52:38] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [14:52:38] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:993500|Make the email subject unique for positive match emails (T355752)]] [14:52:46] T355752: Make the email subject unique for MediaModeration emails - https://phabricator.wikimedia.org/T355752 [14:53:00] I'll be patient then :) [14:53:25] (03CR) 10Paladox: [C: 03+1] gerrit: move soy templates files to unique namespaces [puppet] - 10https://gerrit.wikimedia.org/r/993694 (owner: 10Hashar) [14:53:29] (03CR) 10CI reject: [V: 04-1] Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [14:53:33] (03CR) 10Paladox: [C: 03+1] gerrit: sync soy email template with version 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: 10Hashar) [14:53:37] (03CR) 10Brouberol: Revert "hue: rename python-snappy apt dependency" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [14:53:46] !log 
ayounsi@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1005.eqiad.wmnet with OS bookworm [14:53:46] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host sretest1005.eqiad.wmnet [14:54:07] !log scap backport is also backporting 993499 for T355357 [14:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:13] T355357: Send an email to indicate a match if a file is uploaded that is already marked as a match - https://phabricator.wikimedia.org/T355357 [14:54:31] (03PS4) 10Brouberol: Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 [14:55:40] (03CR) 10CI reject: [V: 04-1] Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [14:56:11] (03CR) 10Paladox: [C: 03+1] gerrit: sync soy email template with version 3.7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: 10Hashar) [14:56:23] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet [14:56:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T355609)', diff saved to https://phabricator.wikimedia.org/P55807 and previous config saved to /var/cache/conftool/dbconfig/20240129-145630-marostegui.json [14:56:32] !log ayounsi@cumin2002 START - Cookbook sre.hosts.decommission for hosts sretest1005.eqiad.wmnet [14:56:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [14:56:36] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:56:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [14:56:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2129 
(T355609)', diff saved to https://phabricator.wikimedia.org/P55808 and previous config saved to /var/cache/conftool/dbconfig/20240129-145652-marostegui.json [14:57:22] dammit, I can’t find the task I remember that made scap faster in certain situations where i18n was touched [14:57:28] so far I’ve only found T307277 which is still open [14:57:28] T307277: Make it easier to deploy backports with i18n changes - https://phabricator.wikimedia.org/T307277 [14:57:53] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [14:58:31] It has now proceeded to docker_pull_k8s [14:58:37] !log hashar@deploy2002 Started deploy [gerrit/gerrit@5594608]: wm-checks-api: direct link to build when only one failed - T355774 [14:58:42] T355774: One-click access to build logs gone after upgrade - https://phabricator.wikimedia.org/T355774 [14:58:45] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@5594608]: wm-checks-api: direct link to build when only one failed - T355774 (duration: 00m 07s) [14:58:49] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10Marostegui) db2155 - slave db2156 - slave db2097 - backups slave @jcrespo db2105 - s3 master db2122 - slave db2133 - m2 ma... 
[14:59:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T355609)', diff saved to https://phabricator.wikimedia.org/P55809 and previous config saved to /var/cache/conftool/dbconfig/20240129-145902-marostegui.json [14:59:07] (03CR) 10Effie Mouzeli: [C: 03+2] mc: Switch to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/992738 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:59:23] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:36] !log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [15:03:12] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 (10Marostegui) db2098 - backup slave @jcrespo db2110 - slave db2111 - slave db2124 - slave db2134 - m3 master (not used) db20... [15:04:17] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:993500|Make the email subject unique for positive match emails (T355752)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:04:20] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [15:04:22] T355752: Make the email subject unique for MediaModeration emails - https://phabricator.wikimedia.org/T355752 [15:04:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [15:04:55] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 (10Marostegui) db2148 - slave db2163 - slave db2185 zarcillo dc master (nothing required) db2164 - slave db2189 - slave es2029... 
[15:05:03] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10jcrespo) Thank you, I will shutdown media backups anyway every time one host is affected, not just this one, to minimize fa... [15:07:27] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2112 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/993469 (https://phabricator.wikimedia.org/T356059) [15:07:31] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/993470 (https://phabricator.wikimedia.org/T356059) [15:08:04] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) [15:09:27] (03CR) 10Hashar: "I have made a diff between upstream and our Puppet files but had the files inverted in my diff editor :)" [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: 10Hashar) [15:09:43] (03PS2) 10Hashar: gerrit: sync soy email template with version 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) [15:10:16] (03Abandoned) 10Dreamy Jazz: Follow-up changes for MediaModerationEmailer service [extensions/MediaModeration] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993502 (https://phabricator.wikimedia.org/T351407) (owner: 10Dreamy Jazz) [15:11:39] (03PS5) 10Brouberol: Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 [15:11:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1006.eqiad.wmnet [15:12:01] 10SRE, 10Wikimedia-Mailing-lists: Request for BHL-WIKI Group List - https://phabricator.wikimedia.org/T355941 (10JJFord_BHL) Thank you!! 
[15:12:08] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin2002" [15:13:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin2002" [15:13:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:03] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest1005.eqiad.wmnet [15:13:13] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin2002 for hosts: `sretest1005.eqiad.wmnet` - sretest1005.eqiad.wmnet (... 
[15:13:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [15:13:59] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:993500|Make the email subject unique for positive match emails (T355752)]] (duration: 21m 21s) [15:14:06] T355752: Make the email subject unique for MediaModeration emails - https://phabricator.wikimedia.org/T355752 [15:14:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P55810 and previous config saved to /var/cache/conftool/dbconfig/20240129-151409-marostegui.json [15:14:41] hashar: That's my backports deployed [15:15:01] (03PS1) 10Ilias Sarantopoulos: ml-services: test GPU with article-descriptions model [deployment-charts] - 10https://gerrit.wikimedia.org/r/993707 [15:15:09] !log afternoon UTC backport window done [15:15:12] 21 minutes doh [15:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:26] anyway happy to see that has completed [15:16:18] (03CR) 10Brouberol: [C: 03+2] Revert "hue: rename python-snappy apt dependency" [puppet] - 10https://gerrit.wikimedia.org/r/993501 (owner: 10Brouberol) [15:17:01] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host an-tool1009.eqiad.wmnet with OS buster [15:17:19] !log Stopping mediamoderation scanning script [15:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:30] !log Running `foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/resendMatchEmails.php 20200405` [15:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:12] (03CR) 10Klausman: [C: 03+1] ml-services: test GPU with article-descriptions model [deployment-charts] - 10https://gerrit.wikimedia.org/r/993707 (owner: 10Ilias Sarantopoulos) [15:21:25] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: test GPU with 
article-descriptions model [deployment-charts] - 10https://gerrit.wikimedia.org/r/993707 (owner: 10Ilias Sarantopoulos) [15:21:33] !log Running `foreachwikiindblist group1.dblist extensions/MediaModeration/maintenance/resendMatchEmails.php 20200405 --verbose` [15:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:17] (03Merged) 10jenkins-bot: ml-services: test GPU with article-descriptions model [deployment-charts] - 10https://gerrit.wikimedia.org/r/993707 (owner: 10Ilias Sarantopoulos) [15:23:18] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:24:11] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:26:09] !log Running MediaModeration scanning script using `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` on a tmux session. 
[15:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:17] (03PS1) 10Brouberol: Build hue for Debian Bullseye by default [debs/hue] - 10https://gerrit.wikimedia.org/r/993708 (https://phabricator.wikimedia.org/T349400) [15:28:21] (03PS1) 10Esanders: DiscussionTools: Enable permalinks frontend everywhere except en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993709 (https://phabricator.wikimedia.org/T356063) [15:29:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P55811 and previous config saved to /var/cache/conftool/dbconfig/20240129-152915-marostegui.json [15:30:52] (03CR) 10Muehlenhoff: Build hue for Debian Bullseye by default (032 comments) [debs/hue] - 10https://gerrit.wikimedia.org/r/993708 (https://phabricator.wikimedia.org/T349400) (owner: 10Brouberol) [15:30:55] (03CR) 10Clément Goubert: [C: 03+1] tegola-vector-tiles: add maps primaries to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/993700 (https://phabricator.wikimedia.org/T355892) (owner: 10Hnowlan) [15:31:02] (03CR) 10Clément Goubert: [C: 03+1] conftool: restore maps primary servers to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/993702 (https://phabricator.wikimedia.org/T355892) (owner: 10Hnowlan) [15:31:39] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-tool1009.eqiad.wmnet with reason: host reimage [15:34:50] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-tool1009.eqiad.wmnet with reason: host reimage [15:35:32] 10SRE, 10Infrastructure-Foundations, 10Mail: Puppetry - https://phabricator.wikimedia.org/T325395 (10jhathaway) p:05Triage→03Medium [15:35:35] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10xcollazo) Hey @Dzahn. 
I did receive your test email. However, I do not see it on https://groups.google.com/a/wikimedia.org/g/ops-dumps, so it doesn’t seem like i... [15:36:26] 10SRE, 10Infrastructure-Foundations, 10Mail: Provision mta-inbound-lists - https://phabricator.wikimedia.org/T325404 (10jhathaway) p:05Triage→03Low [15:36:36] 10SRE, 10Infrastructure-Foundations, 10Mail: Provision mta-outbound-lists - https://phabricator.wikimedia.org/T325405 (10jhathaway) p:05Triage→03Medium [15:36:55] 10SRE, 10Infrastructure-Foundations, 10Mail: MTA Provisioning - https://phabricator.wikimedia.org/T325403 (10jhathaway) p:05Triage→03Medium [15:37:11] 10SRE, 10Infrastructure-Foundations, 10Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394 (10jhathaway) p:05Triage→03Medium [15:38:43] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: admin: Add validation checks for missing realname and email in data.yaml - https://phabricator.wikimedia.org/T320937 (10jhathaway) p:05Triage→03Low [15:39:17] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: SSH Key type expiry - https://phabricator.wikimedia.org/T347572 (10joanna_borun) p:05Triage→03Medium [15:40:20] 10SRE, 10Traffic: create a puppetized abstraction for haproxy blocklist hysteresis - https://phabricator.wikimedia.org/T329331 (10CDanis) @Fabfur just wanted to make sure you've seen this task, it is decent documentation of the existing mechanism and probably helpful for doing T353910 [15:40:33] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [15:40:41] 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10SLyngshede-WMF) [15:40:46] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405 (10SLyngshede-WMF) [15:40:54] (03CR) 10Dzahn: [C: 03+2] 
miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993454 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [15:41:07] 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [15:41:15] 10SRE, 10Infrastructure-Foundations: Further enhancements for nftables support in profile::firewall - https://phabricator.wikimedia.org/T348498 (10MoritzMuehlenhoff) [15:41:19] 10SRE, 10Infrastructure-Foundations, 10Traffic: NEL: don't alert on domains we don't control - https://phabricator.wikimedia.org/T349807 (10CDanis) p:05Triage→03Medium [15:41:25] 10SRE, 10Infrastructure-Foundations: Monitoring check for nftables - https://phabricator.wikimedia.org/T348499 (10MoritzMuehlenhoff) 05Open→03In progress p:05Triage→03Medium a:03MoritzMuehlenhoff [15:42:04] (03PS3) 10Clément Goubert: httpbb: Migrate to cumin1002 [puppet] - 10https://gerrit.wikimedia.org/r/993710 (https://phabricator.wikimedia.org/T356054) [15:42:06] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993454 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [15:42:24] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132 (10joanna_borun) [15:42:51] (03PS1) 10Marostegui: db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993711 (https://phabricator.wikimedia.org/T356064) [15:43:39] 10SRE, 10Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (10ayounsi) p:05Triage→03Medium [15:44:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T355609)', diff saved to https://phabricator.wikimedia.org/P55814 
and previous config saved to /var/cache/conftool/dbconfig/20240129-154422-marostegui.json [15:44:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:44:31] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:44:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:44:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T355609)', diff saved to https://phabricator.wikimedia.org/P55815 and previous config saved to /var/cache/conftool/dbconfig/20240129-154444-marostegui.json [15:46:24] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10Wikimedia-Apache-configuration, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10joanna_borun) [15:47:04] (03CR) 10Effie Mouzeli: [C: 03+2] Remove outdated stretch exclusion for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/979319 (owner: 10Awight) [15:48:07] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) [15:48:17] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/993472 (https://phabricator.wikimedia.org/T356069) [15:48:17] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/993473 (https://phabricator.wikimedia.org/T356069) [15:48:49] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10Marostegui) [15:49:13] (03PS1) 10Hnowlan: kubernetes: make 5 jobrunners kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/993714 
(https://phabricator.wikimedia.org/T354791) [15:49:37] (03PS2) 10Hnowlan: kubernetes: make 5 jobrunners kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/993714 (https://phabricator.wikimedia.org/T354791) [15:51:37] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) Hey @xcollazo I am actually not sure if we expect it to show up in that group inbox. As far as I know there are different options in Google, shared inbox... [15:52:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.78% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:52:45] (03PS2) 10Effie Mouzeli: Remove outdated stretch exclusion for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/979319 (owner: 10Awight) [15:53:16] (03CR) 10Effie Mouzeli: [C: 03+2] Remove outdated stretch exclusion for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/979319 (owner: 10Awight) [15:53:28] (03CR) 10Scott French: [C: 03+2] "Thanks, Moritz!" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/993183 (owner: 10Scott French) [15:53:35] 10SRE-swift-storage, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10joanna_borun) [15:53:44] (03CR) 10Scott French: [V: 03+2 C: 03+2] Ensure ssh-agent services are also enabled [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/993183 (owner: 10Scott French) [15:54:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T355609)', diff saved to https://phabricator.wikimedia.org/P55816 and previous config saved to /var/cache/conftool/dbconfig/20240129-155406-marostegui.json [15:54:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:54:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:55:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/200: 0.42748183472170714s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:58:09] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-tool1009.eqiad.wmnet with OS buster [15:59:16] (MediaWikiLatencyExceeded) resolved: (2) Average latency high: eqiad mw-api-int (k8s) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:00:02] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: migrate distributed locking to etcd v3 - https://phabricator.wikimedia.org/T352155 (10Volans) p:05Triage→03Medium [16:00:08] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: adapt conftool module for etcd v3 - https://phabricator.wikimedia.org/T352153 (10Volans) p:05Triage→03Medium [16:00:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/200: 0.42748183472170714s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceed [16:00:29] (03CR) 10DLynch: [C: 03+1] DiscussionTools: Enable permalinks frontend everywhere except en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993709 (https://phabricator.wikimedia.org/T356063) (owner: 10Esanders) [16:01:39] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) [16:02:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (10MoritzMuehlenhoff) 05Open→03In progress p:05Triage→03Low [16:02:15] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 47.78% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:03:11] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10cloud-services-team, 10LDAP: Create a single application to 
provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10SLyngshede-WMF) 05Open→03Declined We're already working on Bitu, which has at least some overlap wit... [16:03:14] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue alert [warning] Systemd units failing on debmonitor2003 - https://phabricator.wikimedia.org/T343897 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The migration of debmonitor/bookworm/packaged debmonitor has now progre... [16:03:53] (03PS1) 10Dbrant: [WIP] Add labs config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [16:06:05] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) @SLyngshede-WMF it worked! I can login to wikitech now. [16:06:21] (03CR) 10Ladsgroup: [C: 03+1] db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993711 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [16:06:37] 10SRE, 10Infrastructure-Foundations: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10joanna_borun) p:05Triage→03Low [16:07:44] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Jclark-ctr) @Marostegui Dell has requested firmware updates and reseating device NetXtreme BCM5720 Gigabit Ethernet PCIe on bus 4. When is a good time to take server down for reseating and... 
[16:07:47] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2019.codfw.wmnet with reason: Decommissioning — T352469 [16:07:52] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469 [16:08:01] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2019.codfw.wmnet with reason: Decommissioning — T352469 [16:08:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Marostegui) @Jclark-ctr I can switch it off any day starting tomorrow, when would it work for you? [16:09:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P55817 and previous config saved to /var/cache/conftool/dbconfig/20240129-160913-marostegui.json [16:09:16] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Jclark-ctr) Yes that works for me Thanks [16:10:00] !log decommissioning restbase2019/cassandra-{a,b,c} — T352469 [16:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:24] jouncebot: nowandnext [16:10:24] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [16:10:24] In 0 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T1630) [16:10:33] (03PS2) 10Ladsgroup: Drop old virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992129 [16:10:35] (03CR) 10Ladsgroup: [C: 03+2] Drop old virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992129 (owner: 10Ladsgroup) [16:11:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992129 (owner: 10Ladsgroup) [16:11:16] (03Abandoned) 
10Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [16:11:24] (03Merged) 10jenkins-bot: Drop old virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992129 (owner: 10Ladsgroup) [16:11:30] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:992129|Drop old virtual domain for url shortener]] [16:11:45] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10joanna_borun) p:05Triage→03Medium [16:12:58] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:992129|Drop old virtual domain for url shortener]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:13:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:13:32] (03PS2) 10Ebernhardson: cirrus: Disable cloudelastic writes to testwiki and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) [16:13:54] 10SRE, 10Infrastructure-Foundations: Remove cumin1001 from router ACLs - https://phabricator.wikimedia.org/T353525 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:14:29] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:14:42] 10SRE, 10Infrastructure-Foundations: Migrate Spicerack logs from cumin1001 to cumin1002? 
- https://phabricator.wikimedia.org/T353523 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:17:27] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 (10cmooney) [16:18:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:18:34] 10SRE-swift-storage: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10joanna_borun) [16:18:52] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920 (10cmooney) [16:19:14] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (10CDanis) p:05Triage→03Low a:03CDanis [16:19:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Consider deprecation of WMF styleguide checks - https://phabricator.wikimedia.org/T353648 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:20:21] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) p:05Triage→03Low a:05JameelKaisar→03CDanis [16:20:55] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:992129|Drop old virtual domain for url shortener]] (duration: 09m 24s) [16:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:24:08] 10SRE-tools, 10Infrastructure-Foundations: Package pyGNMI and dictdiffer to 
be used by cookbooks - https://phabricator.wikimedia.org/T340045 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:24:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P55819 and previous config saved to /var/cache/conftool/dbconfig/20240129-162420-marostegui.json [16:24:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Marostegui) Great, I will comment on this task once it is off. Thank you! [16:28:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (10SLyngshede-WMF) p:05Triage→03Low [16:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T1630). [16:32:40] (03PS1) 10Muehlenhoff: airflow/analytics_product: Keep Python 2 [puppet] - 10https://gerrit.wikimedia.org/r/993727 (https://phabricator.wikimedia.org/T335261) [16:33:40] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993728 (https://phabricator.wikimedia.org/T128546) [16:34:56] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Disk (sdl) failed in ms-be1068 - https://phabricator.wikimedia.org/T356033 (10Jclark-ctr) Started case with dell ordered replacement drive. You have successfully submitted request SR184210022. In mean time i have swapped 8tb failed drive with one we ha... 
[16:35:09] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993728 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:18] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Disk (sdl) failed in ms-be1068 - https://phabricator.wikimedia.org/T356033 (10Jclark-ctr) p:05High→03Low a:03Jclark-ctr [16:35:54] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993728 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:36:52] !log installed spicerack 8.3.0 on cumin1002, cumin1001 [16:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:05] (03CR) 10Ayounsi: [C: 03+2] sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:39:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T355609)', diff saved to https://phabricator.wikimedia.org/P55820 and previous config saved to /var/cache/conftool/dbconfig/20240129-163926-marostegui.json [16:39:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:39:32] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:39:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:39:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:39:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:40:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T355609)', diff saved to https://phabricator.wikimedia.org/P55821 and previous config saved to 
/var/cache/conftool/dbconfig/20240129-164005-marostegui.json [16:40:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [16:41:12] 🎉 [16:41:45] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Disk (sdl) failed in ms-be1068 - https://phabricator.wikimedia.org/T356033 (10MatthewVernon) @Jclark-ctr thank you for the quick swap, much appreciated :-) [16:43:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 43.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:44:44] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:993728| Bumping portals to master (T128546)]] (duration: 07m 04s) [16:44:56] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:45:00] (03Merged) 10jenkins-bot: sre.ganeti: add support for routed Ganeti [cookbooks] - 10https://gerrit.wikimedia.org/r/991348 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:46:13] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10bcampbell) @jhathaway I do not know what CNAME record 4 is for. I can ask Sandra to connect me with Sh... [16:47:11] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 (10bcampbell) Thanks @ssingh. All looks good on the Shopify end for this instance. It says are domain is authenticating... 
[16:48:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:48:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T355609)', diff saved to https://phabricator.wikimedia.org/P55822 and previous config saved to /var/cache/conftool/dbconfig/20240129-164846-marostegui.json [16:48:53] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:50:15] (03PS1) 10Lucas Werkmeister (WMDE): Log more information on LexemePatcher errors [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993503 (https://phabricator.wikimedia.org/T284061) [16:50:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:50:27] ^ I’d like to deploy this backport (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/993503) in ten minutes or so if nobody objects [16:50:31] should be a harmless logging improvement [16:51:13] Lucas_WMDE: Please wait, the infrastructure is under stress right now as you can see from the PHPFPMTooBusy alert above [16:51:17] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 (10ssingh) 05Open→03Resolved a:03ssingh Thanks for letting us know @bcampbell. I am marking this as resolved; in... 
[16:51:22] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:993728| Bumping portals to master (T128546)]] (duration: 06m 37s) [16:51:27] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:51:41] claime: ok [16:52:32] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 (10bcampbell) Thanks @ssingh . The other Shopify instance still needs the CNAME records added it looks like, but we are... [16:54:51] (03PS1) 10Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/993729 [16:55:16] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 45.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:56:16] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10VRiley-WMF) a:03VRiley-WMF [16:56:57] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10VRiley-WMF) 05Open→03Resolved [16:57:08] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/993729 (owner: 10Ilias Sarantopoulos) [16:57:22] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1006.eqiad.wmnet - https://phabricator.wikimedia.org/T354743 (10VRiley-WMF) This has been removed and decommissioned 
[16:57:55] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1134.eqiad.wmnet - https://phabricator.wikimedia.org/T355740 (10VRiley-WMF) a:03VRiley-WMF [16:58:24] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10jhathaway) It appears to point to an SPF record: ` u13504486.wl237.sendgrid.net. 1740 IN TXT "v=spf... [16:59:36] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10bcampbell) @jhathaway I reached out to Sandra requesting that I be connected with our Shopify rep for... [17:01:52] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1134.eqiad.wmnet - https://phabricator.wikimedia.org/T355740 (10VRiley-WMF) 05Open→03Resolved [17:03:04] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1134.eqiad.wmnet - https://phabricator.wikimedia.org/T355740 (10VRiley-WMF) This server has been removed and decommissioned. 
[17:03:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P55823 and previous config saved to /var/cache/conftool/dbconfig/20240129-170353-marostegui.json [17:04:03] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10VRiley-WMF) a:03VRiley-WMF [17:06:13] (03PS179) 10Arnaudb: mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) [17:06:15] (03CR) 10Arnaudb: "I've tried to take note of all previous remarks" [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [17:06:50] (03PS1) 10Reedy: Fix casing of Mediawiki to MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/993732 [17:07:29] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10VRiley-WMF) This server has been removed and decommissioned. 
[17:07:35] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1005.eqiad.wmnet - https://phabricator.wikimedia.org/T354742 (10VRiley-WMF) 05Open→03Resolved [17:13:43] 10SRE, 10Wikimedia-Site-requests: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10taavi) [17:14:22] * Lucas_WMDE off, will not deploy that backport today (maybe tomorrow, we’ll see) [17:19:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P55824 and previous config saved to /var/cache/conftool/dbconfig/20240129-171859-marostegui.json [17:20:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix casing of Mediawiki to MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/993732 (owner: 10Reedy) [17:25:50] (03PS1) 10Reedy: Fix casing of Mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993738 [17:34:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T355609)', diff saved to https://phabricator.wikimedia.org/P55828 and previous config saved to /var/cache/conftool/dbconfig/20240129-173406-marostegui.json [17:34:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:34:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:34:12] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:34:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:34:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:34:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171:3316 (T355609)', diff saved to 
https://phabricator.wikimedia.org/P55829 and previous config saved to /var/cache/conftool/dbconfig/20240129-173435-marostegui.json [17:35:53] (03PS1) 10Stevemunene: hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/993742 (https://phabricator.wikimedia.org/T353776) [17:35:55] (03PS1) 10Stevemunene: hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/993743 (https://phabricator.wikimedia.org/T353776) [17:38:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:38:55] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:39:03] did something change recently about the HTTPS certificates used by wmcloud.org? 
[17:39:23] i'm getting verification errors in a script that worked before [17:42:19] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:42:51] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:42:52] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:43:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:43:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T355609)', diff saved to https://phabricator.wikimedia.org/P55830 and previous config saved to /var/cache/conftool/dbconfig/20240129-174327-marostegui.json [17:43:30] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:43:31] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:43:37] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:43:58] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:45:13] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [17:45:24] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10VRiley-WMF) This server has been removed and decommissioned [17:45:39] (ProbeDown) firing: (6) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:45:47] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission druid1004.eqiad.wmnet - https://phabricator.wikimedia.org/T354741 (10VRiley-WMF) [17:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P55831 and previous config saved to /var/cache/conftool/dbconfig/20240129-175833-marostegui.json [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T1800) [18:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T1800). [18:11:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:11:58] (03CR) 10Volans: "change LGTM but this will not remove the existing timers from cumin1001. Is there an easy way to absent the resources in the current puppe" [puppet] - 10https://gerrit.wikimedia.org/r/993710 (https://phabricator.wikimedia.org/T356054) (owner: 10Clément Goubert) [18:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P55832 and previous config saved to /var/cache/conftool/dbconfig/20240129-181340-marostegui.json [18:14:31] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Dzahn) > The only node left to relocation is gitlab2002. 
downtime of gitlab announced for tomorrow, Jan 30, 8:30 to 8:40 PST and banner added, for moving gitlab2002 [18:16:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:21:36] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/993729 (owner: 10Ilias Sarantopoulos) [18:23:07] (03Merged) 10jenkins-bot: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/993729 (owner: 10Ilias Sarantopoulos) [18:23:50] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [18:24:50] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[18:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T355609)', diff saved to https://phabricator.wikimedia.org/P55833 and previous config saved to /var/cache/conftool/dbconfig/20240129-182846-marostegui.json [18:28:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [18:28:52] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:29:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [18:29:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T355609)', diff saved to https://phabricator.wikimedia.org/P55834 and previous config saved to /var/cache/conftool/dbconfig/20240129-182909-marostegui.json [18:32:39] (03PS1) 10Ebernhardson: cirrus updater: Remove consumer-devnull service [deployment-charts] - 10https://gerrit.wikimedia.org/r/993754 (https://phabricator.wikimedia.org/T352335) [18:32:41] (03PS1) 10Ebernhardson: cirrus: Expand production deployment wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/993755 (https://phabricator.wikimedia.org/T352335) [18:34:09] (03PS2) 10Dbrant: [WIP] Add labs config to test Contact page for account vanishing. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [18:37:14] (03PS10) 10Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) [18:41:58] (03PS11) 10Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) [18:45:46] (03PS3) 10Ebernhardson: cirrus: Disable cloudelastic writes to testwiki and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) [18:46:58] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Remove consumer-devnull service [deployment-charts] - 10https://gerrit.wikimedia.org/r/993754 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [18:47:07] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/993089 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [18:47:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T355609)', diff saved to https://phabricator.wikimedia.org/P55835 and previous config saved to /var/cache/conftool/dbconfig/20240129-184735-marostegui.json [18:47:41] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:47:41] (03CR) 10Ayounsi: [C: 03+2] Homer-public: add Ganeti BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [18:47:45] (03Merged) 10jenkins-bot: cirrus updater: Remove consumer-devnull service [deployment-charts] - 10https://gerrit.wikimedia.org/r/993754 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [18:49:42] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:49:53] !log ebernhardson@deploy2002 helmfile 
[staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:49:54] !log brouberol@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop test cluster: Restart of jvm daemons. [18:52:47] (03PS12) 10Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) [18:54:20] (03CR) 10Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [18:54:34] (03Merged) 10jenkins-bot: Homer-public: add Ganeti BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/993090 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [18:56:05] (03CR) 10Ayounsi: [C: 03+2] wmf-netbox: add Ganeti BGP group support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/993089 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [18:56:58] (03PS2) 10Brouberol: Build hue for Debian Bullseye by default [debs/hue] - 10https://gerrit.wikimedia.org/r/993708 (https://phabricator.wikimedia.org/T349400) [18:57:08] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Expand production deployment wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/993755 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [18:58:04] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: CR993089 - ayounsi@cumin1002 [18:58:13] (03Merged) 10jenkins-bot: cirrus: Expand production deployment wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/993755 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [18:59:51] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:59:55] !log ebernhardson@deploy2002 helmfile 
[codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:00:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: CR993089 - ayounsi@cumin1002 [19:01:22] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:01:24] (03PS1) 10EoghanGaffney: [phabricator] Fix commenting on tasks by email [puppet] - 10https://gerrit.wikimedia.org/r/993759 (https://phabricator.wikimedia.org/T356077) [19:01:31] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:02:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P55836 and previous config saved to /var/cache/conftool/dbconfig/20240129-190241-marostegui.json [19:03:56] (03PS2) 10Jdlrobson: Use desktop history page HTML everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) [19:04:53] (03PS1) 10Ayounsi: vms_import policy: fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/993760 (https://phabricator.wikimedia.org/T300152) [19:05:41] (03PS4) 10Paladox: Phabricator: switch python to python3 in phab_epipe [puppet] - 10https://gerrit.wikimedia.org/r/993766 (https://phabricator.wikimedia.org/T356077) [19:06:42] (03CR) 10Ayounsi: [C: 03+2] vms_import policy: fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/993760 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [19:07:22] (03Merged) 10jenkins-bot: vms_import policy: fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/993760 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [19:07:58] (03Abandoned) 10Paladox: Phabricator: switch python to python3 in phab_epipe [puppet] - 10https://gerrit.wikimedia.org/r/993766 (https://phabricator.wikimedia.org/T356077) (owner: 10Paladox) 
[19:11:17] (03CR) 10Dzahn: [C: 03+1] [phabricator] Fix commenting on tasks by email [puppet] - 10https://gerrit.wikimedia.org/r/993759 (https://phabricator.wikimedia.org/T356077) (owner: 10EoghanGaffney) [19:17:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P55837 and previous config saved to /var/cache/conftool/dbconfig/20240129-191748-marostegui.json [19:18:27] (03PS1) 10Bking: cloudelastic: Add migration canary to cloudelastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/993764 (https://phabricator.wikimedia.org/T355617) [19:19:48] (03CR) 10Dzahn: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [19:20:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993764 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:21:36] (03Abandoned) 10Bking: cloudelastic: apply cloudelastic role to canary [puppet] - 10https://gerrit.wikimedia.org/r/993148 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:21:51] (03Abandoned) 10Bking: cloudelastic: use CFSSL for TLS on canary [puppet] - 10https://gerrit.wikimedia.org/r/993103 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:22:06] (03CR) 10EoghanGaffney: [C: 03+2] [phabricator] Fix commenting on tasks by email [puppet] - 10https://gerrit.wikimedia.org/r/993759 (https://phabricator.wikimedia.org/T356077) (owner: 10EoghanGaffney) [19:24:13] (03PS1) 10Zabe: Start reading from af_actor/afh_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993765 (https://phabricator.wikimedia.org/T355616) [19:25:16] jouncebot: nowandnext [19:25:16] No deployments scheduled for the next 1 hour(s) and 34 minute(s) [19:25:16] In 1 hour(s) and 34 minute(s): UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T2100) [19:26:03] (03CR) 10Zabe: [C: 03+2] Start reading from af_actor/afh_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993765 (https://phabricator.wikimedia.org/T355616) (owner: 10Zabe) [19:26:50] (03Merged) 10jenkins-bot: Start reading from af_actor/afh_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993765 (https://phabricator.wikimedia.org/T355616) (owner: 10Zabe) [19:27:08] !log zabe@deploy2002 Started scap: Backport for [[gerrit:993765|Start reading from af_actor/afh_actor everywhere (T355616)]] [19:27:14] T355616: Start reading from af_actor/afh_actor - https://phabricator.wikimedia.org/T355616 [19:28:31] !log zabe@deploy2002 zabe: Backport for [[gerrit:993765|Start reading from af_actor/afh_actor everywhere (T355616)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:29:58] !log zabe@deploy2002 zabe: Continuing with sync [19:31:10] (03CR) 10Dzahn: [C: 03+2] contint: Remove obsolete firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/993072 (owner: 10Muehlenhoff) [19:32:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T355609)', diff saved to https://phabricator.wikimedia.org/P55838 and previous config saved to /var/cache/conftool/dbconfig/20240129-193254-marostegui.json [19:32:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [19:33:00] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:33:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [19:33:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T355609)', diff saved to https://phabricator.wikimedia.org/P55839 and previous config saved to /var/cache/conftool/dbconfig/20240129-193317-marostegui.json 
[19:36:18] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:993765|Start reading from af_actor/afh_actor everywhere (T355616)]] (duration: 09m 09s) [19:36:25] T355616: Start reading from af_actor/afh_actor - https://phabricator.wikimedia.org/T355616 [19:41:25] (03PS1) 10Ebernhardson: cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) [19:41:45] (03CR) 10CI reject: [V: 04-1] cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [19:42:08] (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: Add migration canary to cloudelastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/993764 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:42:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T355609)', diff saved to https://phabricator.wikimedia.org/P55840 and previous config saved to /var/cache/conftool/dbconfig/20240129-194218-marostegui.json [19:42:24] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:43:25] (03PS2) 10Ebernhardson: cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) [19:53:45] (03PS3) 10Ebernhardson: cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) [19:57:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P55841 and previous config saved to /var/cache/conftool/dbconfig/20240129-195725-marostegui.json [20:01:16] (03CR) 10Gehel: [C: 03+1] "LGTM, worst case is probably either the server does not join the cluster at all (our 
pybal check should remove that server from rotation i" [puppet] - 10https://gerrit.wikimedia.org/r/993764 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:12:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P55842 and previous config saved to /var/cache/conftool/dbconfig/20240129-201233-marostegui.json [20:27:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T355609)', diff saved to https://phabricator.wikimedia.org/P55843 and previous config saved to /var/cache/conftool/dbconfig/20240129-202740-marostegui.json [20:27:46] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:29:17] (03PS4) 10Ebernhardson: cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) [20:31:19] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [20:32:17] (03Merged) 10jenkins-bot: cirrus updater: Apply consumer throughput configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993788 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [20:33:41] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:33:49] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:37:11] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:37:19] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:45:06] (03PS2) 10Eevans: cassandra: create template for aqsloader role & grants [puppet] - 10https://gerrit.wikimedia.org/r/993102 
(https://phabricator.wikimedia.org/T355917) [20:50:21] (03PS2) 10Jdlrobson: Begin capturing errors for Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992931 [20:50:26] (03PS3) 10Jdlrobson: Use desktop history page HTML everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) [20:50:35] (03PS3) 10Jdlrobson: Begin capturing errors for Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992931 [20:51:40] (03CR) 10Eevans: cassandra: create template for aqsloader role & grants (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans) [20:53:54] (03PS1) 10JHathaway: reposync: don't enforce ownership after init [puppet] - 10https://gerrit.wikimedia.org/r/993797 [20:56:25] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T2100) [21:00:04] ebernhardson and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] \o [21:00:17] o/ [21:00:44] I can deploy in a minute, just heating up lunch [21:01:28] present [21:01:39] Looks like we just have config deployments in this window anyway, so it seems we wouldn't be on a time crunch. 
[21:07:43] (03PS4) 10Catrope: cirrus: Disable cloudelastic writes to testwiki and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:08:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:08:53] (03Merged) 10jenkins-bot: cirrus: Disable cloudelastic writes to testwiki and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:09:05] !log catrope@deploy2002 Started scap: Backport for [[gerrit:992974|cirrus: Disable cloudelastic writes to testwiki and mw.org (T352335)]] [21:09:11] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335 [21:10:27] !log catrope@deploy2002 ebernhardson and catrope: Backport for [[gerrit:992974|cirrus: Disable cloudelastic writes to testwiki and mw.org (T352335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:57] ebernhardson: Please test on the mwdebug servers (if possible/applicable) and let me know whether to proceed [21:11:11] RoanKattouw: it only changes job runner stuff, go ahead and proceed [21:11:21] !log catrope@deploy2002 ebernhardson and catrope: Continuing with sync [21:12:14] (03PS1) 10Brennen Bearnes: phabricator: tools: install python3-pymsql for public_task_dump.py [puppet] - 10https://gerrit.wikimedia.org/r/993799 (https://phabricator.wikimedia.org/T355574) [21:17:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:17:46] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:992974|cirrus: Disable cloudelastic writes to testwiki and mw.org (T352335)]] (duration: 08m 40s) [21:17:51] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335 [21:22:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:23:12] (03PS2) 10Catrope: DiscussionTools: Enable permalinks frontend everywhere except en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993709 (https://phabricator.wikimedia.org/T356063) (owner: 10Esanders) [21:23:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993709 (https://phabricator.wikimedia.org/T356063) (owner: 10Esanders) [21:24:03] (03CR) 10Catrope: [C: 03+1] foreachwikiindblist: Return early when no arg is passed [puppet] - 10https://gerrit.wikimedia.org/r/992263 (owner: 10Zabe) [21:24:16] (03Merged) 10jenkins-bot: DiscussionTools: Enable permalinks frontend everywhere except en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993709 (https://phabricator.wikimedia.org/T356063) (owner: 10Esanders) [21:24:28] !log catrope@deploy2002 Started scap: Backport for [[gerrit:993709|DiscussionTools: Enable permalinks frontend everywhere except en.wiki (T356063)]] [21:24:33] T356063: Deploy talk page permalinks to all wikis except en.wiki - https://phabricator.wikimedia.org/T356063 [21:25:48] !log catrope@deploy2002 catrope and esanders: Backport for [[gerrit:993709|DiscussionTools: Enable permalinks frontend 
everywhere except en.wiki (T356063)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:28] Kemayo: Sorry forgot to ping: your patch is now ready for testing [21:28:45] (the bot pinged Ed instead because he authored the patch) [21:28:49] RoanKattouw: I will check into it. [21:29:55] RoanKattouw: It's working fine, sync away [21:30:25] !log catrope@deploy2002 catrope and esanders: Continuing with sync [21:30:48] (03CR) 10Dzahn: [C: 03+2] phabricator: tools: install python3-pymsql for public_task_dump.py [puppet] - 10https://gerrit.wikimedia.org/r/993799 (https://phabricator.wikimedia.org/T355574) (owner: 10Brennen Bearnes) [21:32:00] RoanKattouw: have I got 7 minutes to make a coffee? [21:32:46] Yes go for it [21:33:30] (03CR) 10Dzahn: [C: 03+2] "there is a typo in the package name. following up to fix it. python3-pymysql" [puppet] - 10https://gerrit.wikimedia.org/r/993799 (https://phabricator.wikimedia.org/T355574) (owner: 10Brennen Bearnes) [21:34:37] (03CR) 10Brennen Bearnes: "Gah, sorry about that." 
[puppet] - 10https://gerrit.wikimedia.org/r/993799 (https://phabricator.wikimedia.org/T355574) (owner: 10Brennen Bearnes) [21:35:25] (03PS1) 10Dzahn: phabricator: fix typo in python3-pymysql package name [puppet] - 10https://gerrit.wikimedia.org/r/993801 (https://phabricator.wikimedia.org/T355574) [21:35:46] (03CR) 10Dzahn: [C: 03+2] phabricator: fix typo in python3-pymysql package name [puppet] - 10https://gerrit.wikimedia.org/r/993801 (https://phabricator.wikimedia.org/T355574) (owner: 10Dzahn) [21:35:58] (03CR) 10Dzahn: [V: 03+2 C: 03+2] phabricator: fix typo in python3-pymysql package name [puppet] - 10https://gerrit.wikimedia.org/r/993801 (https://phabricator.wikimedia.org/T355574) (owner: 10Dzahn) [21:36:47] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:993709|DiscussionTools: Enable permalinks frontend everywhere except en.wiki (T356063)]] (duration: 12m 19s) [21:36:52] T356063: Deploy talk page permalinks to all wikis except en.wiki - https://phabricator.wikimedia.org/T356063 [21:37:35] (03PS4) 10Catrope: Use desktop history page HTML everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:38:05] Jdlrobson: Ready to start your patches whenever, ping me when you're back/ready [21:38:56] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [21:39:09] (03CR) 10Dzahn: [C: 03+2] "installed now on phab servers. feel free to test. 
ii python3-pymysql" [puppet] - 10https://gerrit.wikimedia.org/r/993799 (https://phabricator.wikimedia.org/T355574) (owner: 10Brennen Bearnes) [21:40:38] RoanKattouw: yep here [21:40:42] and you can push them out together [21:40:50] (03CR) 10Catrope: [C: 03+2] Use desktop history page HTML everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:40:56] (03PS4) 10Catrope: Begin capturing errors for Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992931 (owner: 10Jdlrobson) [21:41:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:41:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992931 (owner: 10Jdlrobson) [21:41:39] (03Merged) 10jenkins-bot: Use desktop history page HTML everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991424 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [21:41:58] (03Merged) 10jenkins-bot: Begin capturing errors for Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992931 (owner: 10Jdlrobson) [21:42:14] !log catrope@deploy2002 Started scap: Backport for [[gerrit:991424|Use desktop history page HTML everywhere (T353388)]], [[gerrit:992931|Begin capturing errors for Wikivoyage]] [21:42:19] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [21:43:34] !log catrope@deploy2002 catrope and jdlrobson: Backport for [[gerrit:991424|Use desktop history page HTML everywhere (T353388)]], [[gerrit:992931|Begin capturing errors for Wikivoyage]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:43:44] Jdlrobson: Please test on the mwdebug servers [21:45:39] (ProbeDown) firing: (6) 
Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:52] RoanKattouw: looking now [21:47:59] RoanKattouw: yep you can merge that. [21:48:04] !log catrope@deploy2002 catrope and jdlrobson: Continuing with sync [21:52:37] (03PS1) 10BCornwall: ncredir: Set fifo_log_demux/nginx as wanted_by [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [21:54:20] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:991424|Use desktop history page HTML everywhere (T353388)]], [[gerrit:992931|Begin capturing errors for Wikivoyage]] (duration: 12m 05s) [21:54:25] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [21:54:32] Alright that's it, all done [21:56:24] Thanks RoanKattouw [21:58:26] RoanKattouw: ah one follow up if you have the time? [21:58:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10amastilovic) Hi @MoritzMuehlenhoff , I need access to the following (from the wiki page you provided): LDAP membership in the wmf or nda LDAP group.... [21:58:29] Otherwise I can do it tomorrow [21:58:36] I forgot to remove the enwiki config :) [21:58:51] Sure, no problem [21:59:20] (03PS1) 10Jdlrobson: Drop English Wikipedia configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993805 (https://phabricator.wikimedia.org/T353388) [21:59:32] ^ RoanKattouw I can put it on the calendar now [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240129T2200). 
[22:00:42] Please hold off on the security deployment window for 5ish more minutes, I have one last patch from the backport window to do [22:00:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993805 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [22:01:21] (03PS2) 10Catrope: Drop English Wikipedia configuration for wgMFUseDesktopSpecialHistoryPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993805 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [22:01:32] (03CR) 10Catrope: [C: 03+2] Drop English Wikipedia configuration for wgMFUseDesktopSpecialHistoryPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993805 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [22:01:41] (03CR) 10TrainBranchBot: "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993805 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [22:02:17] (03Merged) 10jenkins-bot: Drop English Wikipedia configuration for wgMFUseDesktopSpecialHistoryPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993805 (https://phabricator.wikimedia.org/T353388) (owner: 10Jdlrobson) [22:02:29] !log catrope@deploy2002 Started scap: Backport for [[gerrit:993805|Drop English Wikipedia configuration for wgMFUseDesktopSpecialHistoryPage (T353388)]] [22:02:34] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [22:02:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10MoritzMuehlenhoff) a:05Arnoldokoth→03Eevans @amastilovic Thanks. Reassigning to @Eevans as the current SRE on our weekly clinic duty. 
[22:03:48] !log catrope@deploy2002 catrope and jdlrobson: Backport for [[gerrit:993805|Drop English Wikipedia configuration for wgMFUseDesktopSpecialHistoryPage (T353388)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:24:25] Thanks RoanKattouw ! [22:24:37] Oh whoops I never finished it [22:24:39] !log catrope@deploy2002 catrope and jdlrobson: Continuing with sync [22:24:48] It was still stuck on the test server stage [22:25:20] (debug server is working fine!) [22:29:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Eevans) a:05Eevans→03ABran-WMF [22:29:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Eevans) >>! In T355606#9496444, @MoritzMuehlenhoff wrote: > @amastilovic Thanks. > > Reassigning to @Eevans as the current SRE on our weekly clinic d... 
[22:31:02] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:993805|Drop English Wikipedia configuration for wgMFUseDesktopSpecialHistoryPage (T353388)]] (duration: 28m 33s) [22:31:07] T353388: Enable desktop history HTML on mobile - https://phabricator.wikimedia.org/T353388 [22:31:26] OK, all done for real this time [22:32:22] yay [23:00:48] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1234/console" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [23:02:32] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1235/co" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [23:19:22] (03PS2) 10BCornwall: ncredir: Set fifo_log_demux/nginx as wanted_by [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [23:20:38] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1236/co" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall)