[00:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:39:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003039 [00:39:09] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003039 (owner: 10TrainBranchBot) [00:45:22] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "d"} and A:restbase and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [00:45:27] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [00:53:36] (03PS6) 10RLazarus: Helm chart for k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) [00:53:38] (03PS6) 10RLazarus: admin_ng: Install k8s-controller-sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) [01:00:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003039 (owner: 10TrainBranchBot) [01:00:52] (03CR) 10RLazarus: Helm chart for k8s-controller-sidecars (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: 
10RLazarus) [01:01:41] (03CR) 10RLazarus: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [01:06:02] (03CR) 10Cwhite: "Overall LGTM. Nonblocking questions inline." [puppet] - 10https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [01:06:49] (03CR) 10Cwhite: [C: 03+1] alert: Failover Icinga and Alertmanager to alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [01:07:02] (03CR) 10Cwhite: [C: 03+1] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [01:07:54] (03CR) 10Cwhite: [C: 03+1] alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [01:14:48] (KubernetesCalicoDown) firing: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:26:23] (03PS1) 10Eevans: Updated deployement targets [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003576 (https://phabricator.wikimedia.org/T353550) [01:31:19] (03PS2) 10Eevans: Updated deployement targets [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003576 (https://phabricator.wikimedia.org/T353550) [01:32:22] (03CR) 10Eevans: [V: 03+2 C: 03+2] Updated deployement targets [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003576 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [01:37:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "d"} and A:restbase and A:codfw: Restart to pickup logging 
jars — T353550 - eevans@cumin1002 [01:37:31] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [01:40:54] (03PS5) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) [01:41:12] (03CR) 10CI reject: [V: 04-1] Add helmfile for running MediaWiki one-off jobs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [01:43:49] (03PS3) 10RLazarus: mediawiki: Support one-off jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) [01:43:51] (03PS6) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) [01:46:04] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10Jdlrobson) [01:46:10] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM vrts1002.eqiad.wmnet [01:55:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1036.eqiad.wmnet with OS bullseye [01:56:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1036.eqiad.wmnet with OS bullseye [02:00:02] (03CR) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [02:08:33] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:11:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1036.eqiad.wmnet with reason: host reimage [02:14:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1036.eqiad.wmnet with reason: host reimage [02:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:27:07] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (10Superpes15) a:05Superpes15→03None [02:29:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:31:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:31:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
restbase1036.eqiad.wmnet with OS bullseye [02:31:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1036.eqiad.wmnet with OS bullseye completed: - restbase1036 (**PASS**) - D... [02:31:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10Jclark-ctr) 05Open→03Resolved [02:38:34] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:35] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:36:28] (SystemdUnitFailed) firing: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:29:08] PROBLEM - 
Backup freshness on backup1001 is CRITICAL: All failures: 2 (puppetserver2003, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:29:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:30:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:30:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P56822 and previous config saved to /var/cache/conftool/dbconfig/20240215-043047-ladsgroup.json [04:30:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:45:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P56823 and previous config saved to /var/cache/conftool/dbconfig/20240215-044554-ladsgroup.json [04:57:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:58:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:01:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P56824 and previous config saved to /var/cache/conftool/dbconfig/20240215-050101-ladsgroup.json [05:03:15] * kart_ updating cxserver.. 
[05:03:21] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [05:04:21] (03Merged) 10jenkins-bot: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [05:09:01] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:09:26] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:14:48] (KubernetesCalicoDown) firing: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:16:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P56825 and previous config saved to /var/cache/conftool/dbconfig/20240215-051607-ladsgroup.json [05:16:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [05:16:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [05:16:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1244:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56826 and previous config saved to /var/cache/conftool/dbconfig/20240215-051629-ladsgroup.json [05:38:43] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:39:12] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:39:52] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:40:25] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply 
[05:43:22] !log Update cxserver to 2023-12-04-083437-production (T344982, T338432, T351138) [05:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:30] T344982: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 [05:43:30] T338432: Prepare the cxserver for usage without RESTbase - https://phabricator.wikimedia.org/T338432 [05:43:30] T351138: Some articles with gallery fail to start for translation - https://phabricator.wikimedia.org/T351138 [05:43:32] Finally! [05:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:52:00] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:52:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:08:34] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:21:45] (SwiftTooManyMediaUploads) resolved: 
(2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:29:16] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T0700) [07:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T0700). [07:31:35] (03PS2) 10KartikMistry: cxserver: Remove all kademlia support from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/992744 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [07:32:32] (03CR) 10KartikMistry: "Do we need to bump chart version here?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992744 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [07:35:12] (SystemdUnitFailed) firing: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:40:31] (03CR) 10Slyngshede: [C: 03+2] Add logging to default uwsgi [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1002989 (owner: 10Slyngshede) [07:40:59] (03CR) 10Slyngshede: [C: 03+2] Bump Python version in CI from 3.7 to 3.9 minimum. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/998784 (owner: 10Slyngshede) [07:43:18] (03Merged) 10jenkins-bot: Add logging to default uwsgi [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1002989 (owner: 10Slyngshede) [07:43:20] (03Merged) 10jenkins-bot: Bump Python version in CI from 3.7 to 3.9 minimum. [software/bitu] - 10https://gerrit.wikimedia.org/r/998784 (owner: 10Slyngshede) [08:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T0800). [08:00:06] No Gerrit patches in the queue for this window AFAICS. [08:03:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:03:48] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:18:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: apifeatureusage::logstash [08:19:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:21:03] (03PS1) 10Muehlenhoff: Switch apufeatureusage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003590 (https://phabricator.wikimedia.org/T349619) [08:22:22] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 24792MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [08:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:24:49] godog: see now a disk space alert ^ [08:24:58] (03PS2) 10Muehlenhoff: Switch apufeatureusage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003590 (https://phabricator.wikimedia.org/T349619) [08:26:50] jynus: thank you, I'll take a look [08:27:58] (03CR) 10Muehlenhoff: [C: 03+2] Switch apufeatureusage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003590 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:28:57] should be recovering by itself FWIW [08:32:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: apifeatureusage::logstash [08:32:25] (03PS2) 10Muehlenhoff: Switch restbase1036 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003429 (https://phabricator.wikimedia.org/T349619) [08:35:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1036.eqiad.wmnet [08:35:57] (03CR) 10Muehlenhoff: [C: 03+2] Switch restbase1036 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003429 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:38:04] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:38:14] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:39:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host 
restbase1036.eqiad.wmnet [08:42:22] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [08:46:47] (03PS1) 10Majavah: P:toolforge::prometheus: prevent toolsbeta from paging [puppet] - 10https://gerrit.wikimedia.org/r/1003592 [08:48:35] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1371/console" [puppet] - 10https://gerrit.wikimedia.org/r/1003592 (owner: 10Majavah) [08:50:31] !log rebalance Ganeti codfw/A now that the switch maintenance for A5 and A6 are completed T355864 T355863 [08:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:37] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [08:50:38] T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 [08:51:29] (03PS1) 10Majavah: hieradata: pcc: update toolsbeta public key [puppet] - 10https://gerrit.wikimedia.org/r/1003593 [08:52:50] (03CR) 10Majavah: [C: 03+2] hieradata: pcc: update toolsbeta public key [puppet] - 10https://gerrit.wikimedia.org/r/1003593 (owner: 10Majavah) [08:55:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1003592 (owner: 10Majavah) [09:02:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:03:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:04:07] (03CR) 10Majavah: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1372/console" [puppet] - 10https://gerrit.wikimedia.org/r/1003593 (owner: 10Majavah) [09:05:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1003592 (owner: 10Majavah) [09:08:34] (SystemdUnitFailed) firing: user@11984.service on bast1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:48] (KubernetesCalicoDown) firing: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:29:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: eventlogging::analytics [09:33:34] (SystemdUnitFailed) resolved: user@11984.service on bast1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56827 and previous config saved to 
/var/cache/conftool/dbconfig/20240215-093850-ladsgroup.json [09:38:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:38:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: eventlogging::analytics [09:40:39] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [09:41:08] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [09:41:09] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [09:42:17] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [09:42:18] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:43:15] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:49:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host eventlog1003.eqiad.wmnet [09:51:00] RECOVERY - BFD status on lsw1-a4-codfw.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:51:10] RECOVERY - BFD status on lsw1-b7-codfw.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:53:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1003.eqiad.wmnet [09:53:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3314', diff saved to https://phabricator.wikimedia.org/P56829 and previous config saved to /var/cache/conftool/dbconfig/20240215-095356-ladsgroup.json [10:04:55] (SystemdUnitFailed) resolved: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:34] (PuppetZeroResources) firing: Puppet has 
failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:09:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3314', diff saved to https://phabricator.wikimedia.org/P56830 and previous config saved to /var/cache/conftool/dbconfig/20240215-100903-ladsgroup.json [10:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:23:04] RECOVERY - Check whether ferm is active by checking the default input chain on mw2282 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:24:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56831 and previous config saved to /var/cache/conftool/dbconfig/20240215-102409-ladsgroup.json [10:24:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [10:24:15] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:24:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [10:33:34] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is 
CRITICAL: CRITICAL - failed 162 probes of 739 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:37:32] !log running `homer 'cr*codfw*' commit 'T351074'` for new k8s nodes [10:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:38] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [10:38:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 44 probes of 739 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:48:44] jouncebot: nowandnext [10:48:44] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [10:48:44] In 0 hour(s) and 11 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1100) [10:48:45] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1100) [10:52:52] * Lucas_WMDE deploying something [10:53:03] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="OGPawlis" . 
# T357605 [10:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:14] T357605: Server side upload for OGPawlis - https://phabricator.wikimedia.org/T357605 [10:55:50] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:56:26] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1003475|Revert "Include article name in Ploticus error messages" (T357268)]], [[gerrit:1003476|Revert "Include article name in Ploticus error messages" (T357268)]] [10:56:31] T357268: Image of timeline not found on upload.wikimedia.org - https://phabricator.wikimedia.org/T357268 [10:58:06] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:1003475|Revert "Include article name in Ploticus error messages" (T357268)]], [[gerrit:1003476|Revert "Include article name in Ploticus error messages" (T357268)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:00:05] mvolz: Your horoscope predicts another Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1100). 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1100) [11:00:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Continuing with sync [11:00:23] * Lucas_WMDE still deploying for a few more minutes [11:07:09] !log zabe@mwmaint2002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=mkwiki --logwiki=metawiki 'CatCat' 'MonkeyPython' # T357602 [11:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:14] T357602: Unblock stuck global rename of MonkeyPython - https://phabricator.wikimedia.org/T357602 [11:07:26] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1003475|Revert "Include article name in Ploticus error messages" (T357268)]], [[gerrit:1003476|Revert "Include article name in Ploticus error messages" (T357268)]] (duration: 10m 59s) [11:07:31] T357268: Image of timeline not found on upload.wikimedia.org - https://phabricator.wikimedia.org/T357268 [11:07:33] * Lucas_WMDE done [11:10:53] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reboot-single for host mw2379.codfw.wmnet [11:14:46] Lucas_WMDE: thanks for deploying :) [11:18:31] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2379.codfw.wmnet [11:20:25] (SystemdUnitFailed) firing: ferm.service on mw2379:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:22:14] PROBLEM - Check whether ferm is active by checking the default input chain on mw2379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:25:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with 
reason: Maintenance [11:25:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [11:25:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T352010)', diff saved to https://phabricator.wikimedia.org/P56834 and previous config saved to /var/cache/conftool/dbconfig/20240215-112535-ladsgroup.json [11:25:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:30:13] (03CR) 10Fabfur: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [11:30:18] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddumps1002.wikimedia.org [11:31:38] (03CR) 10Fabfur: [C: 03+1] "seems ok to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1003384 (https://phabricator.wikimedia.org/T357483) (owner: 10Hnowlan) [11:36:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_test_cluster::hadoop::ui [11:37:49] (03PS1) 10Muehlenhoff: Switch hadoop_ui/test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003611 (https://phabricator.wikimedia.org/T349619) [11:38:39] (03CR) 10Hnowlan: [C: 03+2] admin: update ssh key for rkhan [puppet] - 10https://gerrit.wikimedia.org/r/1003384 (https://phabricator.wikimedia.org/T357483) (owner: 10Hnowlan) [11:39:29] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1002.wikimedia.org [11:41:02] (03PS1) 10KartikMistry: Update cxserver to 2024-02-15-085232-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003612 (https://phabricator.wikimedia.org/T333969) [11:42:10] (03PS1) 10Majavah: Reapply "Failover dumps to clouddumps1002" [dns] - 10https://gerrit.wikimedia.org/r/1003613 (https://phabricator.wikimedia.org/T321313) [11:42:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Updating access key - rkhan - 
https://phabricator.wikimedia.org/T357483 (10hnowlan) 05Open→03Resolved a:03hnowlan Done [11:43:45] (03PS1) 10Majavah: hieradata: failover dumps web to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1003614 (https://phabricator.wikimedia.org/T321313) [11:44:05] (03CR) 10Majavah: [C: 03+2] Reapply "Failover dumps to clouddumps1002" [dns] - 10https://gerrit.wikimedia.org/r/1003613 (https://phabricator.wikimedia.org/T321313) (owner: 10Majavah) [11:44:35] (03PS2) 10KartikMistry: Update MinT to 2024-02-15-085232-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/995170 (https://phabricator.wikimedia.org/T354666) [11:45:02] (03CR) 10Majavah: [C: 03+2] hieradata: failover dumps web to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1003614 (https://phabricator.wikimedia.org/T321313) (owner: 10Majavah) [11:45:08] (03CR) 10KartikMistry: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992744 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [11:45:25] (SystemdUnitFailed) resolved: ferm.service on mw2379:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:11] (03CR) 10KartikMistry: [C: 03+1] "+1. This PS needs manual rebase on the top of latest cxserver deployment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [11:47:41] (03CR) 10Clément Goubert: [C: 03+1] mw-jobrunner: bump replicas for cirrusSearchLinksUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003499 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:48:17] (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 45% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003393 (https://phabricator.wikimedia.org/T357507) (owner: 10Clément Goubert) [11:48:56] 10SRE, 10Infrastructure-Foundations, 10netops: BGP peering from LSW to K8s hosts using loopback IP not IRB - https://phabricator.wikimedia.org/T357619 (10cmooney) Just a bit more background, I discovered this looking at a tcpdump, this is //lsw1-a4-codfw// trying to establish BGP to //mw2383//: ` 11:10:04.59... [11:49:04] (03CR) 10Hnowlan: [C: 03+1] trafficserver: move 45% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1003394 (https://phabricator.wikimedia.org/T357507) (owner: 10Clément Goubert) [11:50:08] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1031: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1003616 (https://phabricator.wikimedia.org/T319184) [11:50:15] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 45% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003393 (https://phabricator.wikimedia.org/T357507) (owner: 10Clément Goubert) [11:50:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch hadoop_ui/test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003611 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:50:26] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: bump replicas for cirrusSearchLinksUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003499 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:50:39] (03CR) 10Arturo Borrero 
Gonzalez: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003616 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:51:20] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 45% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003393 (https://phabricator.wikimedia.org/T357507) (owner: 10Clément Goubert) [11:51:25] (03Merged) 10jenkins-bot: mw-jobrunner: bump replicas for cirrusSearchLinksUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003499 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:52:14] RECOVERY - Check whether ferm is active by checking the default input chain on mw2379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:55:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_test_cluster::hadoop::ui [11:56:22] !log cgoubert@deploy2002 Started scap: Deploying mw-on-k8s 1003499 1003393 - T349796 T357507 [11:56:28] T349796: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 [11:56:28] T357507: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507 [11:57:12] !log cgoubert@deploy2002 Finished scap: Deploying mw-on-k8s 1003499 1003393 - T349796 T357507 (duration: 00m 50s) [11:59:00] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 45% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1003394 (https://phabricator.wikimedia.org/T357507) (owner: 10Clément Goubert) [11:59:42] !log Bumping external traffic to mw-on-k8s to 45% - T357507 [11:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:36] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003616 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:01:44] (03PS1) 10Hnowlan: jobqueue: migrate 
cirrusSearchLinksUpdate to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003617 (https://phabricator.wikimedia.org/T349796) [12:04:23] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected." [puppet] - 10https://gerrit.wikimedia.org/r/1003616 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:05:10] (03PS1) 10Cathal Mooney: Modify K8s BGP groups to only enable multihop on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1003619 (https://phabricator.wikimedia.org/T357619) [12:05:42] (03CR) 10Majavah: [C: 03+1] cloudvirt1031: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1003616 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:06:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet [12:07:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:09:05] (03CR) 10Clément Goubert: [C: 03+1] jobqueue: migrate cirrusSearchLinksUpdate to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003617 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:10:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet [12:10:10] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudvirt1031: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1003616 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:11:32] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1031.eqiad.wmnet with OS bookworm [12:11:44] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
aborrero@cumin1002 for host cloudvirt1031.eqiad.wmnet with OS bookworm [12:12:18] (03PS3) 10Muehlenhoff: Advertise puppetserver2003 as active Puppet 7 server [dns] - 10https://gerrit.wikimedia.org/r/1003403 (https://phabricator.wikimedia.org/T356991) [12:17:59] !log installing Linux 5.10.209 on Bullseye hosts [12:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:19:16] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:19:33] (03CR) 10Ayounsi: [C: 03+1] "Approach and standardization makes sense to me, let's be careful in the deployment as there are many changes at once." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1003619 (https://phabricator.wikimedia.org/T357619) (owner: 10Cathal Mooney) [12:21:08] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reboot-single for host mw2379.codfw.wmnet [12:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:30:00] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [12:32:41] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [12:34:03] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host mw2379.codfw.wmnet [12:35:06] (03CR) 10Jelto: [C: 03+2] etherpad: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/1003492 (owner: 10Dzahn) [12:35:11] (03CR) 10Jelto: [V: 03+1 C: 03+2] etherpad: add $service_ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [12:35:25] (SystemdUnitFailed) firing: ferm.service on mw2379:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:28] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [12:36:46] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: run Thanos components in a systemd slice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [12:46:19] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "grafana: Ensure the grafana2001 hosts uses Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/1003469 (owner: 10Andrea Denisse) [12:46:25] (03CR) 10Filippo Giunchedi: [C: 03+1] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [12:47:21] (03CR) 10Filippo Giunchedi: [C: 03+1] alert: Failover Icinga and Alertmanager to alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [12:48:34] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:49:01] (03CR) 10Filippo Giunchedi: "I think for the reimage we should stay on puppet 5, after the reimage is done we run the puppet 7 migration cookbook. For grafana we ran t" [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [12:49:03] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) thanks @Jclark-ctr, unfortunately, that does not really help a lot, and does not answer any of the questio... 
[12:50:48] (03CR) 10Muehlenhoff: "Yeah, let's just untangle them, there are enough moving parts already. When the alert hosts are fully on Bookworm we can start by migratio" [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [12:53:44] (03PS4) 10Ayounsi: Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) [12:55:00] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144 (10MoritzMuehlenhoff) [12:59:01] !incidents [12:59:01] 4442 (RESOLVED) [2x] NELHigh sre (tcp.timed_out) [12:59:01] 4441 (RESOLVED) [3x] ProbeDown sre (text-https:443 probes/service eqsin) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1300) [13:00:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1375/co" [puppet] - 10https://gerrit.wikimedia.org/r/1003420 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [13:00:25] (SystemdUnitFailed) resolved: ferm.service on mw2379:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:59] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1031.eqiad.wmnet with OS bookworm [13:01:11] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1031.eqiad.wmnet with OS bookworm com... 
[13:03:47] (03CR) 10Andrea Denisse: "Good idea, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [13:03:52] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:03:53] (03CR) 10Muehlenhoff: [C: 03+2] Advertise puppetserver2003 as active Puppet 7 server [dns] - 10https://gerrit.wikimedia.org/r/1003403 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [13:04:13] (03Abandoned) 10Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [13:04:30] (03CR) 10Andrea Denisse: [C: 03+2] Revert "grafana: Ensure the grafana2001 hosts uses Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/1003469 (owner: 10Andrea Denisse) [13:04:41] (03PS11) 10Slyngshede: puppet-merge: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:07:35] (03PS2) 10Superpes15: [rowiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002171 (https://phabricator.wikimedia.org/T355990) [13:07:41] 10SRE, 10observability, 10Patch-For-Review, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10fgiunchedi) Now Thanos services run in their own slice, which should help with enforcing resource limits. WRT smo... 
[13:08:42] (03PS1) 10Brouberol: superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) [13:08:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1376/co" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:09:15] (03Abandoned) 10Andrea Denisse: alert: Ensure the alert1001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [13:11:21] (03CR) 10Ayounsi: Add routed ganeti support to late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003464 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:12:04] (03PS1) 10Brouberol: idp: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1003749 (https://phabricator.wikimedia.org/T353794) [13:12:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1377/co" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:13:16] (03PS2) 10Brouberol: idp: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1003749 (https://phabricator.wikimedia.org/T353794) [13:14:48] (KubernetesCalicoDown) firing: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:15:08] (03PS1) 10Muehlenhoff: Revert "Advertise puppetserver2003 as active Puppet 7 server" [dns] - 10https://gerrit.wikimedia.org/r/1003750 [13:16:24] (03CR) 10Slyngshede: [V: 03+1] "My suggestion is that we abandon 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/978049/3, and the just use this CR, which I've updat" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [13:16:37] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Advertise puppetserver2003 as active Puppet 7 server" [dns] - 10https://gerrit.wikimedia.org/r/1003750 (owner: 10Muehlenhoff) [13:17:35] (03PS2) 10Brouberol: superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) [13:20:10] (03CR) 10Ayounsi: Add support for routed Ganeti in D-I early_command.sh (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:22:44] (03PS1) 10Ladsgroup: cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 [13:23:44] (03PS4) 10Ayounsi: Add support for routed Ganeti in D-I early_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) [13:26:44] (03PS2) 10Ladsgroup: cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 [13:26:46] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:26:54] (03CR) 10Muehlenhoff: "That's not enough: In Bullseye Python 2 isn't covered by security support (it was only included since the Chromium build systems needed it" [puppet] - 10https://gerrit.wikimedia.org/r/1003526 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [13:28:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1379/co" [puppet] - 10https://gerrit.wikimedia.org/r/1003753 (owner: 10Ladsgroup) [13:36:00] (03PS3) 10Brouberol: idp: Register superset and 
superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1003749 (https://phabricator.wikimedia.org/T353794) [13:38:02] (03PS2) 10Eevans: cassandra: install git-fat to satisfy scap requirement [puppet] - 10https://gerrit.wikimedia.org/r/1003526 (https://phabricator.wikimedia.org/T353550) [13:38:31] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1003749 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [13:38:34] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:38:48] (03CR) 10Brouberol: [C: 03+2] idp: Register superset and superset-next IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1003749 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [13:39:53] (03PS2) 10Ayounsi: Add routed ganeti support to late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1003464 (https://phabricator.wikimedia.org/T300152) [13:40:00] (03PS3) 10Ladsgroup: cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 [13:41:37] (03CR) 10CI reject: [V: 04-1] cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 (owner: 10Ladsgroup) [13:41:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:04] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:43:02] (03CR) 10Ayounsi: [C: 03+2] Add routed ganeti support to 
late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003464 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:43:28] (03PS4) 10Ladsgroup: cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 [13:46:04] (03PS5) 10Ladsgroup: cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 [13:46:28] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10User-aborrero: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631 (10aborrero) [13:48:18] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS bookworm [13:50:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1003753 (owner: 10Ladsgroup) [13:50:56] (03CR) 10Ayounsi: Routed Ganeti: Add v6 static route to VM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:51:54] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:52:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:56:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) I had enabled puppetserver2003 but there is still some missing bit in the setup: When adding puppetserver200... 
[13:56:30] (03CR) 10Majavah: [V: 03+1 C: 03+1] cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 (owner: 10Ladsgroup) [13:59:42] (03PS1) 10Superpes15: [commonswiki] Add an editautopatrolprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1400). [14:00:05] Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] Hi :) [14:00:22] might be 20 minutes or so before I can deploy [14:01:44] No rush! In the meantime I can lunch (I'm lunching so late lol) and create a new patch :) [14:02:24] (03PS3) 10Ladsgroup: mail: Add wikimail routing to wmcs as well [puppet] - 10https://gerrit.wikimedia.org/r/1003420 (https://phabricator.wikimedia.org/T343925) [14:02:28] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mail: Add wikimail routing to wmcs as well [puppet] - 10https://gerrit.wikimedia.org/r/1003420 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [14:02:37] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [14:02:39] (03PS2) 10Superpes15: [commonswiki] Add an editautopatrolprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) [14:02:46] (03PS3) 10Ladsgroup: exim: Avoid considering wikimedia domains as local in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) [14:02:50] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] exim: Avoid considering 
wikimedia domains as local in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [14:03:04] (03PS6) 10Ladsgroup: cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 [14:03:11] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] cloud: Remove mediawiki_smarthosts from all of WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003753 (owner: 10Ladsgroup) [14:04:19] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10User-aborrero: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631 (10aborrero) https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html [14:05:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [14:05:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [14:06:03] alright, I can deploy now [14:06:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [14:06:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56836 and previous config saved to /var/cache/conftool/dbconfig/20240215-140613-ladsgroup.json [14:06:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:07:18] (03PS1) 10Slyngshede: Puppetmaster: Alert when unmerged changes exists in Puppet repo. 
[alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) [14:07:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002171 (https://phabricator.wikimedia.org/T355990) (owner: 10Superpes15) [14:07:53] Superpes: are you ready now or should I wait for you to finish lunch? :) [14:08:34] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:09:09] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1381/console" [puppet] - 10https://gerrit.wikimedia.org/r/1003496 (owner: 10Dzahn) [14:09:31] (03PS5) 10Ayounsi: Add support for routed Ganeti in D-I early_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) [14:11:29] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [14:11:37] (03PS3) 10Brouberol: superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) [14:12:12] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10cmooney) 05Open→03Resolved a:03cmooney All looking good, closing task. Thanks everyone for their assistance. 
[14:14:14] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) Surprisingly, after the revert of the SVC change I can't reproduce this by explicitly telling ping2003 to use... [14:14:50] (03CR) 10Ayounsi: [C: 03+2] don't require a cable ID on planned cables [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1003371 (https://phabricator.wikimedia.org/T357259) (owner: 10Ayounsi) [14:15:07] (03PS4) 10Brouberol: superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) [14:15:38] (03Merged) 10jenkins-bot: don't require a cable ID on planned cables [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1003371 (https://phabricator.wikimedia.org/T357259) (owner: 10Ayounsi) [14:16:14] I'm here Lucas_WMDE with 2 new patches :D [14:16:41] can we start with the one I already reviewed? :P [14:16:55] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:17:06] (03PS1) 10Superpes15: [ruwikiquote] Add 'suppressredirect' right to editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003763 (https://phabricator.wikimedia.org/T357241) [14:17:11] Yep [14:17:15] (03CR) 10Jelto: [V: 03+1 C: 03+2] phabricator,etherpad: fix some puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1003496 (owner: 10Dzahn) [14:17:26] ok! 
[14:17:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002171 (https://phabricator.wikimedia.org/T355990) (owner: 10Superpes15) [14:17:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2005.codfw.wmnet with OS bookworm [14:18:06] (03Merged) 10jenkins-bot: [rowiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002171 (https://phabricator.wikimedia.org/T355990) (owner: 10Superpes15) [14:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:18:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1003526 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [14:18:31] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1002171|[rowiki] Change autoconfirmed setting (T355990)]] [14:18:35] T355990: Set $wgAutoConfirmCount to 10 for Romanian Wikipedia - https://phabricator.wikimedia.org/T355990 [14:20:02] I added the other 2 patches on wikitech [14:20:15] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Backport for [[gerrit:1002171|[rowiki] Change autoconfirmed setting (T355990)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:20:46] (03PS5) 10Jaime Nuche: support Zuul v2 on bullseye contint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (https://phabricator.wikimedia.org/T342346) [14:20:52] Superpes: anything to test? 
[14:21:03] I guess you don’t really want to create an account and make ten edits just to see if you become autoconfirmed or not [14:21:18] Nope lol [14:21:38] Also because we should wait 4 days lmao [14:21:53] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync [14:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:23:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:23:39] (03CR) 10Muehlenhoff: "Looks good. Shall I merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:24:25] (03CR) 10Jaime Nuche: "@Muehlenhoff we need someone with merge permissions for this change. Can we ask you to +2 it?" [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:25:29] (03CR) 10Muehlenhoff: [C: 03+2] support Zuul v2 on bullseye contint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:27:48] 10SRE, 10ops-codfw: PowerSupplyFailure - mw2389 - https://phabricator.wikimedia.org/T357377 (10Jhancock.wm) part received. returned borrowed part to inventory server and put the assumed newer PSU in the machine. no alerts. 
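Note on the 14:21 exchange above: the backport could not be tested on the spot because autoconfirmation depends on both an edit count and an account age. A minimal sketch of that rule follows, assuming the thresholds mentioned in the log (ten edits from T355990, four days from the "wait 4 days" remark). The function and constant names are hypothetical and nothing here is read from the real wiki configuration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative values only: the count comes from T355990 and the age from the
# "wait 4 days" remark in the log; neither is read from a real configuration.
AUTOCONFIRM_COUNT = 10
AUTOCONFIRM_AGE = timedelta(days=4)


def is_autoconfirmed(registered_at, edit_count, now):
    """Return True once both the account-age and edit-count thresholds are met."""
    old_enough = now - registered_at >= AUTOCONFIRM_AGE
    enough_edits = edit_count >= AUTOCONFIRM_COUNT
    return old_enough and enough_edits


if __name__ == "__main__":
    now = datetime(2024, 2, 15, 14, 21, tzinfo=timezone.utc)
    # A brand-new account with ten edits still fails the age threshold.
    print(is_autoconfirmed(now, 10, now))
```

Under these assumptions a freshly created test account would stay unconfirmed for four days regardless of how many edits it made, which matches the decision to continue the sync without a manual test.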
[14:28:25] (03PS2) 10Superpes15: [ruwikiquote] Add 'suppressredirect' right to editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003763 (https://phabricator.wikimedia.org/T357241) [14:28:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] Modify K8s BGP groups to only enable multihop on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1003619 (https://phabricator.wikimedia.org/T357619) (owner: 10Cathal Mooney) [14:28:54] 10SRE, 10ops-codfw: PowerSupplyFailure - mw2389 - https://phabricator.wikimedia.org/T357377 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:29:01] (03CR) 10Muehlenhoff: [C: 03+2] "Sure, merged." [puppet] - 10https://gerrit.wikimedia.org/r/1002461 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:29:27] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1002171|[rowiki] Change autoconfirmed setting (T355990)]] (duration: 10m 55s) [14:29:31] T355990: Set $wgAutoConfirmCount to 10 for Romanian Wikipedia - https://phabricator.wikimedia.org/T355990 [14:30:01] (03CR) 10Jaime Nuche: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1002461 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:31:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Remove all kademlia support from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/992744 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [14:32:52] (03Merged) 10jenkins-bot: cxserver: Remove all kademlia support from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/992744 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [14:33:31] (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate cirrusSearchLinksUpdate to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003617 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [14:34:26] (03Merged) 10jenkins-bot: jobqueue: migrate cirrusSearchLinksUpdate to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003617 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [14:35:21] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:35:39] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:35:44] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:35:47] (03CR) 10Lucas Werkmeister (WMDE): [commonswiki] Add an editautopatrolprotected level protection (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:36:26] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:36:41] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:36:55] !log migrating cirrusSearchLinksUpdate to k8s [14:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:13] (03PS3) 10Lucas Werkmeister (WMDE): [ruwikiquote] Add 
'suppressredirect' right to editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003763 (https://phabricator.wikimedia.org/T357241) (owner: 10Superpes15) [14:37:16] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:37:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003763 (https://phabricator.wikimedia.org/T357241) (owner: 10Superpes15) [14:38:04] (03Merged) 10jenkins-bot: [ruwikiquote] Add 'suppressredirect' right to editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003763 (https://phabricator.wikimedia.org/T357241) (owner: 10Superpes15) [14:38:29] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1003763|[ruwikiquote] Add 'suppressredirect' right to editors (T357241)]] [14:38:34] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:34] T357241: Russian Wikiquote needs ''suppressredirect'' right for ''editor'' group - https://phabricator.wikimedia.org/T357241 [14:40:00] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Backport for [[gerrit:1003763|[ruwikiquote] Add 'suppressredirect' right to editors (T357241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:40:25] Tested! It works [14:40:43] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync [14:40:47] ok! [14:41:11] meanwhile I’ll try to understand the commons change [14:43:03] Yep just a new protection level (only-autopatrolled based) :) [14:44:20] ok, and the protection level etc. 
is already translated via WikimediaMessages, it seems [14:44:34] because some other wikis have them, I see [14:44:53] Yep exactly! [14:45:01] ok, ok [14:45:07] and it looks like those other wikis don’t set anything else either [14:45:27] (I was half looking for the part that configures which other groups can assign the group – but this isn’t a new group, it’s a new restriction level) [14:45:28] Well I only set wgSemiprotectedRestrictionLevels on itwiki [14:45:45] Lucas_WMDE: Can you ping me at the end of the deployments so I can depool hosts for the network migrations, please? [14:45:50] claime: ok [14:45:54] ty <3 [14:45:56] To show the semiprotection message instead of full protection! But on commons is not required [14:46:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:46:20] (03CR) 10Lucas Werkmeister (WMDE): "Otherwise this looks good to me, I think. 
(Several other wikis already have this protection level, with no additional config needed as far" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:46:36] Superpes: okay, please still update the change for the comments I left :) [14:46:48] (03PS3) 10Alexandros Kosiaris: cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686) [14:47:00] (03PS1) 10Jelto: etherpad: install mariadb server in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1003769 (https://phabricator.wikimedia.org/T316421) [14:47:07] (if wrapping the commit message is too cumbersome, I can live without it, but I’d prefer to have it wrapped) [14:47:56] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1003763|[ruwikiquote] Add 'suppressredirect' right to editors (T357241)]] (duration: 09m 26s) [14:48:01] (03PS3) 10Superpes15: [commonswiki] Add an editautopatrolprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) [14:48:01] T357241: Russian Wikiquote needs ''suppressredirect'' right for ''editor'' group - https://phabricator.wikimedia.org/T357241 [14:48:33] (03CR) 10Superpes15: [commonswiki] Add an editautopatrolprotected level protection (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:49:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1382/" [puppet] - 10https://gerrit.wikimedia.org/r/1003769 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:49:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003369 
(https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [14:49:44] (03CR) 10Herron: [C: 03+1] alert: Failover Icinga and Alertmanager to alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [14:50:10] (03CR) 10Herron: [C: 03+1] alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [14:50:26] 10SRE, 10serviceops, 10SecTeam-Processed, 10Security, 10Vuln-Misconfiguration: Helm Chart misconfigurations - https://phabricator.wikimedia.org/T355167 (10akosiaris) 05In progress→03Resolved a:03akosiaris I 'll resolve this, all patches have been merged. [14:50:34] (03Merged) 10jenkins-bot: cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [14:51:00] (03CR) 10Lucas Werkmeister (WMDE): [commonswiki] Add an editautopatrolprotected level protection (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:51:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:51:40] Uhm looks weird for the // change [14:51:53] I saw it published wtf [14:55:01] idk what happened then [14:55:07] oh wait [14:55:38] (03CR) 10Jforrester: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [14:55:39] no, nevermind, it’s not what I thought it might bse [14:55:39] *be [14:56:31] Ah it's likely my internet... [14:56:39] Also on IRC I've problem in sending message... it's very slow atm! 
Re-trying [14:57:35] (03PS4) 10Superpes15: [commonswiki] Add an editautopatrolprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) [14:57:39] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) In theory if all those patches are merged/deployed, the VM will be using /32 IPs from early_command.sh all the way to its final state and... [14:58:10] jouncebot: next [14:58:10] In 1 hour(s) and 1 minute(s): WikimediaCampaignEvents extension deployment (task T347909) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1600) [14:58:23] For the commit message, it's not a problem, I just added some unnecessary infos and now I removed them :P [14:58:34] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:48] (03PS5) 10Lucas Werkmeister (WMDE): [commonswiki] Add an editautopatrolprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:58:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:59:12] Now I'm not able to open gerrit with my internet... 
so please confirm me that the change has been published [14:59:43] (03Merged) 10jenkins-bot: [commonswiki] Add an editautopatrolprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003760 (https://phabricator.wikimedia.org/T357298) (owner: 10Superpes15) [14:59:48] (03PS1) 10Hnowlan: mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003772 (https://phabricator.wikimedia.org/T349796) [14:59:53] Superpes: it has, yes [14:59:56] (see also wikibugs just now) [15:00:08] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1003760|[commonswiki] Add an editautopatrolprotected level protection (T357298)]] [15:01:44] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Backport for [[gerrit:1003760|[commonswiki] Add an editautopatrolprotected level protection (T357298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:02:46] * Lucas_WMDE tries to find out where the protection levels are exposed in the API [15:02:54] (I can see the new right in https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json&formatversion=2, at least) [15:03:18] It works :) [15:03:35] aha, siprop=restrictions [15:03:50] yup, looks good in the aPI [15:04:33] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync [15:04:57] Yep thanks for double checking :P [15:09:59] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 (10cscott) Seems fine, most parsoid traffic is probably happening on the main cluster these days anyway. We just want to make sure that scandium doesn't get... 
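Note on the 15:02 to 15:03 check above: the new protection level was verified through the siteinfo API, first with siprop=usergroups and then with siprop=restrictions. The sketch below rebuilds that query URL and checks a decoded response for a level. The helper names are hypothetical, the response layout (a "levels" list under "query" and "restrictions") is an assumption about the API output rather than something shown in the log, and the level name is taken from the commit title.

```python
from urllib.parse import urlencode

# Host and query parameters are the ones pasted into the channel at 15:02:54.
API_URL = "https://commons.wikimedia.org/w/api.php"


def siteinfo_url(siprop):
    """Build a siteinfo query URL for the given siprop value."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": siprop,
        "format": "json",
        "formatversion": 2,
    }
    return f"{API_URL}?{urlencode(params)}"


def has_restriction_level(response, level):
    """Check a decoded siprop=restrictions response for a protection level.

    Assumes the levels are listed under query.restrictions.levels.
    """
    restrictions = response.get("query", {}).get("restrictions", {})
    return level in restrictions.get("levels", [])


if __name__ == "__main__":
    print(siteinfo_url("restrictions"))
    # Hand-written sample response, not captured from the live API.
    sample = {"query": {"restrictions": {"levels": ["", "autoconfirmed", "editautopatrolprotected", "sysop"]}}}
    print(has_restriction_level(sample, "editautopatrolprotected"))
```

Fetching the URL and decoding the JSON is left out on purpose so the sketch runs without network access.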
[15:10:33] !log installing libde265 security updates [15:11:46] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1003760|[commonswiki] Add an editautopatrolprotected level protection (T357298)]] (duration: 11m 37s) [15:12:47] !log UTC afternoon backport+config window done [15:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:52] claime: all yours [15:13:11] Lucas_WMDE: thanks very much :) [15:13:19] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:27] (03PS1) 10Muehlenhoff: Advertise puppetserver2003 as active Puppet 7 server [dns] - 10https://gerrit.wikimedia.org/r/1003778 (https://phabricator.wikimedia.org/T356991) [15:14:29] !log Draining kubernetes2059.codfw.wmnet kubernetes2028.codfw.wmnet kubernetes2027.codfw.wmnet kubernetes2060.codfw.wmnet kubernetes2008.codfw.wmnet kubernetes2007.codfw.wmnet kubernetes2055.codfw.wmnet mw2301.codfw.wmnet mw2424.codfw.wmnet mw2425.codfw.wmnet mw2427.codfw.wmnet - T355866 [15:14:33] (KubernetesCalicoDown) resolved: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:34] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [15:15:11] !log Depooling mw2302|mw2303|mw2304|mw2305|mw2306|mw2307|mw2308|mw2309|mw2426 - T355866 [15:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:32] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=(mw2302|mw2303|mw2304|mw2305|mw2306|mw2307|mw2308|mw2309|mw2426).* [15:15:43] (03CR) 10Muehlenhoff: [C: 03+2] Advertise puppetserver2003 as active Puppet 7 server [dns] - 
10https://gerrit.wikimedia.org/r/1003778 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [15:15:47] Thanks for your assistance Lucas_WMDE :3 [15:16:12] np :) [15:16:14] 10SRE-tools, 10Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513 (10ayounsi) Yeah that would work too but might not be worth it as the cookbook main role is to run `configure_switch_interfaces()` and might be refactored in {T344326} [15:24:52] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (10Gehel) p:05Triage→03High [15:24:53] !log imported openssl11 1.1.1w-0+deb11u1+wmf2 to component/haproxy26 T352744 (with fix for libssl11-dev file contents) [15:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:58] T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 [15:25:00] (03PS1) 10Clément Goubert: mediawiki: Align maxSurge on maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003781 [15:25:14] 10sre-alert-triage, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (10Gehel) [15:27:17] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) On a second attempt it worked. After https://gerrit.wikimedia.org/r/c/operations/puppet... 
[15:27:36] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) [15:33:48] (03CR) 10Clément Goubert: [C: 03+1] mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003772 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:35:33] (03CR) 10Filippo Giunchedi: "LGTM, nit inline" [alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:35:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1003413 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:36:02] (03CR) 10Filippo Giunchedi: [C: 03+1] P:puppetboard absent Icinga checks for PuppetBoard. [puppet] - 10https://gerrit.wikimedia.org/r/1003406 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:36:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, please loop in data engineering folks e.g. Ben for awareness" [puppet] - 10https://gerrit.wikimedia.org/r/1003382 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:36:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, please loop in data engineering folks e.g. Ben for awareness" [puppet] - 10https://gerrit.wikimedia.org/r/1003383 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:36:58] (03CR) 10Filippo Giunchedi: [C: 03+1] P:ganeti: Absent checks for generic Ganeti services. 
[puppet] - 10https://gerrit.wikimedia.org/r/1003374 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:37:31] 10sre-alert-triage, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (10bking) 05Open→03In progress a:03bking [15:37:34] (03CR) 10Filippo Giunchedi: [C: 03+1] P:puppetdb::microservice absent uwsgi Icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/1003405 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:38:06] (03CR) 10Filippo Giunchedi: [C: 03+1] puppetdb: Use the nginx certs [puppet] - 10https://gerrit.wikimedia.org/r/1003014 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [15:38:46] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [15:40:29] 10sre-alert-triage, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (10bking) Per T347505 , these are graph split hosts , which means they don't run the updater at all. We need to remove this check from the... 
[15:41:08] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [15:45:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355866 - db2155 db2156 db2105 db2122 db2133 es2024', diff saved to https://phabricator.wikimedia.org/P56837 and previous config saved to /var/cache/conftool/dbconfig/20240215-154520-arnaudb.json [15:45:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2155.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:45:26] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [15:45:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2155.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:45:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2156.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:45:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2156.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:45:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2105.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:45:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2105.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:45:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2122.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:46:12] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2122.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:46:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2133.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:46:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2133.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:46:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es2024.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:46:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2024.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:49:13] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a6-codfw.mgmt with reason: prepping for server uplink migration codfw rack a6 [15:49:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a6-codfw.mgmt with reason: prepping for server uplink migration codfw rack a6 [15:49:48] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dc8a2b8d-561d-404c-ac7f-f64637c16dd1) set by cmooney@cumin... 
[15:50:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1003014 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [15:52:46] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003772 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:53:15] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [15:53:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on es2027.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:54:02] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [15:54:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on es2027.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:54:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on es2028.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:54:26] (03Merged) 10jenkins-bot: mw-jobrunner: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003772 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:54:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on es2028.codfw.wmnet with reason: T355866 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:56:31] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 38 hosts with reason: Migrating servers in codfw rack A6 to lsw1-a6-codfw [15:57:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 38 hosts with reason: Migrating servers in codfw 
rack A6 to lsw1-a6-codfw [15:58:37] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=23a82a8c-672f-4105-8a05-0b7dbbb4cb97) set by cmooney@cumin... [16:00:05] Daimona, HouseOfM, thcipriani, and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) WikimediaCampaignEvents extension deployment (task T347909) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1600). [16:00:06] T347909: Deploy the WikimediaCampaignEvents extension to production - https://phabricator.wikimedia.org/T347909 [16:00:25] o/ [16:00:51] Hi! [16:00:54] !log commencing move of server uplinks codfw row A6 T355866 [16:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:06] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [16:08:36] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) Started to write the doc over there : https://wikitech.wikimedia.org/wiki/Ganeti#Routed_Ganeti [16:10:43] (03CR) 10Dzahn: [C: 03+1] etherpad: install mariadb server in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1003769 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [16:11:07] o/ [16:11:13] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10cmooney) All moves now complete, ports up on new switch and all devices pinging ok! 
[16:11:56] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10ABran-WMF) amazing, thanks @cmooney! will start repooling [16:12:40] Hi HouseOfM! dancy, we're ready to go now. I'm going to create the new tables in a minute [16:12:44] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [16:12:44] !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [16:12:52] Ok [16:12:55] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:12:57] !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [16:13:14] !log Uncordoning kubernetes2059.codfw.wmnet kubernetes2028.codfw.wmnet kubernetes2027.codfw.wmnet kubernetes2060.codfw.wmnet kubernetes2008.codfw.wmnet kubernetes2007.codfw.wmnet kubernetes2055.codfw.wmnet mw2301.codfw.wmnet mw2424.codfw.wmnet mw2425.codfw.wmnet mw2427.codfw.wmnet - T355866 [16:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:19] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [16:13:31] !log Repooling mw2302|mw2303|mw2304|mw2305|mw2306|mw2307|mw2308|mw2309|mw2426 - T355866 [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: T355866 - Post migration repool of db2155', diff saved to https://phabricator.wikimedia.org/P56838 and previous config saved to /var/cache/conftool/dbconfig/20240215-161338-arnaudb.json [16:13:47] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [16:13:51] !log cgoubert@cumin2002 conftool action : 
set/pooled=yes; selector: name=(mw2302|mw2303|mw2304|mw2305|mw2306|mw2307|mw2308|mw2309|mw2426).* [16:14:18] 10SRE, 10Ganeti, 10Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (10MoritzMuehlenhoff) >>! In T309724#9540291, @Volans wrote: > Should we have `/var/lib/ganeti/known_hosts` be managed by Puppet?... [16:14:37] !log Creating new DB table for the WikimediaCampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T347909 [16:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:42] T347909: Deploy the WikimediaCampaignEvents extension to production - https://phabricator.wikimedia.org/T347909 [16:15:05] (03PS1) 10Brouberol: idp: restrict the superset services to the *-k8s.wikimedia.org domains [puppet] - 10https://gerrit.wikimedia.org/r/1003790 (https://phabricator.wikimedia.org/T353794) [16:16:01] !log hnowlan@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2311.codfw.wmnet|mw2335.codfw.wmnet|mw2379.codfw.wmnet|mw2380.codfw.wmnet|mw2383.codfw.wmnet),cluster=kubernetes,service=kubesvc [16:17:12] The tables have been created and I've confirmed that they're there [16:17:57] dancy, how would you like me to send you the secret config? 
[16:18:36] Hit me up in IRC [16:18:49] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:57] PROBLEM - SSH on mw2379 is CRITICAL: connect to address 10.192.5.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:19:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:20:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] mediawiki: Align maxSurge on maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003781 (owner: 10Clément Goubert) [16:21:14] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Align maxSurge on maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003781 (owner: 10Clément Goubert) [16:21:19] Thank you! 
Now only the two public config changes remaining [16:21:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002993 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:21:55] (03CR) 10Bking: [C: 03+1] idp: restrict the superset services to the *-k8s.wikimedia.org domains [puppet] - 10https://gerrit.wikimedia.org/r/1003790 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:22:15] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) 05Open→03Resolved This is completed [16:22:22] (03CR) 10Brouberol: [C: 03+2] idp: restrict the superset services to the *-k8s.wikimedia.org domains [puppet] - 10https://gerrit.wikimedia.org/r/1003790 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:22:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10MoritzMuehlenhoff) [16:22:31] (03Merged) 10jenkins-bot: mediawiki: Align maxSurge on maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003781 (owner: 10Clément Goubert) [16:22:33] (KubernetesCalicoDown) firing: mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2379.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:22:50] (03CR) 10Majavah: idp: restrict the superset services to the *-k8s.wikimedia.org domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003790 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:23:07] (03PS2) 10Ahmon Dancy: Load WikimediaCampaignEvents if CampaignEvents is loaded 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002993 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:23:29] (03CR) 10TrainBranchBot: "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002993 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:23:31] (03CR) 10Brouberol: [C: 03+2] idp: restrict the superset services to the *-k8s.wikimedia.org domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003790 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:24:15] (03Merged) 10jenkins-bot: Load WikimediaCampaignEvents if CampaignEvents is loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002993 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:24:38] !log dancy@deploy2002 Started scap: Backport for [[gerrit:1002993|Load WikimediaCampaignEvents if CampaignEvents is loaded (T347909)]] [16:24:44] (03PS1) 10Brouberol: idp: fix typo in superset config [puppet] - 10https://gerrit.wikimedia.org/r/1003791 (https://phabricator.wikimedia.org/T353794) [16:24:47] T347909: Deploy the WikimediaCampaignEvents extension to production - https://phabricator.wikimedia.org/T347909 [16:25:15] (03CR) 10Ryan Kemper: "`./debian/rules verify_commit` gave OK" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1002619 (https://phabricator.wikimedia.org/T356651) (owner: 10Ebernhardson) [16:25:21] (03CR) 10Ryan Kemper: [C: 03+2] Bump version of extra plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1002619 (https://phabricator.wikimedia.org/T356651) (owner: 10Ebernhardson) [16:25:28] (03CR) 10Brouberol: [C: 03+2] idp: restrict the superset services to the 
*-k8s.wikimedia.org domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003790 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:26:35] (03CR) 10Bking: [V: 03+1] idp: fix typo in superset config [puppet] - 10https://gerrit.wikimedia.org/r/1003791 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:26:42] (03CR) 10Brouberol: [C: 03+2] idp: fix typo in superset config [puppet] - 10https://gerrit.wikimedia.org/r/1003791 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [16:26:49] !log dancy@deploy2002 mhorsey and dancy: Backport for [[gerrit:1002993|Load WikimediaCampaignEvents if CampaignEvents is loaded (T347909)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:26:50] (03CR) 10JHathaway: [C: 03+2] puppetdb: Use the nginx certs [puppet] - 10https://gerrit.wikimedia.org/r/1003014 (https://phabricator.wikimedia.org/T342784) (owner: 10JHathaway) [16:28:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: T355866 - Post migration repool of db2155', diff saved to https://phabricator.wikimedia.org/P56839 and previous config saved to /var/cache/conftool/dbconfig/20240215-162843-arnaudb.json [16:28:48] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [16:29:57] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:29:58] !log kubectl cordon mw2379.codfw.wmnet - bgp issues [16:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:05] RECOVERY - SSH on mw2379 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:30:26] !log dancy@deploy2002 mhorsey and dancy: Continuing with sync [16:31:12] (03CR) 10BCornwall: [C: 03+2] ncmonitor: Remove useless apt-get require [puppet] - 
10https://gerrit.wikimedia.org/r/1003524 (owner: 10BCornwall) [16:32:33] (KubernetesCalicoDown) resolved: mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2379.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:35:20] (03CR) 10Eevans: [C: 03+2] cassandra: install git-fat to satisfy scap requirement [puppet] - 10https://gerrit.wikimedia.org/r/1003526 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [16:38:15] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:1002993|Load WikimediaCampaignEvents if CampaignEvents is loaded (T347909)]] (duration: 13m 36s) [16:38:20] T347909: Deploy the WikimediaCampaignEvents extension to production - https://phabricator.wikimedia.org/T347909 [16:39:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002994 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:39:46] (03PS2) 10Ahmon Dancy: Remove explicit load of WikimediaCampaignevents extension from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002994 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:39:55] (03CR) 10TrainBranchBot: "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002994 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:40:08] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:40:13] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:40:14] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:40:23] !log hnowlan@cumin2002 conftool action : set/pooled=no; selector: name=mw2379.codfw.wmnet [16:40:46] !log dani@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/miscweb: apply [16:40:47] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:42:04] (03Merged) 10jenkins-bot: Remove explicit load of WikimediaCampaignevents extension from beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002994 (https://phabricator.wikimedia.org/T347909) (owner: 10Mhorsey) [16:43:01] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@162f72f] (aqs): Deploying to updated target list — T353550 [16:43:07] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [16:43:12] (03PS1) 10Brouberol: idp: remove superset services from CAS as an attempt to restore production service [puppet] - 10https://gerrit.wikimedia.org/r/1003796 (https://phabricator.wikimedia.org/T357688) [16:43:38] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@162f72f] (aqs): Deploying to updated target list — T353550 (duration: 00m 37s) [16:43:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: T355866 - Post migration repool of db2155', diff saved to https://phabricator.wikimedia.org/P56840 and previous config saved to /var/cache/conftool/dbconfig/20240215-164348-arnaudb.json [16:43:53] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [16:44:16] (03PS2) 10Brouberol: idp: remove superset services from CAS as an attempt to restore service [puppet] - 10https://gerrit.wikimedia.org/r/1003796 (https://phabricator.wikimedia.org/T357688) [16:44:47] PROBLEM - prometheus-codfw.wikimedia.org requires authentication on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:44:49] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1383/co" [puppet] - 
10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [16:44:53] PROBLEM - prometheus-codfw.wikimedia.org tls expiry on prometheus2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:45:12] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@162f72f] (cassandra-dev): Deploying to updated target list — T353550 [16:45:27] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@162f72f] (cassandra-dev): Deploying to updated target list — T353550 (duration: 00m 15s) [16:45:45] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw2379.codfw.wmnet with reason: BGP issues - uncordoned, needs investigation [16:45:51] (03CR) 10Brouberol: [C: 03+2] idp: remove superset services from CAS as an attempt to restore service [puppet] - 10https://gerrit.wikimedia.org/r/1003796 (https://phabricator.wikimedia.org/T357688) (owner: 10Brouberol) [16:45:55] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@162f72f] (ml-cache): Deploying to updated target list — T353550 [16:46:01] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw2379.codfw.wmnet with reason: BGP issues - uncordoned, needs investigation [16:46:10] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@162f72f] (ml-cache): Deploying to updated target list — T353550 (duration: 00m 15s) [16:46:18] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@162f72f] (sessionstore): Deploying to updated target list — T353550 [16:46:33] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@162f72f] (sessionstore): Deploying to updated target list — T353550 (duration: 00m 15s) [16:49:41] RECOVERY - prometheus-codfw.wikimedia.org requires authentication on prometheus2005 is OK: HTTP OK: Status line output 
matched HTTP/1.1 302 - 564 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:49:47] RECOVERY - prometheus-codfw.wikimedia.org tls expiry on prometheus2005 is OK: OK - Certificate prometheus.discovery.wmnet will expire on Sat 09 Mar 2024 09:10:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:51:27] (03CR) 10JHathaway: "that seems fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [16:51:40] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "rack1"} and A:aqs and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [16:51:45] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [16:51:52] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 (10cscott) Also worth keeping in mind: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/965608 -- in order to avoid inadventently getting exte... 
[16:52:53] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:53:09] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:53:13] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:53:14] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:53:17] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:53:18] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:53:20] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:56:57] (03PS8) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [16:58:47] (03PS1) 10DDesouza: design/style-guide: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003802 (https://phabricator.wikimedia.org/T352202) [16:58:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: T355866 - Post migration repool of db2155', diff saved to https://phabricator.wikimedia.org/P56841 and previous config saved to /var/cache/conftool/dbconfig/20240215-165853-arnaudb.json [16:58:58] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [16:58:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: T355866 - Post migration repool of db2156', diff saved to https://phabricator.wikimedia.org/P56842 and previous config saved to /var/cache/conftool/dbconfig/20240215-165858-arnaudb.json [17:00:04] swfrench-wmf and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:26] swfrench-wmf and I are going to deploy an apache config change 👋 [17:00:37] (03CR) 10DDesouza: "@Daniel Zahn - I tried deploying it now and got the same issue. Is it a quota issue? I ran into a similar issue last time and that was the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [17:01:57] (03CR) 10DDesouza: "Anyway, all looks fine now. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [17:02:11] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1384/co" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [17:05:30] !log disabling puppet shortly on mediawiki::webserver hosts to deploy T357436 [17:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:41] T357436: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 [17:13:24] (03CR) 10Scott French: [C: 03+2] Add wikihole redirect for donatewiki [puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: 10Dwisehaupt) [17:14:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 50%: T355866 - Post migration repool of db2156', diff saved to https://phabricator.wikimedia.org/P56843 and previous config saved to /var/cache/conftool/dbconfig/20240215-171403-arnaudb.json [17:14:20] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [17:21:36] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:21:37] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet1005.eqiad.wmnet [17:22:30] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-debug: apply [17:23:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:23:58] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:24:31] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on vrts1002.eqiad.wmnet with reason: Migration Ongoing [17:24:41] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "rack1"} and A:aqs and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [17:24:45] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on vrts1002.eqiad.wmnet with reason: Migration Ongoing [17:24:46] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [17:28:05] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1005.eqiad.wmnet [17:29:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: T355866 - Post migration repool of db2156', diff saved to https://phabricator.wikimedia.org/P56844 and previous config saved to /var/cache/conftool/dbconfig/20240215-172909-arnaudb.json [17:29:16] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [17:30:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:31:21] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:31:22] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:32:22] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:32:23] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:33:37] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:33:38] !log swfrench@deploy2002 helmfile 
[eqiad] START helmfile.d/services/mw-api-int: apply [17:34:48] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:34:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:34:58] (03PS1) 10Ryan Kemper: Bump version to 7.10.2-12 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1003803 (https://phabricator.wikimedia.org/T356651) [17:35:31] (03PS2) 10Ryan Kemper: Bump version to 7.10.2-12 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1003803 (https://phabricator.wikimedia.org/T356651) [17:35:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:35:42] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:36:06] (03CR) 10Ebernhardson: [C: 03+2] Bump version to 7.10.2-12 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1003803 (https://phabricator.wikimedia.org/T356651) (owner: 10Ryan Kemper) [17:36:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:36:39] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [17:36:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [17:36:42] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [17:36:45] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [17:36:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:37:32] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:37:33] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:38:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [18:02:38] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts 
cloudelastic1006.wikimedia.org [18:05:07] (03PS1) 10BryanDavis: toolhub: Bump container version to 2024-02-12-223131-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003806 [18:08:34] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:09:23] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:11:29] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1006.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [18:11:41] (03PS1) 10JHathaway: cas: compress log files and only keep a year's worth [puppet] - 10https://gerrit.wikimedia.org/r/1003808 (https://phabricator.wikimedia.org/T357711) [18:12:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1006.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [18:12:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:12:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1006.wikimedia.org [18:13:49] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2024-02-12-223131-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003806 (owner: 10BryanDavis) [18:14:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: T355866 - Post migration repool of db2105', diff saved to https://phabricator.wikimedia.org/P56849 and previous config saved to /var/cache/conftool/dbconfig/20240215-181429-arnaudb.json [18:14:35] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw 
- https://phabricator.wikimedia.org/T355866 [18:14:47] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2024-02-12-223131-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003806 (owner: 10BryanDavis) [18:15:47] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:16:19] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:17:04] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:17:47] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:18:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:18:42] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:18:55] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1003808 (https://phabricator.wikimedia.org/T357711) (owner: 10JHathaway) [18:19:04] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003808 (https://phabricator.wikimedia.org/T357711) (owner: 10JHathaway) [18:19:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:20:10] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1006 to private IPs - bking@cumin2002" [18:20:56] (03PS9) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [18:20:59] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1006 to private IPs - bking@cumin2002" [18:21:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [18:21:42] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1006 [18:21:59] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:23:00] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1006 [18:23:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1006.eqiad.wmnet with OS bullseye [18:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:26:23] (03CR) 10Scott French: [C: 03+2] httpbb: add donate.wikimedia.org redirect tests [puppet] - 10https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: 10Scott French) [18:26:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56850 and previous config saved to /var/cache/conftool/dbconfig/20240215-182644-ladsgroup.json [18:26:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:29:13] 10SRE, 10Fundraising-Backlog, 10Wikimedia-Apache-configuration, 10fundraising-tech-ops, and 2 others: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10Scott_French) Thanks to @Dwisehaupt for preparing the config patch and @RLazarus for assistance deploying it. The change is now l... 
[18:29:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: T355866 - Post migration repool of db2105', diff saved to https://phabricator.wikimedia.org/P56852 and previous config saved to /var/cache/conftool/dbconfig/20240215-182934-arnaudb.json [18:29:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: T355866 - Post migration repool of db2122', diff saved to https://phabricator.wikimedia.org/P56853 and previous config saved to /var/cache/conftool/dbconfig/20240215-182939-arnaudb.json [18:29:42] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [18:30:18] (03PS1) 10Jdlrobson: Add border-collapse to wikitable [skins/MinervaNeue] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003485 (https://phabricator.wikimedia.org/T357589) [18:30:46] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1385/co" [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [18:32:17] 10SRE, 10Fundraising-Backlog, 10Wikimedia-Apache-configuration, 10fundraising-tech-ops, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10Scott_French) 05Open→03Resolved a:03Scott_French [18:41:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246:3314', diff saved to https://phabricator.wikimedia.org/P56855 and previous config saved to /var/cache/conftool/dbconfig/20240215-184150-ladsgroup.json [18:42:55] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices1006.eqiad.wmnet [18:44:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: T355866 - Post migration repool of db2122', diff saved to https://phabricator.wikimedia.org/P56856 and previous config saved to 
/var/cache/conftool/dbconfig/20240215-184444-arnaudb.json [18:44:50] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [18:45:27] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:45:49] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:50:29] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:50:46] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1006.eqiad.wmnet [18:50:51] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:51:09] (03PS2) 10TheDJ: Remove deprecated X-Webkit-CSP-Report-Only response header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) [18:53:00] (03PS3) 10TheDJ: Remove deprecated X-Webkit-CSP-Report-Only response header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) [18:56:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246:3314', diff saved to https://phabricator.wikimedia.org/P56857 and previous config saved to /var/cache/conftool/dbconfig/20240215-185657-ladsgroup.json [18:58:17] (03CR) 10CDanis: [C: 03+1] cas: compress log files and only keep a year's worth [puppet] - 10https://gerrit.wikimedia.org/r/1003808 (https://phabricator.wikimedia.org/T357711) (owner: 10JHathaway) [18:59:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: T355866 - Post migration repool of db2122', diff saved to 
https://phabricator.wikimedia.org/P56858 and previous config saved to /var/cache/conftool/dbconfig/20240215-185949-arnaudb.json [18:59:55] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [19:00:05] jeena and brennen: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T1900). Please do the needful. [19:00:38] (03CR) 10Dzahn: [C: 03+2] "@Daniel Souza Jelto fixed this. See https://phabricator.wikimedia.org/T357413 for details. Cheers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: 10DDesouza) [19:02:58] o/ [19:04:36] !log train 1.42.0-wmf.18 (T354436): no current blockers, rolling to all wikis. [19:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:42] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:04:48] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003817 (https://phabricator.wikimedia.org/T354436) [19:04:50] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003817 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [19:06:16] (03PS1) 10Ebernhardson: Connection: Correct read-only detection [extensions/CirrusSearch] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003827 (https://phabricator.wikimedia.org/T354793) [19:11:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1006.eqiad.wmnet with OS bullseye [19:12:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56859 and previous config saved to /var/cache/conftool/dbconfig/20240215-191203-ladsgroup.json [19:12:06] !log ladsgroup@cumin1002 
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:12:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:12:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:12:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T352010)', diff saved to https://phabricator.wikimedia.org/P56860 and previous config saved to /var/cache/conftool/dbconfig/20240215-191226-ladsgroup.json [19:13:42] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003817 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [19:14:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: T355866 - Post migration repool of db2122', diff saved to https://phabricator.wikimedia.org/P56861 and previous config saved to /var/cache/conftool/dbconfig/20240215-191454-arnaudb.json [19:15:00] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [19:15:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: T355866 - Post migration repool of es2024', diff saved to https://phabricator.wikimedia.org/P56862 and previous config saved to /var/cache/conftool/dbconfig/20240215-191500-arnaudb.json [19:17:00] (03PS2) 10GergesShamon: Increase move rate limit for extendedmovers in arwiki to 16/60 T357229 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000292 [19:19:20] (03CR) 10Hubaishan: "good job" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000292 (owner: 10GergesShamon) [19:22:41] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.18 refs T354436 [19:22:46] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:24:58] !log 
bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1006.eqiad.wmnet with OS bullseye [19:30:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: T355866 - Post migration repool of es2024', diff saved to https://phabricator.wikimedia.org/P56863 and previous config saved to /var/cache/conftool/dbconfig/20240215-193005-arnaudb.json [19:30:15] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [19:31:45] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "rack2"} and A:aqs and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [19:31:51] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [19:34:27] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:34:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T352010)', diff saved to https://phabricator.wikimedia.org/P56864 and previous config saved to /var/cache/conftool/dbconfig/20240215-193455-ladsgroup.json [19:35:01] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:35:07] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:36:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
We definitely had proper compressed logs in the initial setup (that's when we puppetised the log4j config), but presumably a v" [puppet] - https://gerrit.wikimedia.org/r/1003808 (https://phabricator.wikimedia.org/T357711) (owner: JHathaway) [19:39:48] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1006.eqiad.wmnet with reason: host reimage [19:41:04] (CR) JHathaway: [C: +2] cas: compress log files and only keep a year's worth [puppet] - https://gerrit.wikimedia.org/r/1003808 (https://phabricator.wikimedia.org/T357711) (owner: JHathaway) [19:42:43] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1006.eqiad.wmnet with reason: host reimage [19:42:58] (PS1) Ryan Kemper: graph_split: don't alert on updater absence [puppet] - https://gerrit.wikimedia.org/r/1003819 (https://phabricator.wikimedia.org/T357496) [19:43:14] !log manually generating checksums in parallel for wikidata full history dumps run, in screen session, owned by ariel, on snapshot1009 [19:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:36] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1003819 (https://phabricator.wikimedia.org/T357496) (owner: Ryan Kemper) [19:43:52] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1003819 (https://phabricator.wikimedia.org/T357496) (owner: Ryan Kemper) [19:43:54] SRE, MW-on-K8s, Scap, serviceops, Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (dancy) p:Triage→Medium a:dancy [19:45:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: T355866 - Post migration repool of es2024', diff saved to https://phabricator.wikimedia.org/P56865 and previous config saved to
/var/cache/conftool/dbconfig/20240215-194510-arnaudb.json [19:45:16] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [19:47:01] (03CR) 10Bking: [C: 03+1] graph_split: don't alert on updater absence [puppet] - 10https://gerrit.wikimedia.org/r/1003819 (https://phabricator.wikimedia.org/T357496) (owner: 10Ryan Kemper) [19:48:00] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade - ryankemper@cumin2002 - T356651 [19:48:05] T356651: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 [19:49:05] (03CR) 10Ryan Kemper: [C: 03+1] graph_split: don't alert on updater absence [puppet] - 10https://gerrit.wikimedia.org/r/1003819 (https://phabricator.wikimedia.org/T357496) (owner: 10Ryan Kemper) [19:49:07] (03CR) 10Ryan Kemper: [C: 03+2] graph_split: don't alert on updater absence [puppet] - 10https://gerrit.wikimedia.org/r/1003819 (https://phabricator.wikimedia.org/T357496) (owner: 10Ryan Kemper) [19:50:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56866 and previous config saved to /var/cache/conftool/dbconfig/20240215-195001-ladsgroup.json [19:58:47] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [20:00:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: T355866 - Post migration repool of es2024', diff saved to https://phabricator.wikimedia.org/P56867 and previous config saved to /var/cache/conftool/dbconfig/20240215-200015-arnaudb.json [20:00:30] T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 [20:03:43] (03PS1) 10JHathaway: cas: lower log level 
WARN [puppet] - 10https://gerrit.wikimedia.org/r/1003867 [20:04:20] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1003867 (owner: 10JHathaway) [20:05:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56868 and previous config saved to /var/cache/conftool/dbconfig/20240215-200507-ladsgroup.json [20:05:14] (03PS2) 10JHathaway: cas: lower log level WARN [puppet] - 10https://gerrit.wikimedia.org/r/1003867 [20:05:29] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003867 (owner: 10JHathaway) [20:06:16] (03CR) 10C. Scott Ananian: [C: 04-2] "Not until I9d5fb6348609642ad94743cc5dae81ce608be99d rides the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: 10C. Scott Ananian) [20:06:43] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "rack2"} and A:aqs and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [20:06:48] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [20:08:30] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "rack3"} and A:aqs and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [20:13:18] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003870 (https://phabricator.wikimedia.org/T354436) [20:13:20] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003870 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [20:14:37] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003870 (https://phabricator.wikimedia.org/T354436) 
(owner: 10TrainBranchBot) [20:19:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:20:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T352010)', diff saved to https://phabricator.wikimedia.org/P56869 and previous config saved to /var/cache/conftool/dbconfig/20240215-202014-ladsgroup.json [20:20:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [20:20:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:20:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [20:20:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T352010)', diff saved to https://phabricator.wikimedia.org/P56870 and previous config saved to /var/cache/conftool/dbconfig/20240215-202036-ladsgroup.json [20:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:27:39] (03PS1) 10Jdlrobson: Enable night mode on mobile test servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003873 [20:38:50] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.18 refs T354436 [20:38:56] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [20:39:30] (MediaWikiHighErrorRate) firing: (3) Elevated rate of MediaWiki errors - kube-mw-api-int - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:39:45] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:41:31] ugh [20:41:48] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "rack3"} and A:aqs and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [20:41:52] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [20:42:54] welp: .17 i/p/P/ParsoidOutputAccess:140 ParserOutput does not have a render ID [20:43:30] something here not backwards compatible? [20:45:26] added https://phabricator.wikimedia.org/T356368 as a train blocker to get some attention on it [20:45:38] but, yes, seems something was not very rollbackable [20:46:47] does https://gerrit.wikimedia.org/r/c/mediawiki/core/+/957773 just need to be backported? 
[20:46:56] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.18 refs T354436 (duration: 08m 05s) [20:47:02] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [20:47:09] included in shows it as only in master + .18 [20:47:32] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "a_c"} and A:aqs and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [20:47:36] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [20:54:12] bd808: good question [20:58:09] seems like a giant thing to backport, tantamount to just rolling forward [20:59:16] cscott seems to not be on irc at the moment [20:59:34] probably somewhere o slack I'd suppose [21:00:01] * bd808 is sick of his keyboard dropping keystrokes [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240215T2100). [21:00:05] subbu and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:56] and me o/ [21:00:57] please hold on backport window [21:01:02] currently dealing with a couple of UBNs [21:01:07] brennen: anything I can help with? 
[21:01:43] so we're blocked on 2 things: T357668, T356368 [21:01:45] T357668: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::encodeAttribute() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.18/includes/xml/Xml.php on line 81 - https://phabricator.wikimedia.org/T357668 [21:01:46] T356368: Revision endpoint: InvalidArgumentException: ParserOutput does not have a render ID - https://phabricator.wikimedia.org/T356368 [21:01:48] o/ [21:02:03] the second one has been blowing up since i rolled the train back to group1 for the first one [21:02:16] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571 (10VRiley-WMF) [21:02:37] fun times. :/ [21:02:52] let me try to find scott [21:02:55] my current inclination is to roll forward again, hoping that fixes T356368, and then do that revert. [21:03:18] subbu: thanks. I was about to ask if you would :) [21:03:24] appreciated [21:07:47] i'm here sorry i'm late! [21:08:07] cscott, not related to the backport .. but related to whether your render id patch is safe to backport to wmf.17 [21:08:41] or if it'd be better to roll forward and do this revert for the other current blocker: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ImageMap/+/1003828 [21:08:50] but brennen i think rolling forward the train and reverting the imagemap patch seems like a reasonable plan ... cscott for context: wmf18 -> wmf17 rollback from group2 caused T356368 [21:08:52] T356368: Revision endpoint: InvalidArgumentException: ParserOutput does not have a render ID - https://phabricator.wikimedia.org/T356368 [21:09:07] but, let us see if scott has any input here. [21:09:51] yeah, i think reverting imagemap is safer [21:10:08] the render id patch is just really big, makes me a lot more nervous about backporting [21:10:37] the imagemap revert seems small and compact by comparison [21:10:37] yeah, that makes sense. 
i will go ahead and roll forward to clear the large volume of current errors and then deal with the revert. [21:11:18] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003878 (https://phabricator.wikimedia.org/T354436) [21:11:20] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003878 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [21:12:05] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003878 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [21:16:47] oh, i'm pretty sure why rollback caused complaints, and its my fault, sorry. :( [21:17:15] I can also write a quick "forward compat" patch for wmf.17 that should fix the rollback issues if that would be helpful. [21:17:46] it would probably be a good idea to have that, if it's not too much effort. [21:17:50] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/957773/comment/fe36aad4_a9a08078/ [21:17:58] (03CR) 10CDanis: [C: 03+1] cas: lower log level WARN [puppet] - 10https://gerrit.wikimedia.org/r/1003867 (owner: 10JHathaway) [21:18:00] i'll get the patches ready. [21:18:25] sorry about that, forward-compat on ParserOutput is really hard, it keeps biting us. [21:18:26] thanks cscott. hopefully won't be needed but it's good to know that we *can* roll back. 
[21:19:29] (03PS1) 10Brennen Bearnes: Revert "Set target to $wgExternalLinkTarget" [extensions/ImageMap] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003830 [21:19:30] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:20:45] the null encodeAttribute thing seems trivial to fix, just make sure null attributes are filtered out [21:20:46] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "a_c"} and A:aqs and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [21:20:56] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.18 refs T354436 [21:20:58] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [21:21:05] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [21:23:50] tgr: should i wait for that fix rather than deploying revert? [21:25:34] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ImageMap/+/1003831 cscott wanna review it? [21:25:50] I don't have ImageMap set up locally, but the change is minimal [21:26:02] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "b_e"} and A:aqs and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [21:26:07] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [21:27:10] probably should be fixed in Parser::getExternalLinkAttribs() instead of in ImageMap, but that's not something to attempt on a running train [21:27:33] tgr looks reasonable. [21:27:53] arlo has imagemap installed .. so pinged him there. [21:28:10] but, i can be bold and +2 since it feels like the right thing. 
[21:28:15] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [21:28:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1006.eqiad.wmnet with OS bullseye [21:30:32] yeah that patch seems right.  I agree that the Real Fix (tm) is to use unset() in externalLinkAttrs but this is a safer minimal fix. [21:31:17] ya, i already +2ed it. [21:31:42] awright, cool. i'll backport. [21:32:18] (03PS1) 10Brennen Bearnes: Filter out null external link attributes [extensions/ImageMap] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003832 (https://phabricator.wikimedia.org/T357668) [21:33:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/ImageMap] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003832 (https://phabricator.wikimedia.org/T357668) (owner: 10Brennen Bearnes) [21:34:43] (03CR) 10Brennen Bearnes: [C: 04-2] "Doing https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ImageMap/+/1003832 instead." 
[extensions/ImageMap] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003830 (owner: 10Brennen Bearnes) [21:35:03] (03Abandoned) 10Brennen Bearnes: Revert "Set target to $wgExternalLinkTarget" [extensions/ImageMap] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003830 (owner: 10Brennen Bearnes) [21:35:18] (03CR) 10Bking: [C: 03+2] cloudelastic: Complete cloudelastic1006's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:35:32] (03PS2) 10Bking: cloudelastic: Complete cloudelastic1006's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617) [21:36:09] (03CR) 10Bking: [V: 03+2 C: 03+2] cloudelastic: Complete cloudelastic1006's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:42:00] (03PS1) 10C. Scott Ananian: [ParserOutput] allow rollback of render id [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) [21:42:31] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1003833 against wmf.17 *should* allow rollback to wmf.17 [21:46:32] * subbu looks [21:46:39] (03PS2) 10C. 
Scott Ananian: [ParserOutput] allow rollback of render id [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) [21:46:59] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571 (10VRiley-WMF) an-redacteddb1001 Rack D2 U25 Port 28 CableID 5365 [21:49:22] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571 (10VRiley-WMF) [21:51:28] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [21:51:35] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [21:52:25] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1005* for IP migration - bking@cumin2002 - T355617 [21:52:29] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1005* for IP migration - bking@cumin2002 - T355617 [21:52:30] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [21:53:03] (03Merged) 10jenkins-bot: Filter out null external link attributes [extensions/ImageMap] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003832 (https://phabricator.wikimedia.org/T357668) (owner: 10Brennen Bearnes) [21:53:19] !log brennen@deploy2002 Started scap: Backport for [[gerrit:1003832|Filter out null external link attributes (T357668)]] [21:53:24] T357668: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::encodeAttribute() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.18/includes/xml/Xml.php on line 81 - https://phabricator.wikimedia.org/T357668 [21:53:48] i don't suppose this can be tested particularly easily. 
[21:54:04] (PS2) Dzahn: add alert for planet content updates (last modified) [alerts] - https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) [21:54:37] (CR) CI reject: [V: -1] add alert for planet content updates (last modified) [alerts] - https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: Dzahn) [21:54:43] !log brennen@deploy2002 brennen: Backport for [[gerrit:1003832|Filter out null external link attributes (T357668)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:55:54] (CR) Subramanya Sastry: [ParserOutput] allow rollback of render id (1 comment) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C. Scott Ananian) [21:55:59] brennen the "deploy it and see if the site falls over" test strategy [21:56:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade - ryankemper@cumin2002 - T356651 [21:56:19] cscott, i left you a review comment on the patch. [21:56:20] T356651: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 [21:56:43] minor thing .. missing check for missing revid that you have right now on master. [21:57:08] cscott: yup, here goes nothin' [21:57:10] which should be an exceptional case, but famous last words .. so worth checking and logging it. [21:57:14] !log brennen@deploy2002 brennen: Continuing with sync [21:57:27] (PS3) Dzahn: add alert for planet content updates (last modified) [alerts] - https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) [21:57:29] (CR) C.
Scott Ananian: [ParserOutput] allow rollback of render id (1 comment) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C. Scott Ananian) [21:58:13] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:59:06] (CR) Subramanya Sastry: [ParserOutput] allow rollback of render id (2 comments) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C. Scott Ananian) [21:59:44] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade - ryankemper@cumin2002 - T356651 [21:59:45] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "b_e"} and A:aqs and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [21:59:53] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [21:59:56] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "c_f"} and A:aqs and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [22:00:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:00:11] (CR) C. Scott Ananian: [ParserOutput] allow rollback of render id (1 comment) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C.
Scott Ananian) [22:00:20] (PS1) Ahmon Dancy: logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) [22:00:55] (PS2) Bking: cloudelastic: Begin private IP migration for cloudelastic1005 [puppet] - https://gerrit.wikimedia.org/r/1003561 (https://phabricator.wikimedia.org/T355617) [22:01:40] (CR) CI reject: [V: -1] logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) (owner: Ahmon Dancy) [22:02:14] (PS5) Dzahn: site: add etherpad role to etherpad1004 [puppet] - https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) [22:02:24] (CR) Bking: [V: +2 C: +2] cloudelastic: Begin private IP migration for cloudelastic1005 [puppet] - https://gerrit.wikimedia.org/r/1003561 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:03:25] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cloudelastic1006\.eqiad\.wmnet [22:03:26] (CR) Subramanya Sastry: [ParserOutput] allow rollback of render id (1 comment) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C. Scott Ananian) [22:03:55] (CR) C. Scott Ananian: [ParserOutput] allow rollback of render id (1 comment) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C. Scott Ananian) [22:04:15] (PS5) Dzahn: site: apply etherpad role on both eqiad and codfw [puppet] - https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) [22:04:36] (CR) C.
Scott Ananian: [ParserOutput] allow rollback of render id (1 comment) [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: C. Scott Ananian) [22:05:00] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:1003832|Filter out null external link attributes (T357668)]] (duration: 11m 40s) [22:05:07] T357668: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::encodeAttribute() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.18/includes/xml/Xml.php on line 81 - https://phabricator.wikimedia.org/T357668 [22:05:31] is it safe to assume the backport window is cancelled today? [22:05:54] (CR) Dzahn: [C: +2] nagios_common/planet: remove check_lastmod check, script and config [puppet] - https://gerrit.wikimedia.org/r/1003084 (https://phabricator.wikimedia.org/T353298) (owner: Dzahn) [22:05:58] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1005.wikimedia.org [22:06:30] (PS2) Ahmon Dancy: logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) [22:08:34] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:08:47] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-redacteddb1001.mgmt.eqiad.wmnet with reboot policy FORCED [22:09:19] Jdlrobson: fwiw, train blockers seem resolved to me. i don't personally object to some stuff going out in the next hour, but i think i will be carefully backing away from the deploy2002 commandline now myself.
:) [22:09:27] (03CR) 10Dzahn: [C: 03+2] "deployed on alert* hosts, no issues seen:" [puppet] - 10https://gerrit.wikimedia.org/r/1003084 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn) [22:10:19] grafana needs a chart for collective relenger blood pressure. [22:10:44] (03CR) 10Subramanya Sastry: [ParserOutput] allow rollback of render id (032 comments) [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: 10C. Scott Ananian) [22:11:10] brennen: that's a KPI that we probably would be sad about if we tracked it :/ [22:11:45] yay reg. resolved train blockers. [22:12:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:13:02] brennen: ok. The only urgent one I had was the table regression in mobile we discussed earlier, but it could wait until monday (or even tomorrow) [22:14:09] Jdlrobson: i don't mind doing that one just to get it cleaned up [22:14:24] anybody else have anything urgent? [22:14:51] (03PS1) 10Gergő Tisza: logstash: Use normalized_message for checksums [puppet] - 10https://gerrit.wikimedia.org/r/1003890 [22:14:59] there is nothing urgent from us ... it is a config change to turn on DT with Parsoid on wikitech ... but cscott i suppose we could wait one more day? [22:16:16] so, you can go ahead with jon's. [22:16:53] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [22:16:58] the config change will be the very first parsoid html based read views rollout :) but we can do it tomorrow and not bury the lede late thu evening. 
:) [22:18:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003485 (https://phabricator.wikimedia.org/T357589) (owner: 10Jdlrobson) [22:18:20] subbu: cool, thanks [22:19:00] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1005.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [22:19:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:19:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1005.wikimedia.org [22:19:04] (03CR) 10CI reject: [V: 04-1] logstash: Use normalized_message for checksums [puppet] - 10https://gerrit.wikimedia.org/r/1003890 (owner: 10Gergő Tisza) [22:19:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:19:24] (03PS3) 10C. Scott Ananian: [ParserOutput] allow rollback of render id [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) [22:19:50] brennen: ack let me know when you want me to look at it [22:20:33] i think there's a limit of one t-shirt per day, so let's save the parsoid read views rollout for another day :) [22:20:51] (03CR) 10Subramanya Sastry: [C: 03+1] "Good to go if this needed." [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: 10C. Scott Ananian) [22:21:09] okay :) [22:21:44] haha [22:22:26] officially no deploys tomorrow ... 
but thcipriani brennen do you prefer we do this tuesday (or monday)? [22:22:49] monday's a US holiday, i think? [22:22:54] brennen: I !bash'ed your blood pressure quip. And I found one from back in the day that's related: https://bash.toolforge.org/quip/AU7VVvJI6snAnmqnK_zy [22:23:26] wikitech config change should be relatively safe either day, no? [22:23:34] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:35] i'm fine with wikitech tuesday, officewiki wednesday (after group1 rollout) and conquering the world on thursday [22:23:37] bd808, ya .. it is safe [22:24:05] we can party (almost) every day next week [22:24:07] given everyone's blood pressure, i didn't push for it today .. but low traffic, low impact. [22:24:08] no objections from me either day, though i expect monday to be more of a ghost town and i'm personally out. [22:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:24:31] ^ this one is cirrussearch noise i think. [22:24:45] cscott, okay .. we can touch base tomorrow. 
[22:27:25] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:28:24] (03PS2) 10Gergő Tisza: logstash: Use normalized_message for checksums [puppet] - 10https://gerrit.wikimedia.org/r/1003890 [22:29:47] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1005 to private IPs - bking@cumin2002" [22:30:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1005 to private IPs - bking@cumin2002" [22:30:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:33:23] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1005 [22:33:32] my own blood pressure would be much lower if we did the big group2 deployment earlier in the week, *or* if we did normal backport deployments on fridays too [22:33:41] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1003892 is a "better" fix for T357668 btw. 
[22:33:41] T357668: TypeError: Argument 1 passed to MediaWiki\Parser\Sanitizer::encodeAttribute() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.18/includes/xml/Xml.php on line 81 - https://phabricator.wikimedia.org/T357668 [22:33:50] for master [22:34:13] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "c_f"} and A:aqs and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002 [22:34:20] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [22:34:45] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1005 [22:36:30] brennen: i have a fix for the cirrus noise that was going to go out in the backport today but didn't make it, i have to run for a little bit but i can backport that in 45min or so [22:36:52] ebernhardson: ack, ty [22:38:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-redacteddb1001.mgmt.eqiad.wmnet with reboot policy FORCED [22:40:04] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1005.eqiad.wmnet with OS bullseye [22:40:21] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-redacteddb1001'] [22:40:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-redacteddb1001'] [22:43:17] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571 (10VRiley-WMF) [22:46:41] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics-privatedata-users for jwheeler - https://phabricator.wikimedia.org/T357731 (10JWheeler-WMF) [22:47:18] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Restart to pickup logging jars — T353550 - eevans@cumin1002 
[22:47:23] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [22:48:34] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:23] brennen: still in CI? [22:58:44] Jdlrobson: ...yeah. i wonder if this is hanging... [22:59:14] 39m elapsed [22:59:19] brennen: FYI we may also have another UBN (https://phabricator.wikimedia.org/T357724) that I may need help deploying tomorrow [23:00:22] fun [23:00:28] brennen: for that it seems like we might need to run a database script to update preferences :( [23:00:44] this last test looks hung up (sorry disappeared into meetings earlier) [23:01:10] npm audit has been running for ... 30 minutes? [23:02:25] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.eqiad.wmnet with OS bullseye [23:02:48] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1005.eqiad.wmnet with OS bullseye [23:03:02] that do seem bogus. 
[23:04:25] ...that host is...unhappy [23:09:17] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.eqiad.wmnet with OS bullseye [23:13:51] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1005.eqiad.wmnet with OS bullseye [23:15:44] back, i can ship my backport when deploy is available, but it looks like a backport is still running so i'll wait [23:16:02] ugh, I give up troubleshooting what's going on with this CI job, let's kill it and rerun [23:16:19] npm is hung up somewhere [23:17:09] (03CR) 10CI reject: [V: 04-1] Add border-collapse to wikitable [skins/MinervaNeue] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003485 (https://phabricator.wikimedia.org/T357589) (owner: 10Jdlrobson) [23:17:38] (03CR) 10Thcipriani: [C: 03+2] Add border-collapse to wikitable [skins/MinervaNeue] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003485 (https://phabricator.wikimedia.org/T357589) (owner: 10Jdlrobson) [23:17:56] ebernhardson: what repo is the backport for? [23:18:14] thcipriani: cirrus, it's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1003827 [23:18:36] !log removing 2 files for legal compliance [23:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:12] ebernhardson: let's get it running through CI, so go ahead and CR+2 and we'll see which patch finishes first [23:19:20] kk [23:19:32] (03CR) 10Ebernhardson: [C: 03+2] "merging for backport" [extensions/CirrusSearch] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003827 (https://phabricator.wikimedia.org/T354793) (owner: 10Ebernhardson) [23:19:42] Jdlrobson: re: font size regression, i'm out tomorrow but a couple of relengers should be about. 
[23:26:55] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Restart to pickup logging jars — T353550 - eevans@cumin1002 [23:27:00] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [23:28:27] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1005.eqiad.wmnet with reason: host reimage [23:30:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T352010)', diff saved to https://phabricator.wikimedia.org/P56873 and previous config saved to /var/cache/conftool/dbconfig/20240215-233053-ladsgroup.json [23:31:01] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:31:46] brennen: https://youtu.be/7XTAsLSa6T8 is a quick summary of the font size issue. It's basically a problem with cached HTML. Still not sure how to handle it but it can wait until tomorrow. [23:31:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1005.eqiad.wmnet with reason: host reimage [23:32:46] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:32:48] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:33:03] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:33:11] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:38:03] (03Merged) 10jenkins-bot: Add border-collapse to wikitable [skins/MinervaNeue] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003485 (https://phabricator.wikimedia.org/T357589) (owner: 10Jdlrobson) [23:38:10] yippee [23:38:38] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:1003485|Add border-collapse to wikitable (T357589)]] [23:38:43] T357589: [Regression] 
Wikitables has unwanted border spacing on mobile - https://phabricator.wikimedia.org/T357589 [23:40:07] !log thcipriani@deploy2002 thcipriani and jdlrobson: Backport for [[gerrit:1003485|Add border-collapse to wikitable (T357589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:40:33] ^ Jdlrobson the much anticipated css change is on mwdebug, what do you think? Look good? [23:41:14] (03Merged) 10jenkins-bot: Connection: Correct read-only detection [extensions/CirrusSearch] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003827 (https://phabricator.wikimedia.org/T354793) (owner: 10Ebernhardson) [23:41:54] thcipriani: hurrah [23:41:55] looking now [23:42:10] thcipriani: yep [23:42:12] works as advertised! [23:42:13] please sync [23:42:26] * thcipriani does [23:42:33] !log thcipriani@deploy2002 thcipriani and jdlrobson: Continuing with sync [23:46:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P56874 and previous config saved to /var/cache/conftool/dbconfig/20240215-234600-ladsgroup.json [23:46:59] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [23:50:10] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:1003485|Add border-collapse to wikitable (T357589)]] (duration: 11m 31s) [23:50:15] T357589: [Regression] Wikitables has unwanted border spacing on mobile - https://phabricator.wikimedia.org/T357589 [23:50:16] ^ Jdlrobson live now [23:50:35] ebernhardson: want me to sling yours out? Or are you already poised over the enter key? [23:50:53] thcipriani: sure, you can ship it. 
I didn't particularly prepare [23:52:25] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] [23:52:31] T354793: SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs - https://phabricator.wikimedia.org/T354793 [23:52:31] T356526: High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 [23:52:46] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [23:52:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1005.eqiad.wmnet with OS bullseye [23:53:50] !log thcipriani@deploy2002 ebernhardson and thcipriani: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:54:33] ^ ebernhardson on mwdebug machines, any way to test all's well? [23:55:02] (03CR) 10Cwhite: add alert for planet content updates (last modified) (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn) [23:55:06] thcipriani: not really, it's a jobqueue only thing. 
can ship it [23:55:23] okie doke, deploying more then [23:55:26] (03PS2) 10Bking: cloudelastic: Complete cloudelastic1005's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003563 (https://phabricator.wikimedia.org/T355617) [23:55:31] !log thcipriani@deploy2002 ebernhardson and thcipriani: Continuing with sync [23:56:16] (03CR) 10Bking: [V: 03+2 C: 03+2] cloudelastic: Complete cloudelastic1005's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003563 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [23:56:30] "continuing with sync" is a better way to say "deploying more"---this is why we have tools :) [23:56:48] :) [23:58:23] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1003890 (owner: 10Gergő Tisza)