[00:24:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T352010)', diff saved to https://phabricator.wikimedia.org/P58269 and previous config saved to /var/cache/conftool/dbconfig/20240301-002417-ladsgroup.json
[00:24:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:39:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1007672
[00:39:11] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9589485 (wiki_willy) Thanks @Volans, that makes sense. My preference would be to leave Netbox as is, and use the accounting spreadsheet to make the S/N connection to each other. Would we be addi...
[00:39:15] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1007672 (owner: TrainBranchBot)
[00:39:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P58270 and previous config saved to /var/cache/conftool/dbconfig/20240301-003923-ladsgroup.json
[00:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P58271 and previous config saved to /var/cache/conftool/dbconfig/20240301-005429-ladsgroup.json
[01:00:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1007672 (owner: TrainBranchBot)
[01:09:36] !log ladsgroup@cumin1002 dbctl
commit (dc=all): 'Repooling after maintenance db2114 (T352010)', diff saved to https://phabricator.wikimedia.org/P58273 and previous config saved to /var/cache/conftool/dbconfig/20240301-010936-ladsgroup.json
[01:09:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[01:09:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:09:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[01:35:41] (ProbeDown) firing: (2) Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:40:41] (ProbeDown) resolved: (2) Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:41:45] the kubemasters seem to have self-resolved and the probes are happy now but I'm still seeing timeouts in the kube-apiserver logs -- probably just from ops that started during the unhealthy period and are now timing out, but still poking around a bit
[02:38:02] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:02] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:48:34] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:40:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:45:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:36:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[06:36:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[06:36:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T354015)', diff saved to https://phabricator.wikimedia.org/P58274 and previous config saved to /var/cache/conftool/dbconfig/20240301-063633-marostegui.json
[06:36:37] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[06:36:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P58275 and previous config saved to /var/cache/conftool/dbconfig/20240301-063647-root.json
[06:39:14] !log marostegui@cumin1002 START
- Cookbook sre.hosts.decommission for hosts db1118.eqiad.wmnet
[06:43:04] (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:36] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:47:35] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1118.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:48:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1118.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:48:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:48:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1118.eqiad.wmnet
[06:50:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:51:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P58276 and previous config saved to /var/cache/conftool/dbconfig/20240301-065152-root.json
[06:53:04] (ProbeDown) resolved: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240301T0700)
[07:06:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P58277 and previous config saved to /var/cache/conftool/dbconfig/20240301-070657-root.json
[07:22:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P58278 and previous config saved to /var/cache/conftool/dbconfig/20240301-072202-root.json
[07:23:39] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on vrts1002.eqiad.wmnet with reason: Not in production, silencing alarms until we decide whether to decom or not
[07:23:52] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on vrts1002.eqiad.wmnet with reason: Not in production, silencing alarms until we decide whether to decom or not
[07:37:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P58279 and previous config saved to /var/cache/conftool/dbconfig/20240301-073707-root.json
[07:52:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P58280 and previous config saved to /var/cache/conftool/dbconfig/20240301-075212-root.json
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240301T0800)
[08:15:51] SRE, FY2023/2024-Q3: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838 (andrea.denisse)
[08:16:09] SRE, FY2023/2024-Q3: prometheus-icinga-am.service Fails to Start on alert2001 - https://phabricator.wikimedia.org/T358838#9589822 (andrea.denisse) a:andrea.denisse
[08:18:28] (PS2) Majavah: P:openstack: rabbitmq: remove cloudcontrol term [puppet] - https://gerrit.wikimedia.org/r/1007293
[08:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:22:26] (CR) Majavah: [C: +2] P:openstack: rabbitmq: remove cloudcontrol term [puppet] - https://gerrit.wikimedia.org/r/1007293 (owner: Majavah)
[08:30:19] (CR) JMeybohm: [C: -1] admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: JMeybohm)
[08:32:35] (PS2) Majavah: P:openstack: rabbitmq: remove cloud-hosts term [puppet] - https://gerrit.wikimedia.org/r/1007294
[08:32:37] (PS2) Majavah: P:openstack: rabbitmq: remove cinder-backups term [puppet] - https://gerrit.wikimedia.org/r/1007295
[08:32:39] (PS5) Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - https://gerrit.wikimedia.org/r/998419
[08:37:56] (CR) Majavah: [C: +2] P:openstack: rabbitmq: remove cloud-hosts term (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007294 (owner: Majavah)
[08:40:50] (CR) Brouberol: [C: +2] superset: fix, add missing gunicorn statsd export config [deployment-charts] - https://gerrit.wikimedia.org/r/1007650 (https://phabricator.wikimedia.org/T358778) (owner: Brouberol)
[08:42:05] !log
brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[08:42:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[08:42:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[08:43:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[08:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:35] (PS3) Majavah: P:openstack: rabbitmq: remove cinder-backups term [puppet] - https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065)
[08:52:37] (PS6) Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - https://gerrit.wikimedia.org/r/998419
[08:52:39] (PS1) Majavah: Remove unused eqiad1 cinder backup role [puppet] - https://gerrit.wikimedia.org/r/1007853 (https://phabricator.wikimedia.org/T344065)
[08:53:42] (PS1) Brouberol: superset: rollout the cache user isolation feature flags everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/1007854 (https://phabricator.wikimedia.org/T273850)
[08:55:05] (CR) Majavah: P:openstack: rabbitmq: remove cinder-backups term (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065) (owner: Majavah)
[09:01:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:27:28] (PS1) Brouberol: Remove an-tool1005 and associated hieradata [puppet] - https://gerrit.wikimedia.org/r/1007857
(https://phabricator.wikimedia.org/T358706)
[09:37:03] (PS2) Brouberol: Remove an-tool1005 and associated hieradata [puppet] - https://gerrit.wikimedia.org/r/1007857 (https://phabricator.wikimedia.org/T358706)
[09:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[10:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[10:14:29] (CR) Arturo Borrero Gonzalez: [C: +1] "The use in codfw1dev may be just a leftover." [puppet] - https://gerrit.wikimedia.org/r/1007853 (https://phabricator.wikimedia.org/T344065) (owner: Majavah)
[10:14:52] (CR) Majavah: [C: +2] Remove unused eqiad1 cinder backup role [puppet] - https://gerrit.wikimedia.org/r/1007853 (https://phabricator.wikimedia.org/T344065) (owner: Majavah)
[10:14:58] (CR) Arturo Borrero Gonzalez: P:openstack: rabbitmq: remove cinder-backups term (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065) (owner: Majavah)
[10:16:09] (PS7) Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - https://gerrit.wikimedia.org/r/998419
[10:16:35] (CR) Majavah: P:openstack: rabbitmq: use firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/998419 (owner: Majavah)
[10:25:32] (CR) Majavah: P:openstack: rabbitmq: use firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/998419 (owner: Majavah)
[10:34:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance
[10:34:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 2:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance
[10:40:54] (PS1) Ladsgroup: db2117: Disable notification [puppet] - https://gerrit.wikimedia.org/r/1007863
[10:41:41] (Abandoned) Majavah: puppetserver: '/srv/puppet_code/environments' owned by puppet/puppet [puppet] - https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: Andrew Bogott)
[10:42:29] (CR) Arturo Borrero Gonzalez: [C: +1] P:openstack: rabbitmq: remove cinder-backups term [puppet] - https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065) (owner: Majavah)
[10:42:59] (CR) Ladsgroup: [C: +2] db2117: Disable notification [puppet] - https://gerrit.wikimedia.org/r/1007863 (owner: Ladsgroup)
[10:46:28] (CR) Arturo Borrero Gonzalez: P:openstack: rabbitmq: use firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/998419 (owner: Majavah)
[10:54:40] (PS4) Majavah: P:openstack: rabbitmq: remove cinder-backups term [puppet] - https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065)
[10:54:42] (PS8) Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - https://gerrit.wikimedia.org/r/998419
[10:54:44] (PS1) Majavah: P:openstack: rabbitmq: restrict clustering ports [puppet] - https://gerrit.wikimedia.org/r/1007864
[11:03:41] (CR) Jbond: [C: +1] Change build image user from root to nobody [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1004192 (owner: Hashar)
[11:04:17] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[11:06:40] (CR) Jbond: "To add a bit of context here.
puppet-merge writes the sha1 to some web host on the puppetmaster, config-master has a proxy rule to fetch" [puppet] - https://gerrit.wikimedia.org/r/1007363 (https://phabricator.wikimedia.org/T341717) (owner: Muehlenhoff)
[11:11:17] (CR) Jbond: P:puppetserver: git: use creates for initial deploy-code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007396 (owner: Majavah)
[11:14:10] (CR) Jbond: "lgtm nit inline" [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[11:15:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[11:16:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance
[11:16:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T352010)', diff saved to https://phabricator.wikimedia.org/P58281 and previous config saved to /var/cache/conftool/dbconfig/20240301-111610-ladsgroup.json
[11:16:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:17:19] (CR) Jbond: [C: +1] "lgtm comment inline" [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[11:18:14] (CR) Jbond: Allow kerberos::systemd::timer to use a custom email sender (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) (owner: Btullis)
[11:20:56] (PS1) Majavah: P:wmcs: ntp: generate ACLs via network class [puppet] - https://gerrit.wikimedia.org/r/1007886
[11:25:02] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1557/co" [puppet] - https://gerrit.wikimedia.org/r/1007886 (owner: Majavah)
[11:32:56] !log mfossati@deploy2002 Started
deploy [airflow-dags/platform_eng@241457d]: (no justification provided)
[11:33:25] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@241457d]: (no justification provided) (duration: 00m 28s)
[11:33:56] (PS1) Clément Goubert: Move 6 codfw appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1007888 (https://phabricator.wikimedia.org/T351074)
[11:35:18] (PS1) Majavah: P:dumps::distribution::nfs: use networks class for WMCS network ranges [puppet] - https://gerrit.wikimedia.org/r/1007889
[11:36:53] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1558/co" [puppet] - https://gerrit.wikimedia.org/r/1007889 (owner: Majavah)
[11:44:38] (PS1) Clément Goubert: Move 6 eqiad appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1007892 (https://phabricator.wikimedia.org/T351074)
[11:45:20] (CR) Clément Goubert: [C: +2] mw-api-int: Increase replicas to 240 total [deployment-charts] - https://gerrit.wikimedia.org/r/1007584 (https://phabricator.wikimedia.org/T356497) (owner: Clément Goubert)
[11:46:37] (Merged) jenkins-bot: mw-api-int: Increase replicas to 240 total [deployment-charts] - https://gerrit.wikimedia.org/r/1007584 (https://phabricator.wikimedia.org/T356497) (owner: Clément Goubert)
[11:47:03] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1173.eqiad.wmnet
[11:48:17] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:48:48] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:54:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1173.eqiad.wmnet
[11:55:23] (PS1) Clément Goubert: mw-api-int: Hold eqiad back on resources [deployment-charts] - https://gerrit.wikimedia.org/r/1007893
(https://phabricator.wikimedia.org/T356497)
[11:55:40] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:56:19] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:57:01] (CR) Clément Goubert: [C: +2] "Self-merging because of possible blocked deployments" [deployment-charts] - https://gerrit.wikimedia.org/r/1007893 (https://phabricator.wikimedia.org/T356497) (owner: Clément Goubert)
[11:57:54] (Merged) jenkins-bot: mw-api-int: Hold eqiad back on resources [deployment-charts] - https://gerrit.wikimedia.org/r/1007893 (https://phabricator.wikimedia.org/T356497) (owner: Clément Goubert)
[11:58:15] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:58:33] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:32:06] (PS7) Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213)
[12:34:11] (CR) Elukey: "Should be ready to go now!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: Elukey)
[12:43:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T354015)', diff saved to https://phabricator.wikimedia.org/P58282 and previous config saved to /var/cache/conftool/dbconfig/20240301-124306-marostegui.json
[12:43:10] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[12:43:25] (CR) Klausman: [C: +1] "Just a few nits, feel free to ignore them."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: Elukey)
[12:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58283 and previous config saved to /var/cache/conftool/dbconfig/20240301-125812-marostegui.json
[13:02:07] !log Depooling mw1387.eqiad.wmnet,mw1389.eqiad.wmnet,mw1391.eqiad.wmnet,mw1393.eqiad.wmnet,mw1395.eqiad.wmnet,mw1397.eqiad.wmnet for reimage to k8s nodes - T351074
[13:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:23] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:03:24] !log refreshing image metadata of commons Алтарна_частина.jpg
[13:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:50] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1387.eqiad.wmnet with OS bullseye
[13:12:07] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1389.eqiad.wmnet with OS bullseye
[13:12:26] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1391.eqiad.wmnet with OS bullseye
[13:12:43] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1393.eqiad.wmnet with OS bullseye
[13:13:02] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1395.eqiad.wmnet with OS bullseye
[13:13:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58284 and previous config saved to
/var/cache/conftool/dbconfig/20240301-131318-marostegui.json
[13:13:22] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1397.eqiad.wmnet with OS bullseye
[13:25:34] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1387.eqiad.wmnet with reason: host reimage
[13:25:38] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1389.eqiad.wmnet with reason: host reimage
[13:26:05] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1391.eqiad.wmnet with reason: host reimage
[13:26:17] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1393.eqiad.wmnet with reason: host reimage
[13:26:26] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1395.eqiad.wmnet with reason: host reimage
[13:26:45] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1397.eqiad.wmnet with reason: host reimage
[13:28:09] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1387.eqiad.wmnet with reason: host reimage
[13:28:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T354015)', diff saved to https://phabricator.wikimedia.org/P58285 and previous config saved to /var/cache/conftool/dbconfig/20240301-132824-marostegui.json
[13:28:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:28:34] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[13:28:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:30:44] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1395.eqiad.wmnet with reason: host reimage
[13:33:12] !log cgoubert@cumin2002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1391.eqiad.wmnet with reason: host reimage
[13:35:36] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1389.eqiad.wmnet with reason: host reimage
[13:36:20] more hosts being k8sized? 👀
[13:36:57] ah, T351074 is the task I hadn’t seen before :)
[13:36:58] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:37:52] Lucas_WMDE: Yep
[13:37:58] nice \o/
[13:38:05] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1397.eqiad.wmnet with reason: host reimage
[13:38:08] Preparing for mobileapps to move to core calls
[13:40:21] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:40:38] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:41:34] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1393.eqiad.wmnet with reason: host reimage
[13:46:37] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1387.eqiad.wmnet with OS bullseye
[13:47:45] (PS2) Majavah: Add new role for OVS cloudnet [puppet] - https://gerrit.wikimedia.org/r/1007900 (https://phabricator.wikimedia.org/T358761)
[13:47:47] (PS2) Majavah: Add some new networks for WMCS OVS testing [puppet] - https://gerrit.wikimedia.org/r/1007901 (https://phabricator.wikimedia.org/T358761)
[13:48:25] (PS1) Elukey: kserve: add missing comma to kserve yaml config [deployment-charts] - https://gerrit.wikimedia.org/r/1007903 (https://phabricator.wikimedia.org/T337213)
[13:48:41] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1395.eqiad.wmnet with OS bullseye
[13:51:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1391.eqiad.wmnet with OS bullseye
[13:51:42] (CR) Elukey: [C:
+2] kserve: add missing comma to kserve yaml config [deployment-charts] - https://gerrit.wikimedia.org/r/1007903 (https://phabricator.wikimedia.org/T337213) (owner: Elukey)
[13:53:30] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1389.eqiad.wmnet with OS bullseye
[13:54:40] (CR) Majavah: "Feedback on these names is welcome :-) I'll add these to Netbox too once this patch is merged." [puppet] - https://gerrit.wikimedia.org/r/1007901 (https://phabricator.wikimedia.org/T358761) (owner: Majavah)
[13:56:11] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1397.eqiad.wmnet with OS bullseye
[13:57:26] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:57:39] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:59:54] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1393.eqiad.wmnet with OS bullseye
[14:00:31] !log Running homer 'cr*eqiad*' commit 'T351074'
[14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:37] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[14:01:11] (PS1) EoghanGaffney: [etherpad] Set userName and userColor padOptions to null [puppet] - https://gerrit.wikimedia.org/r/1007905 (https://phabricator.wikimedia.org/T316421)
[14:01:58] (CR) Elukey: [C: +1] ML isvcs: drop our own memory-usage alerts [alerts] - https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742) (owner: Klausman)
[14:02:12] (PS1) Slyngshede: Implement feedback for signup page, from design team.
[software/bitu] - https://gerrit.wikimedia.org/r/1007906 (https://phabricator.wikimedia.org/T355205)
[14:05:16] (PS1) Elukey: kserve: use numeric id for nobody [docker-images/production-images] - https://gerrit.wikimedia.org/r/1007907 (https://phabricator.wikimedia.org/T337213)
[14:06:57] (PS1) Filippo Giunchedi: data-engineering: fix spark alerts deployment [alerts] - https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255)
[14:08:06] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1387.eqiad.wmnet|mw1389.eqiad.wmnet|mw1391.eqiad.wmnet|mw1393.eqiad.wmnet|mw1395.eqiad.wmnet|mw1397.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[14:08:40] !log Pooled and uncordoned mw1387.eqiad.wmnet mw1389.eqiad.wmnet mw1391.eqiad.wmnet mw1393.eqiad.wmnet mw1395.eqiad.wmnet mw1397.eqiad.wmnet - T351074
[14:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:43] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[14:11:15] (PS1) Clément Goubert: Revert "mw-api-int: Hold eqiad back on resources" [deployment-charts] - https://gerrit.wikimedia.org/r/1007878
[14:14:00] (PS2) Slyngshede: Implement feedback for signup page, from design team.
[software/bitu] - 10https://gerrit.wikimedia.org/r/1007906 (https://phabricator.wikimedia.org/T355205) [14:21:28] (03PS2) 10Filippo Giunchedi: data-engineering: fix spark alerts deployment [alerts] - 10https://gerrit.wikimedia.org/r/1007908 (https://phabricator.wikimedia.org/T354255) [14:21:30] (03PS1) 10Filippo Giunchedi: data-platform: fix superset available alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007911 (https://phabricator.wikimedia.org/T354255) [14:23:44] (03CR) 10Klausman: [C: 03+2] ML isvcs: drop our own memory-usage alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742) (owner: 10Klausman) [14:23:59] (03CR) 10Klausman: [C: 03+1] kserve: use numeric id for nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007907 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [14:24:53] (03Merged) 10jenkins-bot: ML isvcs: drop our own memory-usage alerts [alerts] - 10https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742) (owner: 10Klausman) [14:25:18] (03PS1) 10Clément Goubert: calico: Bump wikikube kube-controllers memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 [14:25:22] (03CR) 10Clément Goubert: [C: 03+2] Revert "mw-api-int: Hold eqiad back on resources" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007878 (owner: 10Clément Goubert) [14:26:21] (03Merged) 10jenkins-bot: Revert "mw-api-int: Hold eqiad back on resources" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007878 (owner: 10Clément Goubert) [14:27:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] kserve: use numeric id for nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007907 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [14:28:08] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:28:18] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@4421d2c] (releasing): (no 
justification provided) [14:28:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:28:56] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@4421d2c] (releasing): (no justification provided) (duration: 00m 38s) [14:30:48] (03PS1) 10Elukey: kserve: bump docker image version for the storage-initializer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007915 (https://phabricator.wikimedia.org/T337213) [14:32:42] (03CR) 10Ssingh: [C: 03+1] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1007703 (https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney) [14:33:38] (03CR) 10Ssingh: [V: 03+1] "Yeah, thanks, that adds up! I think at some point we added pybal to main and then we removed the component above. I will fix that and reve" [puppet] - 10https://gerrit.wikimedia.org/r/1007704 (owner: 10Ssingh) [14:33:47] (03Abandoned) 10Ssingh: pybal: install python-twisted from component/pybal [puppet] - 10https://gerrit.wikimedia.org/r/1007704 (owner: 10Ssingh) [14:34:03] (03PS1) 10Ssingh: Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007879 [14:38:03] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:07] (03PS1) 10Majavah: openstack: neutron: manage ml2 plugins directory [puppet] - 10https://gerrit.wikimedia.org/r/1007917 (https://phabricator.wikimedia.org/T326373) [14:42:57] (03CR) 10MVernon: [C: 03+1] "Thanks for your work on this! 
This seems like a reasonable place to start from, we can always tweak in the light of experience :)" [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:43:21] (03CR) 10DCausse: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1007905 (https://phabricator.wikimedia.org/T316421) (owner: 10EoghanGaffney) [14:43:52] (03PS1) 10Ssingh: dns::auth: move all service statement management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) [14:44:06] (03PS2) 10Ssingh: dns::auth: move all service state management to confd [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) [14:44:53] (03PS1) 10Effie Mouzeli: mw-mcrouter: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007919 [14:45:27] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1559/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:46:07] (03CR) 10Ssingh: "To be merged on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007918 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:48:16] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: disable cron restarts on Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1007328 (https://phabricator.wikimedia.org/T358343) (owner: 10Majavah) [14:50:24] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007919 (owner: 10Effie Mouzeli) [14:51:16] (03CR) 10Elukey: [C: 03+2] kserve: bump docker image version for the storage-initializer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007915 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [14:51:19] (03Merged) 10jenkins-bot: mw-mcrouter: enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007919 (owner: 10Effie Mouzeli) [14:52:06] (03CR) 10Majavah: "Seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/1007636 (https://phabricator.wikimedia.org/T356904) (owner: 10FNegri) [14:52:32] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [14:52:34] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [14:52:55] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [14:52:56] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [14:53:21] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [14:54:29] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [14:57:14] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:57:29] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:58:03] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:32] (03CR) 10Ssingh: [V: 03+1] "This patch has been superseded by the unified patch for all services in I28c395e." [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:05:42] (03Abandoned) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:05:46] (03Abandoned) 10Ssingh: depool codfw: (if required) maintenance work in codfw [dns] - 10https://gerrit.wikimedia.org/r/1007656 (owner: 10Ssingh) [15:05:54] (03Abandoned) 10Ssingh: dns6001: set confd_enabled to false [puppet] - 10https://gerrit.wikimedia.org/r/1006057 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:10:21] (03PS1) 10Elukey: kserve: add wmf-certificates in the right place for storage-init [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007923 (https://phabricator.wikimedia.org/T337213) [15:10:31] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [15:11:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] [etherpad] Set userName and userColor padOptions to null [puppet] - 10https://gerrit.wikimedia.org/r/1007905 (https://phabricator.wikimedia.org/T316421) (owner: 10EoghanGaffney) [15:11:11] PROBLEM - MariaDB Replica Lag: s1 on 
clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:12:29] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [15:12:37] latency spike on elastic@codfw, alert should resolve soon [15:17:39] (03CR) 10Klausman: [C: 03+1] kserve: add wmf-certificates in the right place for storage-init [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007923 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [15:18:31] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [15:18:35] RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [15:22:07] (03CR) 10Majavah: [C: 03+1] [wmcs-backup] Don't backup temp toolsdb volumes [puppet] - 10https://gerrit.wikimedia.org/r/1007636 (https://phabricator.wikimedia.org/T356904) (owner: 10FNegri) [15:25:01] (03CR) 10FNegri: [C: 03+2] [wmcs-backup] Don't backup 
temp toolsdb volumes [puppet] - 10https://gerrit.wikimedia.org/r/1007636 (https://phabricator.wikimedia.org/T356904) (owner: 10FNegri) [15:26:43] (03PS1) 10Clément Goubert: Move 3 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007926 (https://phabricator.wikimedia.org/T351074) [15:26:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] kserve: add wmf-certificates in the right place for storage-init [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007923 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [15:34:17] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 2.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:38:22] (03CR) 10Eevans: [C: 03+1] "Insofar as I understand what this does (I don't, really), it LGTM 😊" [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1007701 (https://phabricator.wikimedia.org/T357739) (owner: 10Ahmon Dancy) [15:44:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007886 (owner: 10Majavah) [15:44:58] (03PS1) 10Elukey: kserve: bump default Docker image for storage-init [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007931 (https://phabricator.wikimedia.org/T337213) [15:47:55] (03CR) 10Arturo Borrero Gonzalez: P:dumps::distribution::nfs: use networks class for WMCS network ranges (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007889 (owner: 10Majavah) [15:48:05] (03CR) 10Elukey: [C: 03+2] kserve: bump default Docker image for storage-init [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007931 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [15:48:26] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs: ntp: generate ACLs via network class [puppet] - 10https://gerrit.wikimedia.org/r/1007886 (owner: 10Majavah) [15:48:31] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir4002.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:31] PROBLEM - PyBal backends health check on lvs3010 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb_80: Servers ncredir3003.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:31] PROBLEM - PyBal backends health check on lvs3008 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_443: Servers ncredir3003.esams.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir3003.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:31] PROBLEM - PyBal backends health check on lvs6003 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_443: Servers ncredir6001.drmrs.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir6001.drmrs.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir6002.drmrs.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:48:31] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_443: Servers ncredir6001.drmrs.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir6002.drmrs.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir6002.drmrs.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:37] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir5001.eqsin.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir5001.eqsin.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir5002.eqsin.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir5002.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:37] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir5002.eqsin.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir5002.eqsin.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir5001.eqsin.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir5001.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:48:48] huh [15:48:56] what's happening here [15:48:57] (ProbeDown) firing: (11) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:49:16] oh uh [15:49:18] um [15:49:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007900 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [15:49:22] https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1&var-cluster=ulsfo%20prometheus%2Fops [15:49:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb_80: Servers ncredir2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:49:27] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir2002.codfw.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:49:31] !incidents [15:49:31] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:31] RECOVERY - PyBal backends health check on lvs3010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:31] RECOVERY - PyBal backends health check on lvs3008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:31] RECOVERY - PyBal backends health check on lvs6003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:31] RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:31] 4500 (UNACKED) [11x] ProbeDown sre (ncredir-https:443 probes/service) [15:49:31] 4499 (RESOLVED) [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw) [15:49:37] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:37] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:40] !ack 4500 [15:49:41] 4500 (ACKED) [11x] ProbeDown sre 
(ncredir-https:443 probes/service) [15:49:49] the dreaded spike of requests strikes again [15:50:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1007901 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [15:50:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:50:27] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:51:07] on all dcs is a bit weird though, isn't it? [15:51:15] first time for sure [15:51:18] usually it was just ulsfo [15:51:27] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:51:40] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:51:48] Interesting, it definitely rattled Elasticsearch [15:52:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1007917 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [15:53:56] (03PS1) 10Gehel: query_service: refactoring 'query_service::monitor::updater' [puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) [15:53:57] (ProbeDown) resolved: (11) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah) [15:55:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1007864 (owner: 10Majavah) [15:55:54] (03CR) 10Arturo Borrero Gonzalez: "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007295 (https://phabricator.wikimedia.org/T344065) (owner: 10Majavah) [15:56:48] (03CR) 10Gehel: "Some explanation of what I'm trying to achieve on https://youtu.be/XfCZ8tO3QwQ" [puppet] - 10https://gerrit.wikimedia.org/r/1007933 (https://phabricator.wikimedia.org/T357496) (owner: 10Gehel) [15:56:58] (03CR) 10Kamila Součková: [C: 03+1] Move 3 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007926 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [15:57:39] !log Depooling mw1384.eqiad.wmnet,mw1432.eqiad.wmnet,mw1433.eqiad.wmnet for move to k8s - T351074 [15:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:42] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:00:08] (03CR) 10Clément Goubert: [C: 03+2] Move 3 appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1007926 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [16:03:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance [16:03:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2117.codfw.wmnet with reason: Silence for maintenance [16:03:31] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [16:04:39] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[16:05:01] !log dancy@deploy2002 Started deploy [analytics/refinery@6e8f25b]: (no justification provided) [16:05:05] !log dancy@deploy2002 Finished deploy [analytics/refinery@6e8f25b]: (no justification provided) (duration: 00m 03s) [16:06:36] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1433.eqiad.wmnet with OS bullseye [16:06:39] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1384.eqiad.wmnet with OS bullseye [16:06:41] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1432.eqiad.wmnet with OS bullseye [16:15:15] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [16:15:43] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:16:05] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [16:16:23] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:16:46] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:17:06] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[16:17:25] (SystemdUnitFailed) firing: ferm.service on kubemaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:52] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1433.eqiad.wmnet with reason: host reimage [16:20:04] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1432.eqiad.wmnet with reason: host reimage [16:20:48] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1384.eqiad.wmnet with reason: host reimage [16:22:19] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1433.eqiad.wmnet with reason: host reimage [16:22:25] (SystemdUnitFailed) resolved: ferm.service on kubemaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:56] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1432.eqiad.wmnet with reason: host reimage [16:27:13] (03CR) 10Gehel: [C: 04-1] "We should reduce duplication between _public and _internal. Happy to discuss in more details if needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [16:27:48] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1384.eqiad.wmnet with reason: host reimage [16:40:16] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1433.eqiad.wmnet with OS bullseye [16:43:04] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1432.eqiad.wmnet with OS bullseye [16:44:21] (03PS5) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) [16:44:45] (03CR) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [16:46:34] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1384.eqiad.wmnet with OS bullseye [16:46:57] !log Running homer 'cr*eqiad*' commit 'T351074' [16:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:00] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:52:56] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1384.eqiad.wmnet|mw1432.eqiad.wmnet|mw1433.eqiad.wmnet),cluster=kubernetes,service=kubesvc [16:54:18] !log Pooled and uncordoned mw1384.eqiad.wmnet mw1432.eqiad.wmnet mw1433.eqiad.wmnet - T351074 [16:54:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:21] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:55:38] (03PS3) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) [17:04:22] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9591255 (10mpopov) Approved from my side, both for the request in general and analytics-product-users membership :) [17:05:13] (03CR) 10Bking: wdqs: Distinguish between public and internal monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:05:34] (03PS4) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) [17:07:12] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9591261 (10mpopov) Thank you @cmooney for taking this non-standard case on and helping KC out! This dual account thing has become a real thorn for KC so... 
[17:07:26] (03CR) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [17:09:23] (03PS5) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) [17:11:35] (03PS4) 10Dzahn: contint: create ci_test role for zuul-only and apply on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) [17:12:01] 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9591267 (10Jclark-ctr) @BTullis I will be available monday 10am (est) if that works for you [17:13:43] (03CR) 10JMeybohm: [C: 03+1] "These are not guaranteed to run on the kubemasters (if that impression has been given), but change is fine anyways" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007912 (owner: 10Clément Goubert) [17:15:28] (03CR) 10Dzahn: [C: 03+1] "after adding the ci::manager_host hiera key (thanks Jaime Nuche) it works now: https://puppet-compiler.wmflabs.org/output/1007434/1563/con" [puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [17:15:37] 06SRE, 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T358812#9591289 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Duplicate ticket for T358787 [17:26:18] 06SRE, 06Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627#9591343 (10Jclark-ctr) [17:26:36] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 06Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463#9591341 (10Jclark-ctr) 05Open→03Resolved disconnected and removed from netbox 
[17:36:02] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9591421 (10KCVelaga_WMF) [17:39:24] one of the databases is very lagged - 10.64.32.13: 14947.515431 seconds lagged [17:42:49] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9591448 (10Umherirrender) There are some information about the 401 status in T228292#7490101 There is also {T206252}, which could be related, but oth... [17:45:14] (03PS1) 10Bking: Elastic: remove soon-to-be decommissioned master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1007952 [17:49:56] one of the databases is very lagged - 10.64.32.13: 14947.515431 seconds lagged - this is causing issues due to maxlag being exceeded [17:51:10] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9591474 (10VRiley-WMF) @jcrespo We are at the point to image. Would you be able to assist for updating Puppet? [17:53:35] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9591478 (10phaultfinder) [17:55:26] (03PS6) 10Cathal Mooney: cloudgw: filtering traffic routing between VMs and cloud vrf [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) [17:58:02] !log dancy@deploy2002 Started deploy [cassandra/logstash-logback-encoder@162f72f]: (no justification provided) [17:58:10] !log dancy@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@162f72f]: (no justification provided) (duration: 00m 08s) [18:00:26] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9591527 (10Jclark-ctr) @jcrespo are the Raid instructions backwards os is usually on ssd's RAID 0? 
HW Raid: Y/N Create 2 logical disks- first one with the HDs with RAID 6 wh... [18:00:50] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9591528 (10jcrespo) Sure, let me know what can I do for you, and I will get it done next week. The only custom stuff compared to other hosts is the HW RAID partitioning, the res... [18:02:20] [18:02:41] thanks acn [18:02:55] also, $ host 10.64.32.13 ns0.wikimedia.org [18:02:56] 13.32.64.10.in-addr.arpa domain name pointer db1169.eqiad.wmnet. [18:03:29] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9591536 (10jcrespo) >>! In T355353#9591527, @Jclark-ctr wrote: > @jcrespo are the Raid instructions backwards os is usually on ssd's RAID 0? Os instructions are correct, we... [18:04:09] 06SRE: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892 (10JJMC89) [18:11:15] (03PS1) 10Ssingh: P:cache::varnish::frontend: reload vcl in beta [puppet] - 10https://gerrit.wikimedia.org/r/1007953 (https://phabricator.wikimedia.org/T358887) [18:12:22] !log taavi@cumin1002 dbctl commit (dc=all): 'depool db1169 T358892', diff saved to https://phabricator.wikimedia.org/P58287 and previous config saved to /var/cache/conftool/dbconfig/20240301-181221-taavi.json [18:12:26] T358892: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892 [18:12:48] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9591569 (10JJMC89) [18:13:15] (03CR) 10Ahmon Dancy: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1007953 (https://phabricator.wikimedia.org/T358887) (owner: 10Ssingh) [18:14:02] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9591586 (10Marostegui) a:03Marostegui [18:14:32] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9591584 (10Marostegui) I will review what happened when I get home as that schema change is being done with the script, it should've depooled it. [18:14:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1564/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007953 (https://phabricator.wikimedia.org/T358887) (owner: 10Ssingh) [18:15:06] (03PS5) 10Dzahn: contint: create ci_test role for zuul-only and apply on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) [18:33:30] (03PS6) 10Dzahn: contint: create ci_test role for zuul-only and apply on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) [18:45:44] (03CR) 10Dzahn: [C: 03+2] "- disabled monitoring notifications, added puppet7, added nftables as firewall provider, added team as role owner etc.." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [18:50:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:50:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:50:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T354015)', diff saved to https://phabricator.wikimedia.org/P58288 and previous config saved to /var/cache/conftool/dbconfig/20240301-185046-marostegui.json [18:50:54] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [18:53:54] taavi: ^ that's so weird cause the script now depooled (I didn't do anything) [18:54:03] odd [18:54:07] So something must have gone wrong [18:54:14] I'll check later [18:56:46] 06SRE, 06Infrastructure-Foundations, 10Mail, 07Security: Domains of most projects do not have DMARC policy - https://phabricator.wikimedia.org/T211403#9591866 (10Frostly) (it looks like indeed that emails from all projects are sent from @wikimedia.org, so a restrictive policy for the other domains is a goo... [18:57:48] 06SRE, 10MW-on-K8s, 10Scap, 06serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9591863 (10CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/230 Optionally run checks after deploying to testservers [19:03:18] 06SRE, 10ops-eqiad: PowerSupplyFailure - an-coord1003 - https://phabricator.wikimedia.org/T358787#9591916 (10VRiley-WMF) Swapped out power supply. It is back in operation. 
[19:03:32] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Set testservers_check_cmd to httppb in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) [19:03:35] 06SRE, 10ops-eqiad: PowerSupplyFailure - an-coord1003 - https://phabricator.wikimedia.org/T358787#9591920 (10VRiley-WMF) 05Open→03Resolved [19:04:27] (03PS1) 10Dzahn: ci: add profile::ci::httpd to ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1007958 (https://phabricator.wikimedia.org/T358237) [19:05:39] (03PS2) 10Dzahn: ci: add profile::ci::httpd to ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1007958 (https://phabricator.wikimedia.org/T358237) [19:06:28] (03PS1) 10Sbailey: wikifeeds: upgrade to node18 from node16 deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) [19:06:44] (03CR) 10CI reject: [V: 04-1] scap.cfg.erb: Set testservers_check_cmd to httppb in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) (owner: 10Ahmon Dancy) [19:07:44] (03CR) 10Dzahn: [C: 03+2] ci: add profile::ci::httpd to ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1007958 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:08:33] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Set testservers_check_cmd to httppb in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) [19:11:42] (03CR) 10CI reject: [V: 04-1] scap.cfg.erb: Set testservers_check_cmd to httppb in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) (owner: 10Ahmon Dancy) [19:11:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:12:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:12:54] !log contint1003 - sudo a2dismod mpm_event ; a2enmod php7.4 ; systemctl restart apache2 - common issue with puppet setup of an apache on first run [19:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:17:51] (03PS3) 10Ahmon Dancy: scap.cfg.erb: Set testservers_check_cmd to httppb in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) [19:19:33] 06SRE, 10Continuous-Integration-Infrastructure, 06collaboration-services, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9591960 (10Dzahn) This is done now. please use `contint1003.eqiad.wmnet` with private IP. test VM has been created and... [19:20:57] (03CR) 10Dzahn: "This patch is here in case you would like to copy the contents of /var/lib/zuul from the prod CI host for testing." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007433 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:21:15] 06SRE, 10Continuous-Integration-Infrastructure, 06collaboration-services, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9591961 (10Dzahn) 05Open→03Resolved a:03Dzahn [19:21:55] (03Abandoned) 10Dzahn: site: add ci role to contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007017 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:27:57] 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9591986 (10VRiley-WMF) Hi @dr0ptp4kt I have racked and stacked cp1086 in the following location Rack B 7 U 20 CableID 1966 Port 20 P... [19:30:15] 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9591988 (10dr0ptp4kt) Thanks @VRiley-WMF ! @bking is up next for imaging, I think. 
[19:34:41] (03PS4) 10Ahmon Dancy: scap.cfg.erb: Set testservers_check_cmd to httppb in production [puppet] - 10https://gerrit.wikimedia.org/r/1007956 (https://phabricator.wikimedia.org/T358117) [20:11:27] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9592030 (10Ladsgroup) I do see it being depooled by the script: T354015#9587476 [20:12:01] (03PS1) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) [20:12:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:13:14] (03CR) 10CI reject: [V: 04-1] elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:15:45] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9592053 (10Ladsgroup) Funnily enough I don't see it being repooled in the ticket [20:21:33] (03PS2) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) [20:29:33] (03CR) 10JHathaway: P:puppetserver: git: use creates for initial deploy-code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [20:31:26] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9592075 (10Ladsgroup) I can't find it in the logs but the usual reason is probably because db1169 was depooled when the script got started and since the config gets loaded at start of the script, it assumed it's one o... 
[20:32:59] (03PS3) 10Bking: elastic: add elastic2088-2109 to production role [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) [20:33:45] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [20:35:20] !log phabricator - added to WMF-NDA (group 61): Aline Bruenger, Corinna Hillebrand, Kai Nissen, Christoph Jauera (all WMDE staff appearing in NDA spreadsheet) T358578 [20:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:24] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [20:36:11] (03CR) 10JHathaway: [C: 03+1] cumin: fix insetup role report mapping [puppet] - 10https://gerrit.wikimedia.org/r/1007743 (owner: 10Volans) [20:36:40] (03CR) 10JHathaway: [C: 03+1] idp-test: Align acmechief setting to the role, not via host records [puppet] - 10https://gerrit.wikimedia.org/r/1007258 (owner: 10Muehlenhoff) [20:40:57] !log phabricator - added to WMF-NDA (group 61): Loren Johnson, Jonathan Fraine, Kris Litson, Lena Meintrup (all WMDE staff appearing in NDA spreadsheet) T358578 [20:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:00] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578 [20:45:43] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2109.codfw.wmnet with OS bullseye [20:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:48] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - 
https://phabricator.wikimedia.org/T358892#9592143 (10Marostegui) >>! In T358892#9592030, @Ladsgroup wrote: > I do see it being depooled by the script: T354015#9587476 That's from yesterday :) [21:04:39] (03PS1) 10Jforrester: ZObjectStore::updateZObjectAsSystemUser: Also give wf-staff rights [extensions/WikiLambda] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1007885 [21:10:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P58289 and previous config saved to /var/cache/conftool/dbconfig/20240301-211003-root.json [21:11:12] 06SRE, 06DBA: db1169 is lagged over 16000 seconds - https://phabricator.wikimedia.org/T358892#9592172 (10Marostegui) 05Open→03Resolved >>! In T358892#9592075, @Ladsgroup wrote: > I can't find it in the logs but the usual reason is probably because db1169 was depooled when the script got started and since t... [21:25:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P58290 and previous config saved to /var/cache/conftool/dbconfig/20240301-212508-root.json [21:36:29] (03PS5) 10Andrea Denisse: icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) [21:36:56] (03CR) 10Andrea Denisse: icinga: Set log group to 'nagios' to resolve permission conflicts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse) [21:40:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P58291 and previous config saved to /var/cache/conftool/dbconfig/20240301-214013-root.json [21:41:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [21:48:55] 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9592401 (10bking) @VRiley-WMF or @Jclark-ctr are there any other lifecycle steps I need to take to get this host back into production... [21:50:09] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:52:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2035 to codfw - jhancock@cumin2002" [21:55:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P58292 and previous config saved to /var/cache/conftool/dbconfig/20240301-215517-root.json [21:55:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2035 to codfw - jhancock@cumin2002" [21:55:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:01:03] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:02:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:03:03] (SystemdUnitFailed) firing: (2) check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:03:05] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: adding es2035 to codfw - jhancock@cumin2002" [22:03:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2035 to codfw - jhancock@cumin2002" [22:03:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:08:03] (SystemdUnitFailed) firing: (2) check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:10:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P58293 and previous config saved to /var/cache/conftool/dbconfig/20240301-221022-root.json [22:10:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:11:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2037 to codfw - jhancock@cumin2002" [22:11:34] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2109.codfw.wmnet with OS bullseye [22:11:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:11:45] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [22:12:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2037 to codfw - jhancock@cumin2002" [22:12:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:14:09] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:15:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:17:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2038 to codfw - jhancock@cumin2002" [22:18:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2038 to codfw - jhancock@cumin2002" [22:18:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:19:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922 (10cchen) [22:21:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:23:50] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:25:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P58294 and previous config saved to /var/cache/conftool/dbconfig/20240301-222527-root.json [22:26:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
adding es2039 to codfw - jhancock@cumin2002" [22:26:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2039 to codfw - jhancock@cumin2002" [22:26:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:27:25] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2109.codfw.wmnet with reason: host reimage [22:27:54] (03CR) 10Bking: [C: 04-1] "Unfortunately, the blast radius is pretty big for us...we could lose the ability to do routine operations if the library doesn't work. CCi" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [22:28:19] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:30:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2040 to codfw - jhancock@cumin2002" [22:30:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:31:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2040 to codfw - jhancock@cumin2002" [22:31:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:32:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2109.codfw.wmnet with reason: host reimage [22:32:50] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:34:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2005 to codfw - jhancock@cumin2002" [22:35:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2005 to codfw - jhancock@cumin2002" [22:35:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:44:20] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:46:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:46:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2006 to codfw - jhancock@cumin2002" [22:47:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2006 to codfw - jhancock@cumin2002" [22:47:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:48:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [22:48:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2109.codfw.wmnet with OS bullseye [22:51:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2036.mgmt.codfw.wmnet with reboot policy FORCED [22:52:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [22:53:32] (ProbeDown) resolved: (2) Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:53:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2038.mgmt.codfw.wmnet with reboot policy FORCED [22:53:58] !log jhancock@cumin2002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host es2038.mgmt.codfw.wmnet with reboot policy FORCED [22:54:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2038.mgmt.codfw.wmnet with reboot policy FORCED [22:56:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2039.mgmt.codfw.wmnet with reboot policy FORCED [22:57:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2040.mgmt.codfw.wmnet with reboot policy FORCED [22:57:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2040.mgmt.codfw.wmnet with reboot policy FORCED [22:57:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [22:58:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [22:58:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2040.mgmt.codfw.wmnet with reboot policy FORCED [23:02:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [23:03:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [23:06:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [23:06:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [23:08:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2035.mgmt.codfw.wmnet with reboot policy FORCED [23:09:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2037.mgmt.codfw.wmnet with reboot policy FORCED [23:13:45] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host es2036.mgmt.codfw.wmnet with reboot policy FORCED [23:18:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2038.mgmt.codfw.wmnet with reboot policy FORCED [23:18:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2039.mgmt.codfw.wmnet with reboot policy FORCED [23:19:39] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10FY2023/2024-Q3-Q4, 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9592622 (10bking) Hey @fnegri ! I posted [[ https://gerrit.wikimedia.org/r/c/operations/softwar... [23:20:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2040.mgmt.codfw.wmnet with reboot policy FORCED [23:24:18] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9592624 (10Jhancock.wm) [23:25:07] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9592625 (10Jhancock.wm) [23:25:27] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9592636 (10Jhancock.wm) provisioning script failed on both. will check the cables on next visit [23:30:55] (03CR) 10Bking: "above PCC failure is due to elastic2109 being on the wrong OS (bookworm instead of bullseye). 
I just finished reimaging this host, so tryi" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [23:31:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007969 (https://phabricator.wikimedia.org/T353878) (owner: 10Bking) [23:38:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:40:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51596 bytes in 1.435 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:41:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring