[00:01:00] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1023550|Set timezones for new wikis (T360310 T360303 T363263 T363256 T363250 T363243 T363270)]], [[gerrit:1023528|Update interwiki cache]] [00:01:34] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [00:01:36] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303 [00:01:36] T363263: Post-creation work for iglwiki - https://phabricator.wikimedia.org/T363263 [00:01:36] T363256: Post-creation work for kaawiktionary - https://phabricator.wikimedia.org/T363256 [00:01:37] T363250: Post-creation work for mswikisource - https://phabricator.wikimedia.org/T363250 [00:01:37] T363243: Post-creation work for kawikisource - https://phabricator.wikimedia.org/T363243 [00:01:37] T363270: Post-creation work for mywikisource - https://phabricator.wikimedia.org/T363270 [00:03:43] !log zabe@deploy1002 zabe: Backport for [[gerrit:1023550|Set timezones for new wikis (T360310 T360303 T363263 T363256 T363250 T363243 T363270)]], [[gerrit:1023528|Update interwiki cache]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:03:53] !log zabe@deploy1002 zabe: Continuing with sync [00:14:56] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1023550|Set timezones for new wikis (T360310 T360303 T363263 T363256 T363250 T363243 T363270)]], [[gerrit:1023528|Update interwiki cache]] (duration: 13m 56s) [00:15:08] * zabe done :) [00:15:26] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [00:15:27] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303 [00:15:27] T363263: Post-creation work for iglwiki - https://phabricator.wikimedia.org/T363263 [00:15:27] T363256: Post-creation work for kaawiktionary - https://phabricator.wikimedia.org/T363256 [00:15:28] T363250: Post-creation work for mswikisource - https://phabricator.wikimedia.org/T363250 [00:15:28] T363243: Post-creation 
work for kawikisource - https://phabricator.wikimedia.org/T363243 [00:15:28] T363270: Post-creation work for mywikisource - https://phabricator.wikimedia.org/T363270 [00:40:44] (PS1) Scott French: WIP: hieradata: disable etcd replication on conf2005 [puppet] - https://gerrit.wikimedia.org/r/1023554 (https://phabricator.wikimedia.org/T358636) [00:40:46] (PS1) Scott French: WIP: etcdmirror::instance: absent all resources [puppet] - https://gerrit.wikimedia.org/r/1023555 (https://phabricator.wikimedia.org/T358636) [00:40:47] (PS1) Scott French: WIP: etcdmirror: reconfigure with full-keyspace replication [puppet] - https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) [00:40:50] (PS1) Scott French: WIP: hieradata: reenable etcd replication on conf2005 [puppet] - https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) [00:45:25] (ProbeDown) firing: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:48:52] (ProbeDown) resolved: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:15:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:52] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:35] SRE, SRE-Access-Requests: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf and hghani - https://phabricator.wikimedia.org/T363288 (nshahquinn-wmf) NEW [02:49:20] SRE, SRE-Access-Requests, Movement-Insights: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf and hghani - https://phabricator.wikimedia.org/T363288#9739059 (nshahquinn-wmf) @OSefu-WMF, @mpopov could you approve this access for @Hghani and me? [03:00:25] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:05:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:30] (ProbeDown) firing: (2) Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:29:30] (ProbeDown) resolved: (2) Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:18:00] (PS2) KartikMistry: Update cxserver to 2024-04-23-221507-production [deployment-charts] - https://gerrit.wikimedia.org/r/1016077 (https://phabricator.wikimedia.org/T363263) [04:52:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [04:52:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [04:52:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P61129 and previous config saved to /var/cache/conftool/dbconfig/20240424-045230-ladsgroup.json [04:52:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:59:21] (CR) Muehlenhoff: [C:+2] Remove obsolete script to detect ever-changing puppet runs [puppet] - https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345909) (owner: Muehlenhoff) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T0600) [06:02:56] (CR) Muehlenhoff: [C:+2] heat: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1021868 (owner: Muehlenhoff) [06:17:59] !log installing glibc security updates [06:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:18] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for 
lina.farid - https://phabricator.wikimedia.org/T362959#9739149 (Lina_Farid_WMDE) Thank you all, I sent an email just now to @KFrancis . Apologies for the delay it seems that I need to update my notification preferences. [06:45:56] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for testreduce/Envoy [puppet] - https://gerrit.wikimedia.org/r/1023734 (https://phabricator.wikimedia.org/T135991) [06:47:54] (PS1) Muehlenhoff: Add Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1023735 (https://phabricator.wikimedia.org/T346935) [06:51:40] (CR) Muehlenhoff: [C:+2] Add Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1023735 (https://phabricator.wikimedia.org/T346935) (owner: Muehlenhoff) [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:15:48] (PS1) Muehlenhoff: Fix after piwik->matomo role rename [puppet] - https://gerrit.wikimedia.org/r/1023738 (https://phabricator.wikimedia.org/T349397) [07:19:04] (CR) Muehlenhoff: [C:+2] Fix after piwik->matomo role rename [puppet] - https://gerrit.wikimedia.org/r/1023738 (https://phabricator.wikimedia.org/T349397) (owner: Muehlenhoff) [07:40:40] (PS1) Muehlenhoff: apt_staging: Enable profile::auto_restarts::service for rsync/nginx/envoy [puppet] - https://gerrit.wikimedia.org/r/1023783 (https://phabricator.wikimedia.org/T135991) [07:58:38] (PS4) Aklapper: Replace a strlen(null) call for PHP 8.1 [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1020170 (https://phabricator.wikimedia.org/T342244) [08:03:19] (CR) Filippo Giunchedi: [C:+1] Revert "promote prometheus1006 as pushgateway primary" [dns] - https://gerrit.wikimedia.org/r/1023154 (owner: Herron) [08:03:26] (CR) Filippo Giunchedi: [C:+1] Revert "prometheus: promote prometheus1006 to pushgateway duty" [puppet] - https://gerrit.wikimedia.org/r/1023155 (owner: Herron) [08:03:33] (CR) Filippo Giunchedi: [C:+1] Revert "trafficserver: move prometheus-eqiad to prometheus1006" [puppet] - https://gerrit.wikimedia.org/r/1023156 (owner: Herron) [08:04:05] (CR) Filippo Giunchedi: [C:+2] jaeger: upgrade to 1.56 [deployment-charts] - https://gerrit.wikimedia.org/r/1023381 (https://phabricator.wikimedia.org/T362719) (owner: Filippo Giunchedi) [08:06:39] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for redis/idm [puppet] - https://gerrit.wikimedia.org/r/1023784 (https://phabricator.wikimedia.org/T135991) [08:07:42] (CR) Slyngshede: [C:+1] "LGTM" 
[puppet] - https://gerrit.wikimedia.org/r/1023784 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:08:13] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:08:58] !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:13:58] (CR) Brouberol: idp_test: register the mpic_next service configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) (owner: Brouberol) [08:14:37] (PS2) Brouberol: idp_test: register the mpic_next service configuration [puppet] - https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) [08:15:24] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for redis/idm [puppet] - https://gerrit.wikimedia.org/r/1023784 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:20:18] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023554 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [08:21:54] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023555 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [08:22:13] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for titan/Envoy [puppet] - https://gerrit.wikimedia.org/r/1023785 (https://phabricator.wikimedia.org/T135991) [08:22:25] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:06] (CR) Santiago Faci: [C:+1] "Looks good!" 
[puppet] - https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) (owner: Brouberol) [08:23:19] (CR) Santiago Faci: [C:+1] idp_test: register the mpic_next service configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) (owner: Brouberol) [08:24:09] (CR) JMeybohm: [V:+1 C:+2] Revert: kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - https://gerrit.wikimedia.org/r/1023417 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm) [08:24:56] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [08:25:14] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [08:26:58] (CR) Jelto: "looks mostly good, some comments and nits in-line" [puppet] - https://gerrit.wikimedia.org/r/1021948 (owner: EoghanGaffney) [08:27:41] (PS2) JMeybohm: toil: add rsyslog_imfile_remedy [puppet] - https://gerrit.wikimedia.org/r/1023422 (https://phabricator.wikimedia.org/T357616) (owner: Filippo Giunchedi) [08:28:35] (PS3) Filippo Giunchedi: toil: add rsyslog_imfile_remedy [puppet] - https://gerrit.wikimedia.org/r/1023422 (https://phabricator.wikimedia.org/T357616) [08:28:36] (PS1) Filippo Giunchedi: kubernetes: add rsyslog imfile remedy to nodes [puppet] - https://gerrit.wikimedia.org/r/1023806 (https://phabricator.wikimedia.org/T357616) [08:30:36] (CR) Filippo Giunchedi: [C:+1] Enable profile::auto_restarts::service for titan/Envoy [puppet] - https://gerrit.wikimedia.org/r/1023785 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:30:55] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 14): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2097/c" [puppet] - https://gerrit.wikimedia.org/r/1023422 (https://phabricator.wikimedia.org/T357616) (owner: Filippo Giunchedi) [08:31:34] (CR) CI reject: [V:-1] kubernetes: add rsyslog imfile remedy to nodes [puppet] - https://gerrit.wikimedia.org/r/1023806 (https://phabricator.wikimedia.org/T357616) (owner: Filippo Giunchedi) [08:32:56] (CR) JMeybohm: [C:+2] toil: add rsyslog_imfile_remedy [puppet] - https://gerrit.wikimedia.org/r/1023422 (https://phabricator.wikimedia.org/T357616) (owner: Filippo Giunchedi) [08:33:53] (CR) Fabfur: [V:+1 C:+2] hiera: buffer memory limit increase for cp4037 [puppet] - https://gerrit.wikimedia.org/r/1023060 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [08:34:45] (PS2) Filippo Giunchedi: kubernetes: add rsyslog imfile remedy to nodes [puppet] - https://gerrit.wikimedia.org/r/1023806 (https://phabricator.wikimedia.org/T357616) [08:35:04] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for titan/Envoy [puppet] - https://gerrit.wikimedia.org/r/1023785 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:36:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1248.eqiad.wmnet with reason: T362746 [08:36:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1248.eqiad.wmnet with reason: T362746 [08:36:58] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [08:37:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1248', diff saved to https://phabricator.wikimedia.org/P61130 and previous config saved to /var/cache/conftool/dbconfig/20240424-083736-arnaudb.json [08:38:40] (CR) Elukey: [V:+2 C:+2] "Done" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) 
(owner: Elukey) [08:38:57] (CR) Elukey: [V:+1 C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: Elukey) [08:39:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1248.eqiad.wmnet with OS bookworm [08:44:01] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for uwsgi-bitu [puppet] - https://gerrit.wikimedia.org/r/1023807 (https://phabricator.wikimedia.org/T135991) [08:46:16] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [08:47:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [08:47:11] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023807 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:47:25] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:00] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for uwsgi-bitu [puppet] - https://gerrit.wikimedia.org/r/1023807 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:51:25] (CR) Clément Goubert: [C:+2] mw-web, mw-api-ext: Raise replicas for 80% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1023412 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert) [08:51:35] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [08:52:05] (CR) Btullis: [C:+1] "Looks good to me, thanks elukey." 
[puppet] - https://gerrit.wikimedia.org/r/1023454 (https://phabricator.wikimedia.org/T362181) (owner: Elukey) [08:52:24] (Merged) jenkins-bot: mw-web, mw-api-ext: Raise replicas for 80% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1023412 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert) [08:52:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [08:52:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [08:53:05] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [08:53:23] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [08:53:32] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [08:53:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [08:54:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [08:54:24] (CR) Brouberol: [C:+2] idp_test: register the mpic_next service configuration [puppet] - https://gerrit.wikimedia.org/r/1023431 (https://phabricator.wikimedia.org/T361341) (owner: Brouberol) [08:54:29] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [08:54:34] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [08:54:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T352010)', diff saved to https://phabricator.wikimedia.org/P61131 and previous config saved to /var/cache/conftool/dbconfig/20240424-085442-ladsgroup.json [08:54:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [08:54:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [08:55:41] !log arnaudb@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [08:56:36] SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#9739515 (cmooney) >>! In T362985#9735205, @ayounsi wrote: > Another question I think is "do we still have to go through text files ?" > It made sens for back in the time when we we... [08:57:17] SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#9739518 (cmooney) [08:57:26] (CR) Clément Goubert: [C:+2] trafficserver: move 80% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1023413 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert) [08:59:50] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023554 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [08:59:57] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023555 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [09:00:29] (CR) Volans: [C:+1] "LGTM, bike-shedding on the service name inline" [puppet] - https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [09:00:37] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) (owner: Scott French) [09:02:53] (CR) Elukey: [C:+2] Deploy the Java Truststore with PKI Root CA on Stat nodes [puppet] - https://gerrit.wikimedia.org/r/1023454 (https://phabricator.wikimedia.org/T362181) (owner: Elukey) [09:08:05] !log run 'kill `pgrep -u dbad2021`' on all stat nodes to unblock puppet [09:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:59] (CR) Aklapper: [C:+1] "Thanks for looking into this. From a quick read this makes sense to me." 
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) (owner: Pppery) [09:09:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P61132 and previous config saved to /var/cache/conftool/dbconfig/20240424-090950-ladsgroup.json [09:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1248.eqiad.wmnet with OS bookworm [09:15:33] SRE, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9739549 (Clement_Goubert) [09:20:10] (CR) Filippo Giunchedi: [C:+2] kubernetes: add rsyslog imfile remedy to nodes [puppet] - https://gerrit.wikimedia.org/r/1023806 (https://phabricator.wikimedia.org/T357616) (owner: Filippo Giunchedi) [09:23:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61133 and previous config saved to /var/cache/conftool/dbconfig/20240424-092353-arnaudb.json [09:24:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1247.eqiad.wmnet with reason: T362746 [09:24:44] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [09:24:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: T362746 [09:24:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to 
https://phabricator.wikimedia.org/P61134 and previous config saved to /var/cache/conftool/dbconfig/20240424-092457-ladsgroup.json [09:25:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1247', diff saved to https://phabricator.wikimedia.org/P61135 and previous config saved to /var/cache/conftool/dbconfig/20240424-092540-arnaudb.json [09:27:32] (PS1) Majavah: logos: Update cawiki 750k logo tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1023812 (https://phabricator.wikimedia.org/T363057) [09:27:34] jouncebot: nowandnext [09:27:34] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [09:27:34] In 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1000) [09:28:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1247.eqiad.wmnet with OS bookworm [09:28:27] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1023812 (https://phabricator.wikimedia.org/T363057) (owner: Majavah) [09:29:12] (Merged) jenkins-bot: logos: Update cawiki 750k logo tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1023812 (https://phabricator.wikimedia.org/T363057) (owner: Majavah) [09:29:39] !log 80% of external traffix to mw-on-k8s - T362323 [09:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:43] T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323 [09:29:52] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1023812|logos: Update cawiki 750k logo tagline (T363057)]] [09:29:56] T363057: Changing logos and tagline for the 750k article milestone in the Catalan Wikipedia - https://phabricator.wikimedia.org/T363057 [09:32:40] !log taavi@deploy1002 taavi: Backport for [[gerrit:1023812|logos: Update cawiki 750k logo tagline 
(T363057)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:33:06] !log taavi@deploy1002 taavi: Continuing with sync [09:38:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61136 and previous config saved to /var/cache/conftool/dbconfig/20240424-093859-arnaudb.json [09:39:08] SRE, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739601 (BTullis) [09:40:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T352010)', diff saved to https://phabricator.wikimedia.org/P61137 and previous config saved to /var/cache/conftool/dbconfig/20240424-094004-ladsgroup.json [09:40:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:40:15] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:40:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:40:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T352010)', diff saved to https://phabricator.wikimedia.org/P61138 and previous config saved to /var/cache/conftool/dbconfig/20240424-094027-ladsgroup.json [09:41:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [09:44:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [09:44:45] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1023812|logos: Update cawiki 750k logo tagline (T363057)]] (duration: 14m 53s) [09:44:50] T363057: Changing logos and tagline for the 750k article milestone in the Catalan Wikipedia - 
https://phabricator.wikimedia.org/T363057 [09:45:24] !log echo "https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-ca-750k.svg" | mwscript purgeList.php --wiki enwiki # T363057 [09:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:51] SRE, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739626 (BTullis) [09:46:16] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2098/co" [puppet] - https://gerrit.wikimedia.org/r/1023813 (https://phabricator.wikimedia.org/T360439) (owner: Btullis) [09:54:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61139 and previous config saved to /var/cache/conftool/dbconfig/20240424-095405-arnaudb.json [09:59:34] (CR) Filippo Giunchedi: "I'm +1 on the firing/resolved and count in square brackets, and -1 for moving the alert group after summary. 
Would you mind splitting the " [puppet] - https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: Herron) [09:59:37] (CR) Muehlenhoff: Add server aliases to the cirrus/cfssl proxy config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: Btullis) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1000) [10:00:28] (Abandoned) Filippo Giunchedi: alertmanager: tweak irc alert message format [puppet] - https://gerrit.wikimedia.org/r/1019829 (https://phabricator.wikimedia.org/T362239) (owner: Filippo Giunchedi) [10:04:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1247.eqiad.wmnet with OS bookworm [10:06:48] (PS1) Btullis: Switch the wcqs tlsproxy to use pki [puppet] - https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) [10:07:39] (PS1) Muehlenhoff: Stop including idp-build in the idp-test role [puppet] - https://gerrit.wikimedia.org/r/1023816 [10:07:39] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for rsync/idp-test [puppet] - https://gerrit.wikimedia.org/r/1023817 (https://phabricator.wikimedia.org/T135991) [10:09:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61140 and previous config saved to /var/cache/conftool/dbconfig/20240424-100910-arnaudb.json [10:11:19] (CR) Muehlenhoff: Switch the wcqs tlsproxy to use pki (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) (owner: Btullis) [10:17:08] (CR) Btullis: [V:+1] Add server aliases to the cirrus/cfssl proxy config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: Btullis) 
[10:17:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61141 and previous config saved to /var/cache/conftool/dbconfig/20240424-101713-arnaudb.json [10:18:16] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 151326 [10:19:03] (PS1) NMW03: Enabled subpages for main namespace in ptwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1023530 (https://phabricator.wikimedia.org/T362300) [10:19:17] (PS2) Btullis: Switch the wcqs tlsproxy to use pki [puppet] - https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) [10:19:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 151326 [10:20:18] SRE, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739732 (BTullis) [10:20:45] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2100/co" [puppet] - https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) (owner: Btullis) [10:21:57] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki [10:22:04] !log taavi@cumin1002 Added views for new wiki: kuswiki T360302 [10:22:04] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [10:22:10] T360302: Prepare and check storage layer for kuswiki - https://phabricator.wikimedia.org/T360302 [10:23:32] SRE, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739759 (BTullis) [10:24:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 75%: Post reimage', diff saved to 
https://phabricator.wikimedia.org/P61142 and previous config saved to /var/cache/conftool/dbconfig/20240424-102416-arnaudb.json [10:29:37] (03PS1) 10Btullis: Switch wdqs::public tlsproxy from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023819 (https://phabricator.wikimedia.org/T360439) [10:30:13] (03PS1) 10Majavah: wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 [10:30:13] (03PS1) 10Majavah: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 [10:30:13] (03PS1) 10Majavah: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 [10:30:18] (03PS1) 10Alexandros Kosiaris: wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) [10:30:45] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database bewwiki [10:30:52] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database bewwiki [10:31:06] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2101/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023819 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [10:31:45] (03PS2) 10Majavah: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 [10:31:45] (03PS2) 10Majavah: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 [10:32:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61143 and previous config saved to /var/cache/conftool/dbconfig/20240424-103218-arnaudb.json [10:32:26] (03CR) 10Alexandros Kosiaris: [C:04-1] "Ah,now this rings a bell. 
The TL;DR is this patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023824 and once it's d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [10:32:29] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database mywikisource (T363269) [10:32:36] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database mywikisource (T363269) [10:32:36] (03PS1) 10NMW03: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) [10:32:39] T363269: Prepare and check storage layer for mywikisource - https://phabricator.wikimedia.org/T363269 [10:32:54] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739804 (10BTullis) [10:34:12] (03CR) 10CI reject: [V:04-1] wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah) [10:34:19] (03CR) 10CI reject: [V:04-1] wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah) [10:35:33] (03CR) 10CI reject: [V:04-1] wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah) [10:35:58] (03Abandoned) 10Jgiannelos: wikifeeds: upgrade to node18 from node16 deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007959 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [10:36:02] (03CR) 10CI reject: [V:04-1] wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 (owner: 10Majavah) [10:36:39] (03PS1) 10Btullis: Switch wdqs::internal tlsproxy from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023825 
(https://phabricator.wikimedia.org/T360439) [10:36:49] (03PS2) 10Majavah: wikireplicas: add-wiki: Convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 [10:36:49] (03PS3) 10Majavah: wikireplicas: add-wiki: use runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 [10:36:50] (03PS3) 10Majavah: wikireplicas: add-wiki: stop sourcing novaenv.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/1023823 [10:37:22] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739818 (10BTullis) [10:37:36] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database iglwiki (T363262) [10:37:43] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database iglwiki (T363262) [10:37:51] T363262: Prepare and check storage layer for iglwiki - https://phabricator.wikimedia.org/T363262 [10:38:02] (03PS2) 10NMW03: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) [10:38:10] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2102/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023825 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [10:38:30] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9739833 (10BTullis) [10:38:37] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database kaawiktionary (T363255) [10:38:44] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database kaawiktionary (T363255) [10:38:45] T363255: Prepare and check storage layer for kaawiktionary - 
https://phabricator.wikimedia.org/T363255 [10:39:08] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database mswikisource (T363249) [10:39:12] T363249: Prepare and check storage layer for mswikisource - https://phabricator.wikimedia.org/T363249 [10:39:14] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database mswikisource (T363249) [10:39:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61144 and previous config saved to /var/cache/conftool/dbconfig/20240424-103922-arnaudb.json [10:40:01] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database kawikisource (T363242) [10:40:09] T363242: Prepare and check storage layer for kawikisource - https://phabricator.wikimedia.org/T363242 [10:40:19] (03CR) 10Btullis: [V:03+1] Switch the wcqs tlsproxy to use pki (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [10:43:30] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310 (10JMeybohm) 03NEW [10:43:59] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9739865 (10JMeybohm) [10:47:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61145 and previous config saved to /var/cache/conftool/dbconfig/20240424-104724-arnaudb.json [10:50:59] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023816 (owner: 10Muehlenhoff) [10:54:15] (03PS2) 10JMeybohm: wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [11:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1100). [11:01:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:01:48] (03PS1) 10Slyngshede: P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) [11:02:10] (03CR) 10CI reject: [V:04-1] P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:02:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61146 and previous config saved to /var/cache/conftool/dbconfig/20240424-110230-arnaudb.json [11:05:11] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database kawikisource (T363242) [11:05:30] T363242: Prepare and check storage layer for kawikisource - https://phabricator.wikimedia.org/T363242 [11:13:55] (03PS2) 10Slyngshede: P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) [11:17:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61147 and previous config saved to /var/cache/conftool/dbconfig/20240424-111735-arnaudb.json [11:20:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - 
https://phabricator.wikimedia.org/T362772#9739921 (10cmooney) @ayounsi pointed out another option we may have here to address the switch being single-point of failure. Using d... [11:20:37] (03CR) 10Muehlenhoff: [C:03+2] Stop including idp-build in the idp-test role [puppet] - 10https://gerrit.wikimedia.org/r/1023816 (owner: 10Muehlenhoff) [11:21:18] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2103/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:24:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [11:25:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1023825 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [11:25:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9739939 (10cmooney) [11:26:46] (03PS1) 10FNegri: wmcs:metricsinfra set Grafana scrape interval [puppet] - 10https://gerrit.wikimedia.org/r/1023837 (https://phabricator.wikimedia.org/T363176) [11:27:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1023819 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [11:29:49] (03CR) 10CI reject: [V:04-1] wmcs:metricsinfra set Grafana scrape interval [puppet] - 10https://gerrit.wikimedia.org/r/1023837 (https://phabricator.wikimedia.org/T363176) (owner: 10FNegri) [11:30:22] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 2 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9739949 (10MoritzMuehlenhoff) Looks good. 
We can't disable DRBD on instance creation currently, simply add it as usual an... [11:32:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61148 and previous config saved to /var/cache/conftool/dbconfig/20240424-113241-arnaudb.json [11:36:35] (03PS2) 10FNegri: wmcs::metricsinfra: set Grafana scrape interval [puppet] - 10https://gerrit.wikimedia.org/r/1023837 (https://phabricator.wikimedia.org/T363176) [11:37:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:38:26] (03PS3) 10Slyngshede: P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) [11:39:31] (03CR) 10Slyngshede: "PCC wasn't actually happy with it" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:40:10] (03PS4) 10Slyngshede: P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) [11:41:34] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2105/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:45:16] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2107/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:49:30] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge [11:50:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) 
rolling restart_daemons on A:relforge [11:53:01] (03CR) 10Volans: [C:03+1] "LGTM, make sure to test it either with test-cookbook or after merging it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1023821 (owner: 10Majavah) [11:53:39] (03CR) 10Volans: [C:03+1] "LGTM, minor message nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023822 (owner: 10Majavah) [11:53:44] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2108/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:54:09] (03PS5) 10Slyngshede: P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) [11:55:21] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2109/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:57:07] (03PS1) 10Muehlenhoff: Automatically restart memcached/mcrouter on idp-test nodes [puppet] - 10https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) [11:58:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [11:59:20] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2110/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [12:01:45] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2111/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [12:02:06] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idm Allow enable API on test server [puppet] - 10https://gerrit.wikimedia.org/r/1023829 (https://phabricator.wikimedia.org/T361066) (owner: 10Slyngshede) [12:02:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:10:07] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-canary [12:10:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-canary [12:15:45] (03PS1) 10Daimona Eaytoy: WikiEduDashboard: allow removal when course is not synced [extensions/CampaignEvents] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023796 (https://phabricator.wikimedia.org/T363187) [12:16:11] (03PS1) 10Daimona Eaytoy: WikiEduDashboard: allow removal when course is not synced [extensions/CampaignEvents] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023797 (https://phabricator.wikimedia.org/T363187) [12:20:31] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-codfw [12:23:16] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on stat1010.eqiad.wmnet with reason: Connecting GPU power cable [12:23:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on stat1010.eqiad.wmnet with reason: Connecting GPU power cable [12:24:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1242.eqiad.wmnet with reason: T362746 [12:24:51] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [12:24:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1242.eqiad.wmnet with 
reason: T362746 [12:25:01] (03PS1) 10Vgutierrez: hiera: Enable benthos on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1023844 (https://phabricator.wikimedia.org/T362776) [12:25:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1242', diff saved to https://phabricator.wikimedia.org/P61149 and previous config saved to /var/cache/conftool/dbconfig/20240424-122520-arnaudb.json [12:26:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1242.eqiad.wmnet with OS bookworm [12:27:09] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2112/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023844 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [12:32:32] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:39:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1242.eqiad.wmnet with reason: host reimage [12:40:42] (03PS1) 10Ilias Sarantopoulos: ml-services: enable payload logging in all revscoring services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023846 (https://phabricator.wikimedia.org/T362503) [12:41:48] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2113/co" [puppet] - 10https://gerrit.wikimedia.org/r/1023844 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [12:42:23] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable benthos on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1023844 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [12:42:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1242.eqiad.wmnet with reason: host reimage [12:42:42] (03CR) 
10Klausman: [C:03+1] ml-services: enable payload logging in all revscoring services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023846 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [12:43:09] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable payload logging in all revscoring services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023846 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [12:43:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-codfw [12:44:28] (03Merged) 10jenkins-bot: ml-services: enable payload logging in all revscoring services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023846 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [12:45:54] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:50:02] (03PS1) 10Muehlenhoff: configmaster: Enable profile::auto_restarts::service for apache/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023847 (https://phabricator.wikimedia.org/T135991) [12:51:01] (03PS2) 10Muehlenhoff: configmaster: Enable profile::auto_restarts::service for apache/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023847 (https://phabricator.wikimedia.org/T135991) [12:52:26] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-eqiad [12:53:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:55:26] (03PS1) 10Slyngshede: Blocklist IPs: Delete if expiry is in the past. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1023848 [12:57:27] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [12:58:27] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:59:44] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1300). [13:00:05] NMW03, sergi0, and Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:23] i can deploy today [13:00:28] hello [13:00:30] hi sergi0 [13:00:49] (03CR) 10Urbanecm: [C:03+2] WikiEduDashboard: allow removal when course is not synced [extensions/CampaignEvents] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023796 (https://phabricator.wikimedia.org/T363187) (owner: 10Daimona Eaytoy) [13:00:50] (03CR) 10Urbanecm: [C:03+2] WikiEduDashboard: allow removal when course is not synced [extensions/CampaignEvents] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023797 (https://phabricator.wikimedia.org/T363187) (owner: 10Daimona Eaytoy) [13:00:52] let's go [13:00:57] (03PS2) 10Urbanecm: Growth: Enable Levelling up features on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023101 (https://phabricator.wikimedia.org/T348086) [13:01:01] (03CR) 10Urbanecm: [C:03+2] Growth: Enable Levelling up features on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023101 (https://phabricator.wikimedia.org/T348086) (owner: 
10Urbanecm) [13:01:10] NMW03: hi, are you around please? [13:01:56] (03Merged) 10jenkins-bot: Growth: Enable Levelling up features on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023101 (https://phabricator.wikimedia.org/T348086) (owner: 10Urbanecm) [13:02:13] i assume not [13:02:21] (03PS1) 10Ssingh: magru: add lvs700[1-3] and related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) [13:03:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1242.eqiad.wmnet with OS bookworm [13:03:42] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2114/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:03:55] (03Merged) 10jenkins-bot: WikiEduDashboard: allow removal when course is not synced [extensions/CampaignEvents] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1023796 (https://phabricator.wikimedia.org/T363187) (owner: 10Daimona Eaytoy) [13:04:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CampaignEvents] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023797 (https://phabricator.wikimedia.org/T363187) (owner: 10Daimona Eaytoy) [13:04:03] (03Merged) 10jenkins-bot: WikiEduDashboard: allow removal when course is not synced [extensions/CampaignEvents] (wmf/1.43.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1023797 (https://phabricator.wikimedia.org/T363187) (owner: 10Daimona Eaytoy) [13:04:38] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1023101|Growth: Enable Levelling up features on all wikis (T348086)]], [[gerrit:1023796|WikiEduDashboard: allow removal when course is not synced (T363187)]], [[gerrit:1023797|WikiEduDashboard: allow removal when course is not synced (T363187)]] [13:04:56] T348086: 
Leveling Up: Scale to all Wikipedias - https://phabricator.wikimedia.org/T348086 [13:04:56] T363187: Removing tracking tool fails if the sync was disabled through other means - https://phabricator.wikimedia.org/T363187 [13:06:28] (03CR) 10Ssingh: [V:03+1] magru: add lvs700[1-3] and related configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:07:32] !log urbanecm@deploy1002 daimona and urbanecm: Backport for [[gerrit:1023101|Growth: Enable Levelling up features on all wikis (T348086)]], [[gerrit:1023796|WikiEduDashboard: allow removal when course is not synced (T363187)]], [[gerrit:1023797|WikiEduDashboard: allow removal when course is not synced (T363187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:46] sergi0: Daimona: can you test at mwdebug1002, please? [13:08:03] yes [13:09:26] Yup [13:09:31] please do :) [13:09:37] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling restart_daemons on A:durum [13:09:47] urbanecm I am here [13:09:52] Hm, actually, no. The bug can't be reproduced unless I manually mess with the DB. [13:10:07] Daimona: ah, okay. 
in that case, if it does not break things, i am happy to deploy
[13:10:15] NMW03: okay, i'll do your patches in a minute then
[13:10:16] But I can test that everything else is still fine
[13:10:22] (PS2) NMW03: Enabled subpages for main namespace in ptwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1023530 (https://phabricator.wikimedia.org/T362300)
[13:10:25] (CR) Urbanecm: [C:+2] Enabled subpages for main namespace in ptwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1023530 (https://phabricator.wikimedia.org/T362300) (owner: NMW03)
[13:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:11:11] (Merged) jenkins-bot: Enabled subpages for main namespace in ptwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1023530 (https://phabricator.wikimedia.org/T362300) (owner: NMW03)
[13:11:25] NMW03: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1023531 is changing a namespace name. this means all links using the previous namespace name will no longer work. i believe that is not intended?
[13:11:52] Yes
[13:11:58] urbanecm: seems fine from my side (no errors), the notification jobs are pushed with 48h delays so will check that then
[13:12:35] Should I redirect old namespace to new one?
[13:12:42] NMW03: yes please, add it as an alias
[13:12:46] sergi0: ack, let's deploy then
[13:12:47] !log urbanecm@deploy1002 daimona and urbanecm: Continuing with sync
[13:12:48] sure one second
[13:12:51] (CR) Krinkle: logging: do not explicitly set blackhole handler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[13:13:21] urbanecm: LGTM, I guess :P
[13:13:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61150 and previous config saved to /var/cache/conftool/dbconfig/20240424-131336-arnaudb.json
[13:13:40] Daimona: oh, sorry. i thought you already finished your tests :D
[13:13:42] Can't really test much, but there's no smoke to be seen, so
[13:14:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-eqiad
[13:14:16] Don't worry, as I said, it's not really testable, so there's only one way to see if it works :P
[13:14:43] hehe
[13:15:02] (PS1) Slyngshede: P:idm work-around for bug in Debian package. [puppet] - https://gerrit.wikimedia.org/r/1023852
[13:15:47] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector
[13:15:49] (CR) Slyngshede: "Merge after Bitu 0.7.0 release, where the package will be pulled in as a dependency."
[puppet] - 10https://gerrit.wikimedia.org/r/1023852 (owner: 10Slyngshede) [13:17:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1199', diff saved to https://phabricator.wikimedia.org/P61151 and previous config saved to /var/cache/conftool/dbconfig/20240424-131702-arnaudb.json [13:17:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1199.eqiad.wmnet with reason: T362746 [13:17:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: T362746 [13:17:42] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [13:17:59] (03CR) 10CI reject: [V:04-1] P:idm work-around for bug in Debian package. [puppet] - 10https://gerrit.wikimedia.org/r/1023852 (owner: 10Slyngshede) [13:18:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1199.eqiad.wmnet with OS bookworm [13:19:38] (03CR) 10Elukey: [V:03+1 C:03+2] role::aqs: remove old settings not used anymore after the move to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1023453 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:23:01] (03PS3) 10NMW03: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) [13:23:09] (03PS4) 10NMW03: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) [13:23:18] (03CR) 10Urbanecm: [C:03+2] Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [13:23:38] urbanecm I think talk page will be handled automatically, right? 
[13:23:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector
[13:23:50] (CR) CI reject: [V:-1] Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) (owner: NMW03)
[13:23:58] (CR) Urbanecm: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) (owner: NMW03)
[13:24:05] NMW03: good point. that needs to be aliased separately
[13:24:12] it'll be renamed automatically, but the aliasing will not happen
[13:24:22] alright
[13:24:25] CI failed too lol
[13:24:36] yea
[13:24:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1023101|Growth: Enable Levelling up features on all wikis (T348086)]], [[gerrit:1023796|WikiEduDashboard: allow removal when course is not synced (T363187)]], [[gerrit:1023797|WikiEduDashboard: allow removal when course is not synced (T363187)]] (duration: 20m 21s)
[13:25:08] sergi0: Daimona: your patches are live
[13:25:18] thanks!
[13:25:22] T348086: Leveling Up: Scale to all Wikipedias - https://phabricator.wikimedia.org/T348086
[13:25:22] T363187: Removing tracking tool fails if the sync was disabled through other means - https://phabricator.wikimedia.org/T363187
[13:25:25] Nice! Thank you Martin!
[13:25:27] np [13:26:46] (03CR) 10Hnowlan: [C:03+1] Replace tabs with 4 spaces in tlsproxy nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1023440 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [13:27:18] (03PS1) 10Ssingh: roll-restart-reboot-durum: don't disable Puppet on restart [cookbooks] - 10https://gerrit.wikimedia.org/r/1023854 [13:27:20] (03PS5) 10NMW03: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) [13:28:27] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023854 (owner: 10Ssingh) [13:28:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61152 and previous config saved to /var/cache/conftool/dbconfig/20240424-132841-arnaudb.json [13:29:20] (03PS1) 10Clément Goubert: cert-manager: Bump memory in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023855 [13:29:56] (03CR) 10Muehlenhoff: "I created a fixed package, I copied it to my home on idm1001, could you test it? 
If it works fine, then we can simply upload that to apt.w" [puppet] - 10https://gerrit.wikimedia.org/r/1023852 (owner: 10Slyngshede) [13:30:11] (03CR) 10Elukey: [V:03+1 C:03+2] role::restbase::production: change Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1021915 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:30:37] (03CR) 10Hnowlan: [C:03+1] cert-manager: Bump memory in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023855 (owner: 10Clément Goubert) [13:30:52] (03CR) 10Urbanecm: [C:03+2] Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [13:30:56] (03PS1) 10JMeybohm: Kubernetes: Move use_pki_certs from site to common [puppet] - 10https://gerrit.wikimedia.org/r/1023856 [13:31:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [13:31:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage [13:31:45] (03Merged) 10jenkins-bot: Updated uzwiktionary project namespace name and site name to follow Uzbek grammar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023531 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [13:31:48] (03CR) 10CI reject: [V:04-1] roll-restart-reboot-durum: don't disable Puppet on restart [cookbooks] - 10https://gerrit.wikimedia.org/r/1023854 (owner: 10Ssingh) [13:32:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]] [13:32:28] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2115/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023856 (owner: 10JMeybohm) [13:32:35] T362300: Enable subpages on pt.wikimedia.org main namespace - https://phabricator.wikimedia.org/T362300 [13:32:35] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [13:32:39] (03CR) 10Alexandros Kosiaris: [C:03+1] cert-manager: Bump memory in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023855 (owner: 10Clément Goubert) [13:32:40] (03PS2) 10Ssingh: roll-restart-reboot-durum: don't disable Puppet on restart [cookbooks] - 10https://gerrit.wikimedia.org/r/1023854 [13:33:21] (03CR) 10Clément Goubert: [C:03+2] cert-manager: Bump memory in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023855 (owner: 10Clément Goubert) [13:33:37] (03PS1) 10Muehlenhoff: sre.o11y.roll-restart-reboot-logstash-collectors: Also restart Envoy [cookbooks] - 10https://gerrit.wikimedia.org/r/1023857 [13:34:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling restart_daemons on A:durum [13:34:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: host reimage [13:34:59] !log urbanecm@deploy1002 urbanecm and nmw03: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:35:08] NMW03: can you test at mwdebug1002, please? [13:35:14] sure [13:35:41] (03Merged) 10jenkins-bot: cert-manager: Bump memory in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023855 (owner: 10Clément Goubert) [13:36:33] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:36:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:37:02] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2021.codfw.wmnet: Deploy new TLS Truststore for PKI - elukey@cumin1002 [13:37:31] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:37:57] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:38:19] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for prometheus-logstash-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1023859 (https://phabricator.wikimedia.org/T135991) [13:38:54] !log elukey@cumin1002 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching restbase2021.codfw.wmnet: Deploy new TLS Truststore for PKI - elukey@cumin1002 [13:39:35] urbanecm ptwikimedia change is LGTM, namespaces for uzwiktionary didn't work. Can we abandon it? I don't have time to fix it right now. I will fix it by late backport window [13:39:42] okay, sounds good [13:39:44] !log urbanecm@deploy1002 Sync cancelled. 
[13:40:08] (03PS1) 10Urbanecm: Revert "Updated uzwiktionary project namespace name and site name to follow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023798 (https://phabricator.wikimedia.org/T362620) [13:40:09] ptwikimedia patch is OK [13:40:16] (03CR) 10Urbanecm: [C:03+2] Revert "Updated uzwiktionary project namespace name and site name to follow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023798 (https://phabricator.wikimedia.org/T362620) (owner: 10Urbanecm) [13:40:20] yup yup [13:40:25] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2021.codfw.wmnet: Deploy new TLS Truststore for PKI - elukey@cumin1002 [13:40:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023798 (https://phabricator.wikimedia.org/T362620) (owner: 10Urbanecm) [13:40:30] (03CR) 10Ssingh: [C:03+2] roll-restart-reboot-durum: don't disable Puppet on restart [cookbooks] - 10https://gerrit.wikimedia.org/r/1023854 (owner: 10Ssingh) [13:41:28] (03Merged) 10jenkins-bot: Revert "Updated uzwiktionary project namespace name and site name to follow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023798 (https://phabricator.wikimedia.org/T362620) (owner: 10Urbanecm) [13:41:59] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]], [[gerrit:1023798|Revert "Updated uzwiktionary project namespace name and site name to follow" (T362620)]] [13:42:17] T362300: Enable subpages on pt.wikimedia.org main namespace - https://phabricator.wikimedia.org/T362300 [13:42:17] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [13:43:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 25%: Post reimage', diff 
saved to https://phabricator.wikimedia.org/P61153 and previous config saved to /var/cache/conftool/dbconfig/20240424-134349-arnaudb.json [13:44:42] !log urbanecm@deploy1002 urbanecm and nmw03: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]], [[gerrit:1023798|Revert "Updated uzwiktionary project namespace name and site name to follow" (T362620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:44:42] !log urbanecm@deploy1002 Sync cancelled. [13:44:50] wat? [13:44:58] second try... [13:44:59] =# [13:45:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1023848 (owner: 10Slyngshede) [13:45:20] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]], [[gerrit:1023798|Revert "Updated uzwiktionary project namespace name and site name to follow" (T362620)]] [13:47:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [13:48:01] !log urbanecm@deploy1002 urbanecm and nmw03: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]], [[gerrit:1023798|Revert "Updated uzwiktionary project namespace name and site name to follow" (T362620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:48:06] !log urbanecm@deploy1002 urbanecm and nmw03: Continuing with sync [13:48:12] now it works [13:48:17] T362300: Enable subpages on pt.wikimedia.org main namespace - 
https://phabricator.wikimedia.org/T362300 [13:48:17] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [13:48:20] yeah [13:49:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2021.codfw.wmnet: Deploy new TLS Truststore for PKI - elukey@cumin1002 [13:53:24] (03CR) 10Herron: "sure thing, will do" [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [13:55:52] (03CR) 10Slyngshede: "Works perfectly. Tested OK on idm-test" [puppet] - 10https://gerrit.wikimedia.org/r/1023852 (owner: 10Slyngshede) [13:55:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1199.eqiad.wmnet with OS bookworm [13:56:14] (03Abandoned) 10Slyngshede: P:idm work-around for bug in Debian package. [puppet] - 10https://gerrit.wikimedia.org/r/1023852 (owner: 10Slyngshede) [13:56:43] (03CR) 10Slyngshede: [C:03+2] Blocklist IPs: Delete if expiry is in the past. [software/bitu] - 10https://gerrit.wikimedia.org/r/1023848 (owner: 10Slyngshede) [13:56:49] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Deploy new TLS Truststore for PKI - elukey@cumin1002 [13:57:41] (03PS1) 10Klausman: /home/klausman: Add a few tmuxp files [puppet] - 10https://gerrit.wikimedia.org/r/1023861 [13:58:44] (03Merged) 10jenkins-bot: Blocklist IPs: Delete if expiry is in the past. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1023848 (owner: 10Slyngshede) [13:58:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61155 and previous config saved to /var/cache/conftool/dbconfig/20240424-135854-arnaudb.json [13:59:29] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1023530|Enabled subpages for main namespace in ptwikimedia (T362300)]], [[gerrit:1023531|Updated uzwiktionary project namespace name and site name to follow Uzbek grammar (T362620)]], [[gerrit:1023798|Revert "Updated uzwiktionary project namespace name and site name to follow" (T362620)]] (duration: 14m 08s) [13:59:46] T362300: Enable subpages on pt.wikimedia.org main namespace - https://phabricator.wikimedia.org/T362300 [13:59:47] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1400) [14:00:15] thanks urbanecm [14:00:35] np [14:02:39] (03PS8) 10Esanders: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [14:02:42] 10ops-eqiad, 06SRE, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9740639 (10Jclark-ctr) 05Open→03Resolved Installed Gpu into stat1010 [14:03:26] (03PS9) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [14:07:11] (03CR) 10Hashar: logging: do not explicitly set blackhole handler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [14:10:46] (03PS3) 10Hashar: logging: do not explicitly set blackhole handler [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) [14:11:21] !log elukey@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:restbase-codfw: Deploy new TLS Truststore for PKI - elukey@cumin1002 [14:12:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61156 and previous config saved to /var/cache/conftool/dbconfig/20240424-141241-arnaudb.json [14:13:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1190', diff saved to https://phabricator.wikimedia.org/P61157 and previous config saved to /var/cache/conftool/dbconfig/20240424-141305-arnaudb.json [14:13:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: T362746 [14:13:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: T362746 [14:13:33] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [14:13:36] (03PS1) 10Clément Goubert: eventrouter: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023862 [14:13:48] (03PS1) 10Muehlenhoff: sre.wdqs.restart-nginx: Also restart Envoy alongside [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 [14:14:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61158 and previous config saved to /var/cache/conftool/dbconfig/20240424-141400-arnaudb.json [14:14:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bookworm [14:15:28] (03PS1) 10Slyngshede: Bitu 0.7.0 release [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 [14:19:23] !log import djangorestframework 3.14.0-2+wmf12u1 to apt.wikimedia.org (bug fix needed for Bitu 0.7.0, 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1068747) [14:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [14:20:13] !log restarting pdns-rec on dns6001 [14:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:10] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [14:24:43] (03CR) 10Effie Mouzeli: [C:03+1] Enable profile::auto_restarts::service for testreduce/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023734 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:25:28] 10ops-eqiad, 06SRE: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T363280#9740682 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate ticket T362033 [14:26:33] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Deploy new TLS Truststore for PKI - elukey@cumin1002 [14:27:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage [14:27:46] (03CR) 10Muehlenhoff: [C:03+1] "Typo inline, LGTM otherwise" [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 (owner: 10Slyngshede) [14:27:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61159 and previous config saved to /var/cache/conftool/dbconfig/20240424-142747-arnaudb.json [14:29:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1242 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61160 and previous config saved to /var/cache/conftool/dbconfig/20240424-142905-arnaudb.json [14:31:51] 10ops-eqiad, 06SRE, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9740696 (10Jclark-ctr) @ABran-WMF Received replacement 
Dimm please reach out to me or @VRiley-WMF to schedule replacement I am available today but will be off the next two days [14:32:14] (03PS2) 10Slyngshede: Bitu 0.7.0 release [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 [14:32:25] (03CR) 10Slyngshede: Bitu 0.7.0 release (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 (owner: 10Slyngshede) [14:32:27] 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9740697 (10aborrero) 05Stalled→03Open reopening -- we might want to take a look at this soon. [14:32:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage [14:32:45] (03CR) 10Muehlenhoff: [C:03+1] Bitu 0.7.0 release [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 (owner: 10Slyngshede) [14:34:09] (03CR) 10Slyngshede: [C:03+2] Bitu 0.7.0 release [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 (owner: 10Slyngshede) [14:35:37] (03Merged) 10jenkins-bot: Bitu 0.7.0 release [software/bitu] - 10https://gerrit.wikimedia.org/r/1023864 (owner: 10Slyngshede) [14:38:41] !log rolling restart of haproxy, pdns-rec and ntp on A:dnsbox [14:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61162 and previous config saved to /var/cache/conftool/dbconfig/20240424-144252-arnaudb.json [14:45:18] !log installing php7.4 security updates (as shipped in Debian, not our internal component) [14:45:18] 10ops-eqiad, 06SRE, 06DC-Ops, 
06SRE Observability, 13Patch-For-Review: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990#9740732 (10Jclark-ctr) a:03Jclark-ctr Opened request with Dell You have successfully submitted request SR1893... [14:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:47] (03CR) 10Klausman: [C:03+1] kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [14:50:48] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for testreduce/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023734 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:52:37] !log installing exim4/spamassassin on MXes [14:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bookworm [14:54:17] (03CR) 10Klausman: [C:03+1] admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [14:54:30] !log dancy@deploy1002 Installing scap version "4.79.0" for 325 hosts [14:55:19] !log dancy@deploy1002 Installation of scap version "4.79.0" completed for 325 hosts [14:55:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61163 and previous config saved to /var/cache/conftool/dbconfig/20240424-145545-arnaudb.json [14:55:52] (03PS2) 10Klausman: /home/klausman: Add a few tmuxp files [puppet] - 10https://gerrit.wikimedia.org/r/1023861 [14:56:33] (03PS3) 10Klausman: /home/klausman: Add a few tmuxp files [puppet] - 10https://gerrit.wikimedia.org/r/1023861 [14:57:17] (03CR) 10Alexandros 
Kosiaris: [C:04-1] "Thanks for this, looks pretty good, couple of inline comments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [14:57:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61164 and previous config saved to /var/cache/conftool/dbconfig/20240424-145758-arnaudb.json [14:58:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:44] (03CR) 10Klausman: [C:03+2] /home/klausman: Add a few tmuxp files [puppet] - 10https://gerrit.wikimedia.org/r/1023861 (owner: 10Klausman) [15:00:37] !log starting refinery deployment [15:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:11] !log ebysans@deploy1002 Started deploy [analytics/refinery@a5f2b25]: Regular analytics weekly train [analytics/refinery@a5f2b252] [15:01:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:01:39] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1087 - https://phabricator.wikimedia.org/T362871#9740776 (10Jclark-ctr) server is out of warranty Replaced Failed drive [15:06:46] (03CR) 10Hnowlan: [C:03+1] eventrouter: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023862 (owner: 10Clément Goubert) [15:07:02] (03CR) 10Clément Goubert: [C:03+2] eventrouter: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023862 (owner: 10Clément Goubert) [15:07:52] 10ops-eqiad, 06SRE, 
10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9740826 (10Jclark-ctr) @Eevans Replaced drive [15:09:48] (03PS1) 10Elukey: sre.cassandra.roll-restart: disable puppet and log target nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 [15:09:48] !log depooling cp4037 to test tls connection to kafka cluster (T358109) [15:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:54] (03Merged) 10jenkins-bot: eventrouter: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023862 (owner: 10Clément Goubert) [15:10:00] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [15:10:05] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [15:10:18] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:10:32] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:10:39] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:10:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61165 and previous config saved to /var/cache/conftool/dbconfig/20240424-151050-arnaudb.json [15:10:53] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:11:10] (03PS2) 10Elukey: sre.cassandra.roll-restart: disable puppet and log target nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 [15:12:36] (03CR) 10Volans: "unrelated but kinda related :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:12:54] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341 (10RobH) 03NEW [15:13:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61166 and previous config saved to /var/cache/conftool/dbconfig/20240424-151304-arnaudb.json [15:13:11] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9740879 (10RobH) [15:13:24] !log ebysans@deploy1002 Finished deploy [analytics/refinery@a5f2b25]: Regular analytics weekly train [analytics/refinery@a5f2b252] (duration: 12m 13s) [15:13:31] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns1004.wikimedia.org [15:14:15] (03CR) 10Alexandros Kosiaris: [C:03+1] CommonSettings: change jobrunner xff to mw-jobrunner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [15:14:40] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns1004.wikimedia.org [15:15:52] (03CR) 10Elukey: "Tested with test-coobook on cumin1002" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:16:54] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns1005.wikimedia.org [15:17:33] (03CR) 10Elukey: sre.cassandra.roll-restart: disable puppet and log target nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:18:00] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns1005.wikimedia.org [15:18:52] (03PS1) 10Fabfur: benthos: kafka brokers 
certificates can be definitely trusted [puppet] - 10https://gerrit.wikimedia.org/r/1023874 (https://phabricator.wikimedia.org/T358109) [15:20:02] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns1006.wikimedia.org [15:20:19] !log ebysans@deploy1002 Started deploy [analytics/refinery@a5f2b25] (thin): Regular analytics weekly train THIN [analytics/refinery@a5f2b252] [15:20:39] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344 (10RobH) 03NEW [15:21:05] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9740953 (10RobH) [15:21:14] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns1006.wikimedia.org [15:23:16] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns2004.wikimedia.org [15:23:56] !log ebysans@deploy1002 Finished deploy [analytics/refinery@a5f2b25] (thin): Regular analytics weekly train THIN [analytics/refinery@a5f2b252] (duration: 03m 36s) [15:24:21] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9740971 (10Eevans) >>! In T362841#9740826, @Jclark-ctr wrote: > @Eevans Replaced drive Did something happen to `sdf` during the swap? ` [Apr24 15:05] ata9: SATA link down (SStatus 0 SControl 300) [ +5.55... 
[15:25:47] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns2004.wikimedia.org [15:25:47] !log ebysans@deploy1002 Started deploy [analytics/refinery@a5f2b25] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@a5f2b252] [15:25:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61167 and previous config saved to /var/cache/conftool/dbconfig/20240424-152556-arnaudb.json [15:26:33] (03CR) 10Vgutierrez: [C:03+1] "we could drop skip_cert_verify entirely as false it's the default value.. but I like that we are verbose about it" [puppet] - 10https://gerrit.wikimedia.org/r/1023874 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:27:10] (03CR) 10Cwhite: [C:03+1] Revert "trafficserver: move prometheus-eqiad to prometheus1006" [puppet] - 10https://gerrit.wikimedia.org/r/1023156 (owner: 10Herron) [15:27:41] (03CR) 10Cwhite: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1023859 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:27:49] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns2005.wikimedia.org [15:28:03] (03CR) 10Cwhite: [C:03+1] sre.o11y.roll-restart-reboot-logstash-collectors: Also restart Envoy [cookbooks] - 10https://gerrit.wikimedia.org/r/1023857 (owner: 10Muehlenhoff) [15:28:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1199 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61168 and previous config saved to /var/cache/conftool/dbconfig/20240424-152811-arnaudb.json [15:28:28] (03CR) 10Cwhite: [C:03+1] Revert "prometheus: promote prometheus1006 to pushgateway duty" [puppet] - 10https://gerrit.wikimedia.org/r/1023155 (owner: 10Herron) [15:28:39] !log ebysans@deploy1002 Finished deploy [analytics/refinery@a5f2b25] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@a5f2b252] (duration: 02m 51s) 
[15:28:43] (03CR) 10Cwhite: [C:03+1] Revert "promote prometheus1006 as pushgateway primary" [dns] - 10https://gerrit.wikimedia.org/r/1023154 (owner: 10Herron) [15:29:06] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns2005.wikimedia.org [15:30:25] (03CR) 10Krinkle: [C:03+1] "LGTM, I have no merge here though, so up to Alex/Volans to merge :)" [software] - 10https://gerrit.wikimedia.org/r/1023388 (owner: 10Fabfur) [15:30:31] (03PS2) 10Fabfur: benthos:haproxy_cache: kafka brokers certificates can be verified [puppet] - 10https://gerrit.wikimedia.org/r/1023874 (https://phabricator.wikimedia.org/T358109) [15:30:54] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for prometheus-logstash-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1023859 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:31:09] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [15:31:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:32:20] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [15:33:50] (03CR) 10Fabfur: [C:03+2] benthos:haproxy_cache: kafka brokers certificates can be verified [puppet] - 10https://gerrit.wikimedia.org/r/1023874 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:33:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:39] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns3003.wikimedia.org [15:34:53] (03CR) 10Muehlenhoff: [C:03+2] 
sre.o11y.roll-restart-reboot-logstash-collectors: Also restart Envoy [cookbooks] - 10https://gerrit.wikimedia.org/r/1023857 (owner: 10Muehlenhoff) [15:34:58] (03PS1) 10JHathaway: vrts: fix time limit on generating aliases [puppet] - 10https://gerrit.wikimedia.org/r/1023878 [15:35:18] (03CR) 10CI reject: [V:04-1] vrts: fix time limit on generating aliases [puppet] - 10https://gerrit.wikimedia.org/r/1023878 (owner: 10JHathaway) [15:35:58] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns3003.wikimedia.org [15:36:57] (03PS2) 10JHathaway: vrts: fix time limit on generating aliases [puppet] - 10https://gerrit.wikimedia.org/r/1023878 [15:37:09] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [15:37:42] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [15:38:13] (03CR) 10Eevans: [C:03+1] "LGTM; Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:38:39] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns3004.wikimedia.org [15:40:00] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns3004.wikimedia.org [15:40:30] (03PS1) 10Fabfur: benthos:haproxy_cache: pass root cas file path as envvar [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) [15:41:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61169 and previous config saved to /var/cache/conftool/dbconfig/20240424-154101-arnaudb.json [15:41:38] (03CR) 10Volans: [C:04-1] "Small issue with the puppet disable, LGTM otherwise." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:41:57] (03CR) 10Pppery: [C:03+1] Replace a strlen(null) call for PHP 8.1 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1020170 (https://phabricator.wikimedia.org/T342244) (owner: 10Aklapper) [15:42:27] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org [15:43:27] (03PS5) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [15:44:30] (03PS3) 10Elukey: sre.cassandra.roll-restart: disable puppet and log target nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 [15:44:57] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org [15:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector [15:45:29] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:45:37] (03PS2) 10Fabfur: benthos:haproxy_cache: pass root cas file path as envvar [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) [15:46:22] (03PS6) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [15:47:22] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns4004.wikimedia.org [15:48:27] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [15:50:32] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns5003.wikimedia.org [15:51:38] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns5003.wikimedia.org [15:52:30] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update Docker image for revscoring-editquality-damaging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1023880 (https://phabricator.wikimedia.org/T362663) (owner: 10Elukey) [15:52:34] (03PS10) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [15:53:04] (03Merged) 10jenkins-bot: sre.cassandra.roll-restart: disable puppet and log target nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1023873 (owner: 10Elukey) [15:53:20] (03CR) 10Elukey: [C:03+2] ml-services: update Docker image for revscoring-editquality-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023880 (https://phabricator.wikimedia.org/T362663) (owner: 10Elukey) [15:53:45] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns5004.wikimedia.org [15:53:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:54:12] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts aqs1014.eqiad.wmnet [15:54:33] !log Deployed refinery using scap, then deployed onto hdfs. 
[15:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts aqs1014.eqiad.wmnet [15:54:52] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns5004.wikimedia.org [15:55:57] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts aqs1014.eqiad.wmnet [15:56:06] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.9 [puppet] - 10https://gerrit.wikimedia.org/r/1023882 (https://phabricator.wikimedia.org/T363349) [15:56:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61170 and previous config saved to /var/cache/conftool/dbconfig/20240424-155607-arnaudb.json [15:56:14] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Benthos instances [puppet] - 10https://gerrit.wikimedia.org/r/1023883 (https://phabricator.wikimedia.org/T135991) [15:56:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts aqs1014.eqiad.wmnet [15:56:57] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns6002.wikimedia.org [15:57:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1023878 (owner: 10JHathaway) [15:57:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023882 (https://phabricator.wikimedia.org/T363349) (owner: 10Jelto) [15:57:59] (03CR) 10JHathaway: [C:03+2] vrts: fix time limit on generating aliases [puppet] - 10https://gerrit.wikimedia.org/r/1023878 (owner: 10JHathaway) [15:58:17] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns6002.wikimedia.org [15:58:31] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.9 [puppet] - 
10https://gerrit.wikimedia.org/r/1023882 (https://phabricator.wikimedia.org/T363349) (owner: 10Jelto) [15:59:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:01:13] (03CR) 10Vgutierrez: benthos:haproxy_cache: pass root cas file path as envvar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:01:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:03:40] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf and hghani - https://phabricator.wikimedia.org/T363288#9741172 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03nshahquinn-wmf Could we please add a new, separat... [16:04:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[16:04:39] (03PS3) 10Fabfur: benthos:haproxy_cache: pass root cas file path as envvar [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) [16:05:06] !log running authdns-update [16:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:10] (03CR) 10Fabfur: benthos:haproxy_cache: pass root cas file path as envvar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:06:25] (SystemdUnitFailed) firing: (2) generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:43] (03PS1) 10Btullis: Support the move of a GPU from stat1008 to stat1010 [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) [16:07:33] (03CR) 10Elukey: [C:03+1] Support the move of a GPU from stat1008 to stat1010 [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:08:46] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9741196 (10VRiley-WMF) We have received the PERC from Dell and I have just completed swapping it out. It now looks like the system can now see the PERC (previously, it... [16:09:27] (ProbeDown) firing: (2) Service wdqs1016:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:33] (03CR) 10Elukey: [C:04-1] "Sorry for the intrusion Ben, would it be ok to upgrade the drivers too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:10:31] (03CR) 10Elukey: [C:04-1] "We can do it only on the new node, maybe with some specific hiera per-host settings.. I can upload the new patch if you want, so we avoid " [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:11:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61171 and previous config saved to /var/cache/conftool/dbconfig/20240424-161112-arnaudb.json [16:12:16] (03CR) 10Btullis: "Sure thing. I'm happy to make the new patch. I'll do it now." [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:16:22] (03PS5) 10Dwisehaupt: Add CDN configuration for new community-crm [puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) [16:17:25] (03PS2) 10Btullis: Support the move of a GPU from stat1008 to stat1010 [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) [16:18:21] (03CR) 10Elukey: [C:03+1] "thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:18:52] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9741271 (10KFrancis) Hello all, the NDA is out for signatures. I'll confirm when it's complete. Thanks! [16:19:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P61172 and previous config saved to /var/cache/conftool/dbconfig/20240424-161859-ladsgroup.json [16:19:03] (03CR) 10Dwisehaupt: "This should be ready for review and deployment. 
Testing over ssh tunnels was successful and we are ready to have others connect for testin" [puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:19:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:21:15] (03PS1) 10Btullis: Add dummy keytabs for new stats servers [labs/private] - 10https://gerrit.wikimedia.org/r/1023889 (https://phabricator.wikimedia.org/T336040) [16:21:35] (03CR) 10Btullis: [V:03+2 C:03+2] Add dummy keytabs for new stats servers [labs/private] - 10https://gerrit.wikimedia.org/r/1023889 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:22:21] (03PS2) 10MusikAnimal: [hewiki] enable CodeMirrorV6 and CodeMirrorLineNumberingNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023501 (https://phabricator.wikimedia.org/T357795) [16:24:12] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:24:27] (ProbeDown) resolved: (2) Service wdqs1016:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:00] (03CR) 10Btullis: [V:03+1 C:03+2] Support the move of a GPU from stat1008 to stat1010 [puppet] - 10https://gerrit.wikimedia.org/r/1023885 (https://phabricator.wikimedia.org/T336040) (owner: 10Btullis) [16:26:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:34:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P61173 and previous config saved to /var/cache/conftool/dbconfig/20240424-163407-ladsgroup.json [16:39:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023501 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [16:41:52] (03Merged) 10jenkins-bot: [hewiki] enable CodeMirrorV6 and CodeMirrorLineNumberingNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023501 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [16:42:21] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet [16:42:23] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:1023501|[hewiki] enable CodeMirrorV6 and CodeMirrorLineNumberingNamespaces (T357795 T347211)]] [16:43:01] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [16:43:01] T347211: Enable line numbering in all namespaces for all wikis - https://phabricator.wikimedia.org/T347211 [16:43:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Deploy new TLS Truststore for PKI - elukey@cumin1002 [16:45:07] !log hmonroy@deploy1002 musikanimal and hmonroy: Backport for [[gerrit:1023501|[hewiki] enable CodeMirrorV6 and CodeMirrorLineNumberingNamespaces (T357795 T347211)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:48:57] (03CR) 10Elukey: [C:03+1] Fix mcrouter module to work out of the box from scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021918 (https://phabricator.wikimedia.org/T355237) (owner: 10JMeybohm) [16:49:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', 
diff saved to https://phabricator.wikimedia.org/P61174 and previous config saved to /var/cache/conftool/dbconfig/20240424-164914-ladsgroup.json [16:49:24] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9741395 (10VRiley-WMF) 05Open→03In progress [16:49:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet [16:50:57] !log hmonroy@deploy1002 musikanimal and hmonroy: Continuing with sync [16:51:09] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply truststore changes — T352647 - eevans@cumin1002 [16:51:10] (03CR) 10Elukey: [C:03+1] modules: Add restrictedSecurityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022161 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [16:51:32] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [16:51:33] (03CR) 10Elukey: [C:03+1] eventgate: Update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [16:52:01] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9741396 (10jcrespo) a:05VRiley-WMF→03jcrespo Will reimage soon. 
[16:53:20] (03CR) 10Elukey: [C:03+1] eventgate-*: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019018 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [16:53:41] (03CR) 10Elukey: [C:03+1] eventgate: Add securityContext for all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022164 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1700) [17:02:59] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:1023501|[hewiki] enable CodeMirrorV6 and CodeMirrorLineNumberingNamespaces (T357795 T347211)]] (duration: 20m 36s) [17:03:05] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [17:03:06] T347211: Enable line numbering in all namespaces for all wikis - https://phabricator.wikimedia.org/T347211 [17:04:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P61175 and previous config saved to /var/cache/conftool/dbconfig/20240424-170421-ladsgroup.json [17:04:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [17:04:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:04:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [17:04:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P61176 and previous config saved to /var/cache/conftool/dbconfig/20240424-170444-ladsgroup.json [17:06:25] (SystemdUnitFailed) resolved: (2) generate_vrts_aliases.service on mx1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:51] (03CR) 10Alexandros Kosiaris: [C:03+1] wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [17:10:17] (03PS1) 10Btullis: Add thirdparty/ceph-reef to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023902 (https://phabricator.wikimedia.org/T362993) [17:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:10:33] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:10:36] (03CR) 10CI reject: [V:04-1] Add thirdparty/ceph-reef to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023902 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [17:11:08] (03PS2) 10Btullis: Add thirdparty/ceph-reef to wikimedia-bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023902 (https://phabricator.wikimedia.org/T362993) [17:12:24] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9741498 (10Jclark-ctr) Looking at lshw.log and inventory on idrac it looks like all the drives are in order except sdf ,sdh are swapped in slots. 
after sdf rebuilds i can swap sdh [17:12:27] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2118/console" [puppet] - 10https://gerrit.wikimedia.org/r/1023902 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [17:15:33] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:20:48] !log btullis@cumin1002 START - Cookbook sre.wdqs.restart [17:22:46] 06SRE, 10SRE-Access-Requests: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360 (10nshahquinn-wmf) 03NEW [17:23:54] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9741611 (10nshahquinn-wmf) [17:24:09] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf - https://phabricator.wikimedia.org/T363288#9741614 (10nshahquinn-wmf) a:05nshahquinn-wmf→03None [17:25:14] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf - https://phabricator.wikimedia.org/T363288#9741625 (10nshahquinn-wmf) >>! In T363288#9741173, @BCornwall wrote: > Could we please add a new, separate ticket for @Hghani's access an... 
[17:29:51] 10ops-eqiad, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet - https://phabricator.wikimedia.org/T359049#9741664 (10Jclark-ctr) [17:31:00] 10ops-eqiad, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet - https://phabricator.wikimedia.org/T359049#9741671 (10dcaro) \o/ the drive is listed now, will add it to the cluster (will take a bit), and close the task once... [17:38:08] (03PS1) 10Pppery: Delete "AM" and "PM" translations breaking search [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1023926 (https://phabricator.wikimedia.org/T363215) [17:39:14] (03CR) 10Pppery: "(I'm well aware that these files are generated code from upstream, and plan to fix the problem higher-up later, before I next run the gene" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1023926 (https://phabricator.wikimedia.org/T363215) (owner: 10Pppery) [17:41:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [17:45:56] (03CR) 10Pppery: "I know it doesn't matter, but out of curiousity which specific PHP 7 syntax did I add?" 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery) [17:46:45] 10ops-magru: remote hands directions for racking and cabling magru - https://phabricator.wikimedia.org/T363368 (10RobH) 03NEW p:05Triage→03Medium [17:47:07] 10ops-magru: remote hands directions for racking and cabling magru - https://phabricator.wikimedia.org/T363368#9741762 (10RobH) [17:51:11] 10ops-magru: remote hands directions for racking and cabling magru - https://phabricator.wikimedia.org/T363368#9741769 (10RobH) [17:55:05] (03PS1) 10Bernard Wang: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) [17:55:29] (03PS2) 10Bernard Wang: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) [17:57:03] (03CR) 10Herron: [C:03+2] Revert "prometheus: promote prometheus1006 to pushgateway duty" [puppet] - 10https://gerrit.wikimedia.org/r/1023155 (owner: 10Herron) [17:57:30] (03PS2) 10Herron: Revert "trafficserver: move prometheus-eqiad to prometheus1006" [puppet] - 10https://gerrit.wikimedia.org/r/1023156 [17:57:44] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9741773 (10Eevans) >>! In T362841#9741498, @Jclark-ctr wrote: > Looking at lshw.log and inventory on idrac it looks like all the drives are in order except sdf ,sdh are swapped in slots. after sdf rebuilds i... 
[17:58:01] (03CR) 10Herron: [C:03+2] Revert "trafficserver: move prometheus-eqiad to prometheus1006" [puppet] - 10https://gerrit.wikimedia.org/r/1023156 (owner: 10Herron) [17:58:23] (03PS2) 10Herron: Revert "promote prometheus1006 as pushgateway primary" [dns] - 10https://gerrit.wikimedia.org/r/1023154 [17:59:17] (03CR) 10Pppery: Merge in changes to qqq.json rather than overwriting them (031 comment) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery) [17:59:28] (03CR) 10Herron: [C:03+2] Revert "promote prometheus1006 as pushgateway primary" [dns] - 10https://gerrit.wikimedia.org/r/1023154 (owner: 10Herron) [18:00:04] brennen and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T1800). [18:00:25] (SystemdUnitFailed) firing: wmf_auto_restart_redis-server.service on idm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:01:26] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9741804 (10Jclark-ctr) Corrected typo [18:01:28] o/ [18:03:10] !log train 1.43.0-wmf.2 (T361396) status: no current blockers, rolling to group1 [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:37] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [18:05:12] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023929 (https://phabricator.wikimedia.org/T361396) [18:05:14] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023929 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) 
[18:05:56] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023929 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:07:52] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9741843 (10Eevans) Having some trouble adding sdf2 back into the array: `mdadm: Cannot open /dev/sdf2: Device or resource busy` :/ `lang=sh-session eevans@aqs1014:~$ sudo sgdisk -R /dev/sdf /dev/sde Warning... [18:09:33] (03PS1) 10Hashar: wm-patch-demo: only fetch from MediaWiki project [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1023930 (https://phabricator.wikimedia.org/T363355) [18:10:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:27] (03CR) 10Bartosz Dziewoński: "There are a few other projects: https://gitlab.wikimedia.org/repos/ci-tools/patchdemo/-/blob/master/repository-lists/all.txt" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1023930 (https://phabricator.wikimedia.org/T363355) (owner: 10Hashar) [18:12:36] (03PS2) 10Hashar: wm-patch-demo: only fetch from MediaWiki project [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1023930 (https://phabricator.wikimedia.org/T363355) [18:13:10] (03CR) 10Hashar: "Oh my I feared that :) Maybe it is ok to hardcode?" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1023930 (https://phabricator.wikimedia.org/T363355) (owner: 10Hashar) [18:13:48] (03CR) 10Bartosz Dziewoński: "I'm not sure if this is necessary, and I don't want us to have to maintain this list. 
Maybe let's see first if Patchdemo can deal with the" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1023930 (https://phabricator.wikimedia.org/T363355) (owner: 10Hashar) [18:13:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:15:25] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:16:11] o/ lurking [18:16:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T352010)', diff saved to https://phabricator.wikimedia.org/P61177 and previous config saved to /var/cache/conftool/dbconfig/20240424-181653-ladsgroup.json [18:17:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:17:14] (03CR) 10Hashar: [C:04-1] "Sure! I am holding on future modifications." [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1023930 (https://phabricator.wikimedia.org/T363355) (owner: 10Hashar) [18:20:13] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.2 refs T361396 [18:20:34] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [18:22:52] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf - https://phabricator.wikimedia.org/T363288#9741944 (10mpopov) Approved! 
(`airflow-analytics-product-admins` membership) [18:22:57] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9741946 (10mpopov) Approved! (`airflow-analytics-product-admins` membership) [18:25:51] 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9741949 (10bsisolak) Confirmed access is working [18:32:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P61178 and previous config saved to /var/cache/conftool/dbconfig/20240424-183200-ladsgroup.json [18:36:26] (03PS4) 10Herron: alertmanager: irc: clarify count and move firing to beginning [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) [18:41:22] (03CR) 10Dzahn: [C:03+1] "lgtm, I had to take a look at the "sni_support: strict" line being removed but seems fine:" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:47:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P61179 and previous config saved to /var/cache/conftool/dbconfig/20240424-184707-ladsgroup.json [18:50:52] (03PS6) 10Herron: alertmanager: irc: clarify count and move firing to beginning [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) [18:50:52] (03CR) 10Herron: "OK, please lmk what you think about PS6." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [18:57:12] !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@3f994d5]: (no justification provided) [18:57:40] !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@3f994d5]: (no justification provided) (duration: 00m 28s) [18:59:03] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377 (10jsn.sherman) 03NEW [19:02:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T352010)', diff saved to https://phabricator.wikimedia.org/P61180 and previous config saved to /var/cache/conftool/dbconfig/20240424-190214-ladsgroup.json [19:02:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [19:02:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [19:02:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T352010)', diff saved to https://phabricator.wikimedia.org/P61181 and previous config saved to /var/cache/conftool/dbconfig/20240424-190237-ladsgroup.json [19:02:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:02:59] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9742068 (10Dzahn) [19:06:49] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:06:58] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:08:20] !log bking@deploy1002 stop `consumer-cloudelastic` release to test alerting T359213 [19:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:39] T359213: 
Adapt Flink-related rdf-streaming-updater alerts for Cirrus Streaming Updater - https://phabricator.wikimedia.org/T359213 [19:09:21] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9742122 (10Dzahn) @DMburugu please approve if you can confirm :) [19:09:58] (03PS1) 10Ryan Kemper: elastic: add config for elastic110[3-7] [puppet] - 10https://gerrit.wikimedia.org/r/1023937 (https://phabricator.wikimedia.org/T361268) [19:13:56] (CirrusConsumerCloudelasticFlinkJobNotRunning) firing: ... [19:13:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [19:14:11] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Apply truststore changes — T352647 - eevans@cumin1002 [19:14:29] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [19:15:03] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:15:15] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:18:45] (CirrusStreamingUpdaterFlinkJobUnstable) firing: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnst [19:18:56] (CirrusConsumerCloudelasticFlinkJobNotRunning) 
resolved: ... [19:18:56] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [19:21:53] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! If it helps we can maybe pre-assign the primary IPs for the hosts - so you can add to common.yaml - but we'd need to manually updat" [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [19:23:45] (CirrusStreamingUpdaterFlinkJobUnstable) resolved: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUn [19:26:32] (03PS2) 10Ryan Kemper: elastic: add config for elastic110[3-7] [puppet] - 10https://gerrit.wikimedia.org/r/1023937 (https://phabricator.wikimedia.org/T361268) [19:27:44] (03CR) 10Bking: [C:03+2] elastic: add config for elastic110[3-7] [puppet] - 10https://gerrit.wikimedia.org/r/1023937 (https://phabricator.wikimedia.org/T361268) (owner: 10Ryan Kemper) [19:32:15] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release T363349 [19:33:03] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9742318 (10Eevans) > 2:23 PM i am swapping sdf again > 2:24 PM swapped with one that was just erased Ok, the newly erased device was detected as `sdi`. It has been added, and is... 
[19:35:53] (03PS8) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [19:49:54] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399 (10RobH) 03NEW [19:50:18] 10ops-eqiad, 06DC-Ops, 06serviceops: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9742501 (10RobH) [19:56:10] (03PS1) 10Bking: cirrus-streaming-updater: link alerts to DPE SRE [alerts] - 10https://gerrit.wikimedia.org/r/1023942 (https://phabricator.wikimedia.org/T359213) [19:56:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability, 13Patch-For-Review: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990#9742614 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [19:57:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability, 13Patch-For-Review: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990#9742618 (10VRiley-WMF) 05Open→03Resolved [19:57:42] (03CR) 10CI reject: [V:04-1] cirrus-streaming-updater: link alerts to DPE SRE [alerts] - 10https://gerrit.wikimedia.org/r/1023942 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [19:58:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability, 13Patch-For-Review: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990#9742616 (10VRiley-WMF) This was a duplicate ticket that was opened for https://phabricator.wikimedia.org/T36... 
[19:59:55] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2122/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T2000). nyaa~ [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:16] (03PS2) 10Bking: cirrus-streaming-updater: link alerts to DPE SRE [alerts] - 10https://gerrit.wikimedia.org/r/1023942 (https://phabricator.wikimedia.org/T359213) [20:07:45] (CirrusStreamingUpdaterFlinkJobUnstable) firing: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnst [20:13:23] (03CR) 10Dzahn: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [20:17:04] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2123/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:20:04] (03PS9) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) [20:20:44] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T363409 
(10phaultfinder) 03NEW [20:22:00] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image and increase metaspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020913 (owner: 10Ebernhardson) [20:22:55] (03Merged) 10jenkins-bot: cirrus: Update container image and increase metaspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020913 (owner: 10Ebernhardson) [20:23:58] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2124/co" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:24:06] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:24:10] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:26:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:30:17] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:32:45] (CirrusStreamingUpdaterFlinkJobUnstable) resolved: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUn [20:34:06] (03CR) 10Dzahn: [C:03+1] "lgtm - 
https://puppet-compiler.wmflabs.org/output/1018749/2125/ but try on a single POP first (disable puppet, merge, re-enable..)" [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:37:20] !log Downtiming the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [20:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:42] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [20:37:53] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on prometheus6002.drmrs.wmnet,prometheus5002.eqsin.wmnet,prometheus3003.esams.wmnet,prometheus4002.ulsfo.wmnet with reason: Downtiming the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [20:38:02] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on prometheus6002.drmrs.wmnet,prometheus5002.eqsin.wmnet,prometheus3003.esams.wmnet,prometheus4002.ulsfo.wmnet with reason: Downtiming the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [20:38:35] !log Disabling Puppet on the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [20:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:21] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1018749 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:53:22] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9742765 (10Dzahn) @Muehlenhoff Where does deploy* (deployment_server role both prod and wmcs) fit in? Since we are still on buster there. But want bullseye deployment_s... 
[21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240424T2100) [21:03:23] 06SRE, 06serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415 (10Dzahn) 03NEW [21:04:16] 06SRE, 06serviceops: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9742836 (10Dzahn) [21:04:58] (03PS1) 10Andrea Denisse: Revert "prometheus: Ensure TLS certificates are provided by CFSSL" [puppet] - 10https://gerrit.wikimedia.org/r/1023916 [21:05:18] (03CR) 10CI reject: [V:04-1] Revert "prometheus: Ensure TLS certificates are provided by CFSSL" [puppet] - 10https://gerrit.wikimedia.org/r/1023916 (owner: 10Andrea Denisse) [21:07:13] (03CR) 10Dzahn: [C:03+1] Revert "prometheus: Ensure TLS certificates are provided by CFSSL" [puppet] - 10https://gerrit.wikimedia.org/r/1023916 (owner: 10Andrea Denisse) [21:07:57] (03CR) 10Dzahn: [C:03+1] "CI just dislikes the long lines in the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1023916 (owner: 10Andrea Denisse) [21:08:20] (03PS2) 10Andrea Denisse: Revert "prometheus: Ensure TLS certificates are provided by CFSSL" [puppet] - 10https://gerrit.wikimedia.org/r/1023916 [21:09:55] (03CR) 10Andrea Denisse: [C:03+2] Revert "prometheus: Ensure TLS certificates are provided by CFSSL" [puppet] - 10https://gerrit.wikimedia.org/r/1023916 (owner: 10Andrea Denisse) [21:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:27] (03PS1) 10Ryan Kemper: elastic: bring elastic110[3-7] into svc [puppet] - 10https://gerrit.wikimedia.org/r/1023953 (https://phabricator.wikimedia.org/T361268) [21:11:37] (03CR) 10Bking: 
[C:03+2] Replace tabs with 4 spaces in tlsproxy nginx.conf [puppet] - 10https://gerrit.wikimedia.org/r/1023440 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [21:21:30] (03PS1) 10Dzahn: redis: use python3-redis to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023954 (https://phabricator.wikimedia.org/T363415) [21:21:50] (03PS1) 10Andrea Denisse: Revert "Revert "prometheus: Ensure TLS certificates are provided by CFSSL"" [puppet] - 10https://gerrit.wikimedia.org/r/1023917 [21:21:50] (03CR) 10CI reject: [V:04-1] redis: use python3-redis to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023954 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [21:22:08] (03CR) 10Bking: [C:03+2] elastic: bring elastic110[3-7] into svc [puppet] - 10https://gerrit.wikimedia.org/r/1023953 (https://phabricator.wikimedia.org/T361268) (owner: 10Ryan Kemper) [21:22:50] (03CR) 10Andrea Denisse: "Reverting because the change generates an invalid" [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (owner: 10Andrea Denisse) [21:23:08] (03PS2) 10Dzahn: redis: use python3-redis to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023954 (https://phabricator.wikimedia.org/T363415) [21:24:55] (03PS2) 10Scott French: hieradata: disable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023554 (https://phabricator.wikimedia.org/T358636) [21:24:55] (03PS2) 10Scott French: etcdmirror::instance: absent all resources [puppet] - 10https://gerrit.wikimedia.org/r/1023555 (https://phabricator.wikimedia.org/T358636) [21:24:55] (03PS2) 10Scott French: etcdmirror: reconfigure with full-keyspace replication [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) [21:24:55] (03PS2) 10Scott French: hieradata: reenable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) [21:27:17] (03CR) 10Scott French: "Thank 
you, Riccardo." [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [21:29:27] dzahn@cumin2002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade. [21:29:33] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [21:30:25] (03PS1) 10Dzahn: deployment_server: add bullseye support, python3 package names [puppet] - 10https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) [21:30:32] (03CR) 10Andrea Denisse: [C:04-1] "I've reverted this change as it produced an invalid Envoy configuration." [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (owner: 10Andrea Denisse) [21:31:46] (03CR) 10Dzahn: "not trying to set a version variable or anything. doing it this way so later the buster cleanup is easy and the change minimal" [puppet] - 10https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [21:34:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [21:35:39] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic110[3-7]\.eqiad\.wmnet [21:36:41] !log [Elastic] T361268 Pooled new hosts: `elastic110[3-7]` [21:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:00] T361268: Service implementation for elastic110[3-7] - https://phabricator.wikimedia.org/T361268 [21:52:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release T363349 [21:53:10] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 12), 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - 
https://phabricator.wikimedia.org/T361835#9742947 (10Scott_French) 05Open→03In progress Thanks, all, for the details shared thus far. Whi... [21:54:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:56:01] (03CR) 10Bking: [V:04-1] "We need to change the team name on these alerts, so symlinking isn't enough. Will check with Observability to see if there's a DRYer way t" [alerts] - 10https://gerrit.wikimedia.org/r/1023942 (https://phabricator.wikimedia.org/T359213) (owner: 10Bking) [21:57:45] (03PS1) 10Scott French: admin_ng: add namespace for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023956 (https://phabricator.wikimedia.org/T361835) [21:57:48] (03PS1) 10Scott French: DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) [21:57:50] (03PS1) 10Scott French: DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) [21:59:20] (03PS8) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [21:59:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:00:06] (03PS1) 10Scott French: kubernetes: add usernames for commons-impact-analytics to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1023959 (https://phabricator.wikimedia.org/T361835) [22:00:07] (03PS1) 10Scott French: DNM: cassandra: add commons_impact_analytics user [puppet] - 10https://gerrit.wikimedia.org/r/1023960 (https://phabricator.wikimedia.org/T361835) [22:00:43] (03CR) 10Gergő Tisza: [C:03+1] "Looks good, although I doubt this code path would have been ever hit in practice." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: 10Hashar) [22:02:13] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [22:02:21] (03PS1) 10Scott French: service: add commons-impact-analytics AQS 2.0 service [puppet] - 10https://gerrit.wikimedia.org/r/1023961 (https://phabricator.wikimedia.org/T361835) [22:02:22] (03PS1) 10Scott French: DNM: service: move commons-impact-analytics service to production state [puppet] - 10https://gerrit.wikimedia.org/r/1023962 (https://phabricator.wikimedia.org/T361835) [22:04:39] (CirrusSearchNodeIndexingNotIncreasing) resolved: (5) Elasticsearch instance elastic1103-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:05:13] (03CR) 10JHathaway: Postfix profile (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [22:06:12] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for 
nshahquinn-wmf - https://phabricator.wikimedia.org/T363288#9742969 (10BCornwall) [22:06:33] (03CR) 10Dzahn: [C:03+2] Add CDN configuration for new community-crm [puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:07:21] (03PS9) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [22:07:52] (03PS1) 10BCornwall: admin: Add nshahquinn-wmf to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1023963 (https://phabricator.wikimedia.org/T363288) [22:09:11] 06SRE, 10SRE-Access-Requests, 06Movement-Insights, 13Patch-For-Review: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf - https://phabricator.wikimedia.org/T363288#9742982 (10BCornwall) a:03BCornwall [22:10:01] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9742979 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall [22:10:12] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [22:10:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:12] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9742984 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall [22:12:16] (03PS1) 10Scott French: wmnet: add CNAME records for commons-impact-analytics (k8s ingress) [dns] - 10https://gerrit.wikimedia.org/r/1023964 (https://phabricator.wikimedia.org/T361835) [22:13:24] 
06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9742989 (10Dzahn) The email address contains the -ctr suffix. For contractors please provide an expiry_date and expiry_contact. On that date we w... [22:15:06] (03CR) 10Dzahn: [C:03+1] admin: Add nshahquinn-wmf to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1023963 (https://phabricator.wikimedia.org/T363288) (owner: 10BCornwall) [22:16:42] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9742998 (10Hghani) Hi, My contract expiry date is June 30th 2024. I believe the contact should be @OSefu-WMF. [22:18:00] (03CR) 10Scott French: "Thanks for offering to review, Hugh." [puppet] - 10https://gerrit.wikimedia.org/r/1023959 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [22:18:15] (03CR) 10BCornwall: [C:03+2] admin: Add nshahquinn-wmf to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1023963 (https://phabricator.wikimedia.org/T363288) (owner: 10BCornwall) [22:18:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:19:05] (03CR) 10Scott French: "I'll deploy this only after https://gerrit.wikimedia.org/r/1023959 is live." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023956 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [22:19:15] (03CR) 10Dzahn: [C:03+2] "this is currently slow deployed by puppet, in max 10 min all backends are done." 
[puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:20:30] 06SRE, 10SRE-Access-Requests, 06Movement-Insights, 13Patch-For-Review: Requesting membership in airflow-analytics-product-admins for nshahquinn-wmf - https://phabricator.wikimedia.org/T363288#9743002 (10BCornwall) 05In progress→03Resolved @nshahquinn-wmf Give it a few minutes to propagate and your... [22:25:35] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9743004 (10BCornwall) [22:26:59] (03PS1) 10BCornwall: admin: Move hghani to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) [22:27:01] (03CR) 10Dzahn: [C:03+2] "@dwisehaupt I see a login screen now 😊" [puppet] - 10https://gerrit.wikimedia.org/r/983951 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:28:17] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402#9743007 (10Dzahn) https://community-crm.wikimedia.org/ is now online 🎉 [22:31:44] (03PS1) 10Scott French: hieradata: make etcd in eqiad read-only [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) [22:31:46] (03PS1) 10Scott French: hieradata: return etcd in eqiad to read-write [puppet] - 10https://gerrit.wikimedia.org/r/1023967 (https://phabricator.wikimedia.org/T358636) [22:41:58] (03CR) 10Dzahn: [C:03+1] admin: Move hghani to airflow-analytics-product-admins [puppet] - 10https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) (owner: 10BCornwall) [22:46:43] 06SRE, 10SRE-Access-Requests, 06Movement-Insights, 13Patch-For-Review: Requesting membership in airflow-analytics-product-admins for hghani - 
https://phabricator.wikimedia.org/T363360#9743033 (10BCornwall) a:05BCornwall→03OSefu-WMF All we're waiting on is @OSefu-WMF 's approver [22:47:21] (03PS2) 10Scott French: hieradata: make etcd in eqiad read-only [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) [22:47:21] (03PS2) 10Scott French: hieradata: return etcd in eqiad to read-write [puppet] - 10https://gerrit.wikimedia.org/r/1023967 (https://phabricator.wikimedia.org/T358636) [22:47:41] (03PS1) 10Fabfur: benthos:haproxy_cache: field renaming moved to grok pattern [puppet] - 10https://gerrit.wikimedia.org/r/1023969 (https://phabricator.wikimedia.org/T363420) [22:47:59] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [22:48:18] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023967 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [23:00:11] (03CR) 10Jdlrobson: "What's the plan for merging this patch?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [23:32:44] (03CR) 10Scott French: "Manual PCC run for conf1009: https://puppet-compiler.wmflabs.org/output/1023966/2127" [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [23:38:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1023537 [23:38:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1023537 (owner: 10TrainBranchBot) [23:45:18] (03PS10) 10JHathaway: Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) [23:48:25] (03CR) 10CI reject: [V:04-1] Postfix profile [puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [23:59:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1023537 (owner: 10TrainBranchBot)