[00:01:00] SRE, Wikimedia-Mailing-lists: Puppet failing on mailman03.mailman.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T329647#9728468 (Dzahn) This instance was deleted recently per https://wikitech.wikimedia.org/wiki/Nova_Resource:Mailman/SAL
[00:02:30] (ProbeDown) firing: (4) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:07:31] SRE, serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261#9728471 (Dzahn) This was about https://gerrit.wikimedia.org/r/c/operations/puppet/+/668166/2/modules/profile/manifests/mediawiki/alerts.pp but now this seems all deleted fro...
[00:08:15] (CR) Dzahn: "this also made https://phabricator.wikimedia.org/T281261 invalid, right?" [puppet] - https://gerrit.wikimedia.org/r/885288 (owner: Giuseppe Lavagetto)
[00:09:09] SRE, serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261#9728472 (Dzahn) p:Triage→Low Check was deleted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/885288 by Giuseppe
[01:17:51] ops-esams, SRE, DC-Ops, Traffic: cp3079 bios settings - https://phabricator.wikimedia.org/T349314#9728534 (ssingh) Open→Resolved a:ssingh We fixed it but forgot to close this task so resolving. Thanks @Dzahn!
[02:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:51] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:51] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:51] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:55] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:51] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:58:51] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:02:30] (ProbeDown) firing: (4) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:18:51] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:42:30] (ProbeDown) firing: (6) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:47:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance
[04:48:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance
[04:48:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[04:48:51] (PS1) Marostegui: db1202: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1021681
[04:49:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[04:49:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1202', diff saved to https://phabricator.wikimedia.org/P60987 and previous config saved to /var/cache/conftool/dbconfig/20240419-044906-root.json
[04:50:05] (CR) Marostegui: [C:+2] db1202: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1021681 (owner: Marostegui)
[04:50:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1202.eqiad.wmnet with OS bookworm
[04:57:44] (PS1) Marostegui: Revert "db1202: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1020943
[05:02:17] !log dbmaint Upgrade s7 codfw to Bookworm and MariaDB 10.6 T362745
[05:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:22] T362745: Upgrade s7 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362745
[05:02:25] !log dbmaint Upgrade s7 eqiad to Bookworm and MariaDB 10.6 T362745
[05:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage
[05:06:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage
[05:15:24] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:21:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P60988 and previous config saved to /var/cache/conftool/dbconfig/20240419-052107-root.json
[05:22:22] (CR) Marostegui: [C:+2] Revert "db1202: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1020943 (owner: Marostegui)
[05:22:39] (PS1) Muehlenhoff: Add Leo to phabricator-admins [puppet] - https://gerrit.wikimedia.org/r/1021694
[05:23:44] (CR) Muehlenhoff: [C:+2] Add Leo to phabricator-admins [puppet] - https://gerrit.wikimedia.org/r/1021694 (owner: Muehlenhoff)
[05:26:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1202.eqiad.wmnet with OS bookworm
[05:36:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P60989 and previous config saved to /var/cache/conftool/dbconfig/20240419-053612-root.json
[05:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P60990 and previous config saved to /var/cache/conftool/dbconfig/20240419-055118-root.json
[05:53:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T352010)', diff saved to https://phabricator.wikimedia.org/P60991 and previous config saved to /var/cache/conftool/dbconfig/20240419-055303-ladsgroup.json
[05:53:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:53:51] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240419T0600)
[06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P60992 and previous config saved to /var/cache/conftool/dbconfig/20240419-060625-root.json
[06:08:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P60993 and previous config saved to /var/cache/conftool/dbconfig/20240419-060810-ladsgroup.json
[06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:21:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P60994 and previous config saved to /var/cache/conftool/dbconfig/20240419-062130-root.json
[06:23:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P60995 and previous config saved to /var/cache/conftool/dbconfig/20240419-062317-ladsgroup.json
[06:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P60996 and previous config saved to /var/cache/conftool/dbconfig/20240419-063636-root.json
[06:37:30] (ProbeDown) firing: (8) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:38:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T352010)', diff saved to https://phabricator.wikimedia.org/P60997 and previous config saved to /var/cache/conftool/dbconfig/20240419-063825-ladsgroup.json
[06:38:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[06:38:31] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:38:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[06:38:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T352010)', diff saved to https://phabricator.wikimedia.org/P60998 and previous config saved to /var/cache/conftool/dbconfig/20240419-063847-ladsgroup.json
[06:42:30] (ProbeDown) firing: (8) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:48:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:56] SRE, Bitu, DBA, Infrastructure-Foundations: Database request for Bitu Cloud DEV installation - https://phabricator.wikimedia.org/T362619#9728736 (ABran-WMF) @SLyngshede-WMF would you be OK using `idmclouddev` to comply with our //no underscore in the database name// policy?
[06:51:14] SRE, Bitu, DBA, Infrastructure-Foundations: Database request for Bitu Cloud DEV installation - https://phabricator.wikimedia.org/T362619#9728739 (SLyngshede-WMF) @ABran-WMF Yes, I have no strong feeling about the database name, so what ever fits policy and naming conventions is absolutely fine.
[06:51:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P60999 and previous config saved to /var/cache/conftool/dbconfig/20240419-065142-root.json
[06:53:41] (CR) Gmodena: [C:+1] benthos/haproxy: include haproxy current pid in messages [puppet] - https://gerrit.wikimedia.org/r/1021517 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[06:55:10] (CR) Gmodena: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1021517 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[06:58:33] (CR) Muehlenhoff: purged: add PKI cert handling (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1019866 (owner: CDobbins)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240419T0700)
[07:03:40] !log imported PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf11u2 to component/php74 (backport of latest PHP security fixes)
[07:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:17] !log installing PHP 7.4 security updates on cloudweb and bullseye snapshot hosts
[07:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:29] SRE, Bitu, DBA, Infrastructure-Foundations: Database request for Bitu Cloud DEV installation - https://phabricator.wikimedia.org/T362619#9728777 (ABran-WMF) p:Triage→Medium a:ABran-WMF
[07:24:23] !log installing Linux 6.1.85 on Bookworm hosts
[07:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:33] (PS1) Marostegui: db1194: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1021771
[07:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1194', diff saved to https://phabricator.wikimedia.org/P61001 and previous config saved to /var/cache/conftool/dbconfig/20240419-073638-root.json
[07:37:25] (CR) Marostegui: [C:+2] db1194: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1021771 (owner: Marostegui)
[07:38:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bookworm
[07:38:32] (PS1) Marostegui: Revert "db1194: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1021786
[07:38:43] (PS2) Marostegui: Revert "db1194: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1021786
[07:49:19] (PS1) Brouberol: modules: release a new version of app.job [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954)
[07:52:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
[07:52:46] (PS2) Brouberol: modules: release a new version of app.job [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954)
[07:54:13] (PS3) Brouberol: modules: release a new version of app.job [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954)
[07:55:17] (CR) Brouberol: "`diff" [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954) (owner: Brouberol)
[07:55:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
[08:00:43] (PS1) Brouberol: superset-next: Upgrade to Superset 3.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1021851 (https://phabricator.wikimedia.org/T358674)
[08:00:43] (PS1) Brouberol: superset: Upgrade to Superset 3.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1021852 (https://phabricator.wikimedia.org/T358674)
[08:05:06] (PS4) Brouberol: modules: release a new version of app.job [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954)
[08:08:51] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:39] (PS5) Brouberol: modules: release a new version of app.job [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954)
[08:10:24] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:21] (CR) Fabfur: [C:+2] benthos/haproxy: using hiera aliases for benthos socket address [puppet] - https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:11:36] (CR) Fabfur: benthos/haproxy: using hiera aliases for benthos socket address [puppet] - https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:12:23] (Abandoned) Brouberol: superset: upgrade to 3.1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1013239 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[08:12:35] (CR) Fabfur: [C:+2] benthos/haproxy: include haproxy current pid in messages [puppet] - https://gerrit.wikimedia.org/r/1021517 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:13:51] (JobUnavailable) firing: (2) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:15:54] (PS1) Fabfur: Revert "benthos/haproxy: include haproxy current pid in messages" [puppet] - https://gerrit.wikimedia.org/r/1021787
[08:17:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bookworm
[08:30:18] (PS1) Majavah: libraryupgrader: Add missing --auto to command line [puppet] - https://gerrit.wikimedia.org/r/1021862
[08:31:04] (PS1) Fabfur: benthos: ensure haproxy_pid is interpreted as number [puppet] - https://gerrit.wikimedia.org/r/1021863 (https://phabricator.wikimedia.org/T358109)
[08:34:06] (CR) Fabfur: [C:+2] benthos: ensure haproxy_pid is interpreted as number [puppet] - https://gerrit.wikimedia.org/r/1021863 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[08:36:37] (CR) Majavah: [C:+2] libraryupgrader: Add missing --auto to command line [puppet] - https://gerrit.wikimedia.org/r/1021862 (owner: Majavah)
[08:40:27] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:42:14] (PS1) Kevin Bazira: ml-services: add logo-detection isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1021398 (https://phabricator.wikimedia.org/T362749)
[08:43:09] (PS1) Slyngshede: P:idm Make some parameters optional. [puppet] - https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128)
[08:44:03] (CR) Ilias Sarantopoulos: [C:+1] ml-services: add logo-detection isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1021398 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[08:46:07] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1021398 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[08:46:57] (Merged) jenkins-bot: ml-services: add logo-detection isvc to experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1021398 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[08:47:01] (PS1) Majavah: O:wmcs: codfw1dev: net_ovs: install dhcp and metadata agents [puppet] - https://gerrit.wikimedia.org/r/1021867 (https://phabricator.wikimedia.org/T358761)
[08:48:09] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2039/console" [puppet] - https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:48:33] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2040/co" [puppet] - https://gerrit.wikimedia.org/r/1021867 (https://phabricator.wikimedia.org/T358761) (owner: Majavah)
[08:48:45] (CR) Btullis: [C:+1] "Great!" [deployment-charts] - https://gerrit.wikimedia.org/r/1021851 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[08:48:54] (CR) Muehlenhoff: P:idm Make some parameters optional. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[08:50:47] (PS2) Brouberol: Create the MPIC Kubernetes chart [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: Santiago Faci)
[08:50:52] (CR) Brouberol: [C:+2] superset-next: Upgrade to Superset 3.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1021851 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[08:50:56] (PS1) Muehlenhoff: heat: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1021868
[08:51:40] (Merged) jenkins-bot: superset-next: Upgrade to Superset 3.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1021851 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[08:51:51] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1021868 (owner: Muehlenhoff)
[08:52:55] (PS2) Majavah: O:wmcs: codfw1dev: net_ovs: install dhcp and metadata agents [puppet] - https://gerrit.wikimedia.org/r/1021867 (https://phabricator.wikimedia.org/T358761)
[08:52:55] (PS1) Majavah: openstack: neutron: fix some non-breaking spaces [puppet] - https://gerrit.wikimedia.org/r/1021870
[08:53:41] (CR) Majavah: [C:+2] openstack: neutron: fix some non-breaking spaces [puppet] - https://gerrit.wikimedia.org/r/1021870 (owner: Majavah)
[08:53:48] (PS1) JMeybohm: _scaffole: Don't include tag in image_name preset responses [deployment-charts] - https://gerrit.wikimedia.org/r/1021871
[08:54:04] (CR) Marostegui: [C:+2] Revert "db1194: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1021786 (owner: Marostegui)
[08:54:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61003 and previous config saved to /var/cache/conftool/dbconfig/20240419-085404-root.json
[08:54:26] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:54:58] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[08:55:04] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[08:56:03] (PS2) Slyngshede: P:idm Make some parameters optional. [puppet] - https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128)
[08:56:42] (CR) Brouberol: [C:+1] "Nicely done, and sorry I couldn't be of more help" [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman)
[08:59:42] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for Lina_Farid_WMDE - https://phabricator.wikimedia.org/T362959 (Lina_Farid_WMDE) NEW
[08:59:45] (CR) Majavah: [C:+1] heat: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1021868 (owner: Muehlenhoff)
[09:00:07] (CR) Btullis: [C:+1] "LGTB" [deployment-charts] - https://gerrit.wikimedia.org/r/1021852 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[09:00:11] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9728960 (Lina_Farid_WMDE)
[09:00:24] (CR) Brouberol: [C:+2] superset: Upgrade to Superset 3.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1021852 (https://phabricator.wikimedia.org/T358674) (owner: Brouberol)
[09:00:38] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2041/console" [puppet] - https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[09:00:42] (PS1) Fabfur: benthos: fix conf to actually transform haproxy_pid field into number [puppet] - https://gerrit.wikimedia.org/r/1021873 (https://phabricator.wikimedia.org/T358109)
[09:02:00] (Abandoned) Fabfur: Revert "benthos/haproxy: include haproxy current pid in messages" [puppet] - https://gerrit.wikimedia.org/r/1021787 (owner: Fabfur)
[09:02:34] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[09:03:26] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[09:04:28] (CR) Fabfur: [C:+2] benthos: fix conf to actually transform haproxy_pid field into number [puppet] - https://gerrit.wikimedia.org/r/1021873 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[09:09:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61004 and previous config saved to /var/cache/conftool/dbconfig/20240419-090910-root.json
[09:16:38] (PS1) Muehlenhoff: Remove obsolete script to detect ever-changing puppet runs [puppet] - https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345090)
[09:16:40] (PS1) Btullis: Enable monitoring on matomo1003 [puppet] - https://gerrit.wikimedia.org/r/1021876 (https://phabricator.wikimedia.org/T349397)
[09:16:50] (PS2) Muehlenhoff: Remove obsolete script to detect ever-changing puppet runs [puppet] - https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345090)
[09:17:30] (ProbeDown) firing: (8) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:11] (CR) Btullis: [V:+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2042/" [puppet] - https://gerrit.wikimedia.org/r/1021876 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[09:21:03] (CR) Fabfur: ncredir,benthos: Provide benthos support on ncredir (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: Vgutierrez)
[09:21:05] (PS1) Clément Goubert: build-bare-slim: Stop building wikimedia-buster [puppet] - https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518)
[09:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61005 and previous config saved to /var/cache/conftool/dbconfig/20240419-092415-root.json
[09:24:37] (PS1) JMeybohm: Replace wikimedia-buster base images with buster [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021878 (https://phabricator.wikimedia.org/T362518)
[09:25:40] (CR) Fabfur: ncredir,benthos: Provide benthos support on ncredir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: Vgutierrez)
[09:27:30] (ProbeDown) firing: (8) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:33:15] (CR) Btullis: [V:+1 C:+2] Enable monitoring on matomo1003 [puppet] - https://gerrit.wikimedia.org/r/1021876 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[09:36:58] (CR) JMeybohm: "There are some users still.. (https://codesearch.wmcloud.org/search/?q=wikimedia-buster&files=&excludeFiles=&repos=#releng/release)" [puppet] - https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert)
[09:39:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61006 and previous config saved to /var/cache/conftool/dbconfig/20240419-093921-root.json
[09:39:27] (CR) Clément Goubert: "Fair enough. If we keep it around, we should put the date tagging code back in, because that was *very* confusing." [puppet] - https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert)
[09:43:12] (PS1) Btullis: Correct the name of the matomo database used for backups [puppet] - https://gerrit.wikimedia.org/r/1021881 (https://phabricator.wikimedia.org/T349397)
[09:46:10] (CR) JMeybohm: [C:-1] "Thanks for working on this!" [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman)
[09:47:38] (PS2) Clément Goubert: build-bare-slim: Date tag wikimedia-buster images [puppet] - https://gerrit.wikimedia.org/r/1021877 (https://phabricator.wikimedia.org/T362518)
[09:50:53] (CR) Vgutierrez: [V:+1] ncredir,benthos: Provide benthos support on ncredir (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: Vgutierrez)
[09:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61007 and previous config saved to /var/cache/conftool/dbconfig/20240419-095427-root.json
[09:58:25] (CR) Brouberol: [C:+1] "Oops, I should have seen that as well." [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman)
[09:59:10] ops-codfw, SRE, serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9729097 (Clement_Goubert) I suppose that can be hotswapped? Let us know if it can't, we'll drain and cordon the host for the disk swap.
[09:59:56] (CR) Fabfur: ncredir,benthos: Provide benthos support on ncredir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: Vgutierrez)
[10:01:53] (CR) Btullis: [V:+1 C:+2] Correct the name of the matomo database used for backups [puppet] - https://gerrit.wikimedia.org/r/1021881 (https://phabricator.wikimedia.org/T349397) (owner: Btullis)
[10:03:07] (CR) Vgutierrez: [V:+1] ncredir,benthos: Provide benthos support on ncredir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: Vgutierrez)
[10:09:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61008 and previous config saved to /var/cache/conftool/dbconfig/20240419-100933-root.json
[10:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:15:03] (CR) Clément Goubert: [C:+1] "DNM before 20240423 to avoid unintended consequences on WMF holiday" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021878 (https://phabricator.wikimedia.org/T362518) (owner: JMeybohm)
[10:22:22] (CR) Majavah: [C:+2] O:wmcs: codfw1dev: net_ovs: install dhcp and metadata agents [puppet] - https://gerrit.wikimedia.org/r/1021867 (https://phabricator.wikimedia.org/T358761) (owner: Majavah)
[10:24:08] SRE, serviceops: Update grafana link for mediawiki-error-rate-$cluster in icinga check - https://phabricator.wikimedia.org/T281261#9729176 (Clement_Goubert) Open→Invalid Yes, alert was moved to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/media...
[10:24:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61009 and previous config saved to /var/cache/conftool/dbconfig/20240419-102438-root.json
[10:28:39] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128) (owner: Slyngshede)
[10:29:21] (PS1) Clément Goubert: CommonSettings.php: Fix jobrunner hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/1021886 (https://phabricator.wikimedia.org/T349796)
[10:29:58] (PS39) Klausman: deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428)
[10:32:07] (PS40) Klausman: deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428)
[10:35:09] (CR) Klausman: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2045/co" [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman)
[10:36:11] (PS1) Btullis: Swith matomo1002 to the insetup::data_engineering role [puppet] - https://gerrit.wikimedia.org/r/1021889 (https://phabricator.wikimedia.org/T349397)
[10:36:28] (PS2) Btullis: Switch matomo1002 to the insetup::data_engineering role [puppet] - https://gerrit.wikimedia.org/r/1021889 (https://phabricator.wikimedia.org/T349397)
[10:37:15] (PS3)
10Btullis: Switch matomo1002 to the insetup::data_engineering role [puppet] - 10https://gerrit.wikimedia.org/r/1021889 (https://phabricator.wikimedia.org/T349397) [10:39:17] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2046/co" [puppet] - 10https://gerrit.wikimedia.org/r/1021889 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [10:40:36] (03CR) 10Slyngshede: [C:04-1] "Link to Phabricator and revision seems wrong." [puppet] - 10https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345090) (owner: 10Muehlenhoff) [10:40:40] (03CR) 10Klausman: [V:03+1] "The broken PCC run is due to me thinking "I'll just clean up some variable names" and getting it wrong --- fixed now." [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:41:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:41:34] (03CR) 10Slyngshede: [V:03+1] P:idm Make some parameters optional. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [10:41:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:41:41] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idm Make some parameters optional. 
[puppet] - 10https://gerrit.wikimedia.org/r/1021866 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [10:41:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T352010)', diff saved to https://phabricator.wikimedia.org/P61010 and previous config saved to /var/cache/conftool/dbconfig/20240419-104144-ladsgroup.json [10:41:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:48:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:56] (03PS3) 10Santiago Faci: Create the MPIC Kubernetes chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) [10:50:11] (03PS1) 10Btullis: Update prometheus config to reflect matomo profile change [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) [10:51:55] (03CR) 10Santiago Faci: "Thanks for the comments! They have been fixed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [10:52:03] (03PS1) 10Btullis: Remove the piwik role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1021893 (https://phabricator.wikimedia.org/T349397) [10:52:47] (03CR) 10Slyngshede: [C:03+2] Initial documentation for the Bitu API. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 (owner: 10Slyngshede) [10:53:04] (03PS2) 10Btullis: Remove the piwik role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1021893 (https://phabricator.wikimedia.org/T349397) [10:53:30] (03CR) 10Slyngshede: [C:03+2] Keymanagement, fix parsing and display of FIDO/U2F keys [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 (owner: 10Slyngshede) [10:53:45] (03PS1) 10Majavah: openstack: neutron: set dhcp interface driver correctly [puppet] - 10https://gerrit.wikimedia.org/r/1021894 [10:53:58] (03PS2) 10Majavah: openstack: neutron: set dhcp interface driver correctly [puppet] - 10https://gerrit.wikimedia.org/r/1021894 (https://phabricator.wikimedia.org/T358761) [10:54:10] (03PS3) 10Majavah: openstack: neutron: set dhcp interface driver correctly [puppet] - 10https://gerrit.wikimedia.org/r/1021894 (https://phabricator.wikimedia.org/T358761) [10:54:11] (03Merged) 10jenkins-bot: Initial documentation for the Bitu API. [software/bitu] - 10https://gerrit.wikimedia.org/r/1020802 (owner: 10Slyngshede) [10:55:00] (03Merged) 10jenkins-bot: Keymanagement, fix parsing and display of FIDO/U2F keys [software/bitu] - 10https://gerrit.wikimedia.org/r/1020836 (owner: 10Slyngshede) [10:55:22] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1021894 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [10:56:59] (03CR) 10Majavah: [V:03+1 C:03+2] openstack: neutron: set dhcp interface driver correctly [puppet] - 10https://gerrit.wikimedia.org/r/1021894 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240419T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240419T1100). [11:00:05] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet [11:00:51] no gitlab upgrade today [11:02:24] (03PS1) 10Muehlenhoff: Install a Puppet generator to create a known hosts file for Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) [11:02:45] (03CR) 10CI reject: [V:04-1] Install a Puppet generator to create a known hosts file for Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [11:03:40] (03PS2) 10Muehlenhoff: Install a Puppet generator to create a known hosts file for Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) [11:05:13] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet [11:08:59] (03CR) 10Btullis: [V:03+1 C:03+2] Switch matomo1002 to the insetup::data_engineering role [puppet] - 10https://gerrit.wikimedia.org/r/1021889 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [11:15:59] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts matomo1002.eqiad.wmnet [11:17:30] (ProbeDown) firing: (8) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:33] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [11:22:30] (ProbeDown) firing: (8) Service wdqs2010:443 has failed probes 
(http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:56] (03PS45) 10Klausman: deployment_server: Change Puppet query for ML Cassandra Clusters [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [11:22:57] (03CR) 10Klausman: [V:03+1] "I've inlined the generated Hash entirely, resulting in all the clusters being expanded. For the sake of easier use in deployment charts (l" [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:23:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P61013 and previous config saved to /var/cache/conftool/dbconfig/20240419-112309-ladsgroup.json [11:23:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:24:59] (03CR) 10Klausman: "One addendum: currently, this picks up what looks like test instances (at the bottom of the PCC diff). 
I am not sure whether those should " [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:27:02] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: matomo1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [11:27:13] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:28:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: matomo1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [11:28:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:28:07] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts matomo1002.eqiad.wmnet [11:28:18] (03CR) 10Cathal Mooney: [C:03+2] Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [11:28:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:28:56] (03Merged) 10jenkins-bot: Add magru to homer-public [homer/public] - 10https://gerrit.wikimedia.org/r/1019292 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [11:31:16] (03CR) 10Klausman: [C:03+1] admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [11:31:47] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9729282 (10cmooney) @ssingh I've reserved the following addresses in Netbox for the LVS now, let me know if you need any more info or if I can hel... 
[11:32:40] (03PS1) 10Muehlenhoff: Switch matomo role to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1021899 (https://phabricator.wikimedia.org/T349619) [11:37:30] (ProbeDown) firing: (10) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P61014 and previous config saved to /var/cache/conftool/dbconfig/20240419-113816-ladsgroup.json [11:42:13] (03PS46) 10Klausman: deployment_server: Add Cassandra to autogenerated external svcs [puppet] - 10https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) [11:48:59] (03PS1) 10Btullis: Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) [11:49:32] (03CR) 10CI reject: [V:04-1] Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [11:50:50] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2058/co" [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [11:53:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P61015 and previous config saved to /var/cache/conftool/dbconfig/20240419-115323-ladsgroup.json [11:56:52] (03PS1) 10Jcrespo: dbbackups: Add dbprov1005 to the hosts that can dump eqiad backup sources [puppet] - 10https://gerrit.wikimedia.org/r/1021903 (https://phabricator.wikimedia.org/T362509) 
[11:57:45] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:59:37] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 75% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021904 (https://phabricator.wikimedia.org/T362323) [12:00:02] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding network interface DNS magru. - cmooney@cumin1002" [12:00:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding network interface DNS magru. - cmooney@cumin1002" [12:00:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:02:09] (03PS1) 10Clément Goubert: trafficserver: move 75% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1021905 (https://phabricator.wikimedia.org/T362323) [12:08:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P61017 and previous config saved to /var/cache/conftool/dbconfig/20240419-120831-ladsgroup.json [12:08:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [12:08:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:08:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [12:08:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T352010)', diff saved to https://phabricator.wikimedia.org/P61018 and previous config saved to /var/cache/conftool/dbconfig/20240419-120853-ladsgroup.json [12:12:30] (ProbeDown) firing: (10) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:51] (JobUnavailable) firing: (2) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:13:51] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:12] (03PS1) 10Ilias Sarantopoulos: ml-services: update keras version in logo detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021908 (https://phabricator.wikimedia.org/T362749) [12:19:13] !log depool ncredir2001 [12:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:11] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack_apis_response: increase threshold [alerts] - 10https://gerrit.wikimedia.org/r/1021909 [12:23:00] (03CR) 10Klausman: [C:03+1] ml-services: update keras version in logo detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021908 (https://phabricator.wikimedia.org/T362749) (owner: 10Ilias Sarantopoulos) [12:26:50] (03PS4) 10Santiago Faci: Create the MPIC Kubernetes chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) [12:29:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db1125.eqiad.wmnet [12:29:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1125.eqiad.wmnet onto db1125.eqiad.wmnet [12:29:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of 
db1178.eqiad.wmnet onto db1178.eqiad.wmnet [12:29:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [12:30:54] (03PS3) 10Muehlenhoff: Remove obsolete script to detect ever-changing puppet runs [puppet] - 10https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345909) [12:31:14] (03CR) 10Muehlenhoff: "Indeed, fixed." [puppet] - 10https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345909) (owner: 10Muehlenhoff) [12:33:33] (03PS7) 10Esanders: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [12:36:50] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345909) (owner: 10Muehlenhoff) [12:37:03] (03CR) 10Slyngshede: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1021875 (https://phabricator.wikimedia.org/T345909) (owner: 10Muehlenhoff) [12:37:06] (03CR) 10Kamila Součková: [C:03+1] mw-web, mw-api-ext: Raise replicas for 75% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021904 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [12:37:25] (03CR) 10Kamila Součková: [C:03+1] trafficserver: move 75% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1021905 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [12:42:04] (03PS1) 10Slyngshede: API: Introduce settings parameter to enable API. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1021912 [12:46:51] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021400 [12:47:33] (03PS2) 10Btullis: Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) [12:48:08] (03CR) 10CI reject: [V:04-1] Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [12:50:51] 06SRE, 10Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513#9729434 (10SCherukuwada) 05Open→03Resolved I'm comfortable closing this task as resolved given that I've been getting search console requests from people fo... [12:51:29] (03PS3) 10Btullis: Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) [12:52:03] (03CR) 10CI reject: [V:04-1] Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [12:52:30] (ProbeDown) firing: (12) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:02] (03PS1) 10Kevin Bazira: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) [12:56:24] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1021506 (owner: 10Elukey) [12:56:29] (03CR) 10Elukey: [C:03+2] profile::httpbb: fix liftwing_staging tests [puppet] - 10https://gerrit.wikimedia.org/r/1021506 (owner: 10Elukey) [12:56:37] (03PS1) 10Kevin Bazira: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021402 (https://phabricator.wikimedia.org/T362749) [12:57:30] (ProbeDown) firing: (14) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:58:17] (03PS1) 10Elukey: role::restbase::production: change Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1021915 (https://phabricator.wikimedia.org/T352647) [12:58:51] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update keras version in logo detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021908 (https://phabricator.wikimedia.org/T362749) (owner: 10Ilias Sarantopoulos) [12:58:52] (03PS2) 10Kevin Bazira: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) [12:59:20] (03CR) 10Ilias Sarantopoulos: [V:03+2 C:03+2] ml-services: update keras version in logo detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021908 (https://phabricator.wikimedia.org/T362749) (owner: 10Ilias Sarantopoulos) [12:59:37] (03PS3) 10Kevin Bazira: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) [12:59:47] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2061/co" [puppet] - 10https://gerrit.wikimedia.org/r/1021915 
(https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:00:21] (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: update keras version in logo detection" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021789 [13:00:48] (03PS2) 10JMeybohm: _scaffole: Don't include tag in image_name preset responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021871 [13:00:48] (03PS1) 10JMeybohm: New module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021917 (https://phabricator.wikimedia.org/T362978) [13:00:50] (03PS1) 10JMeybohm: Fix mcrouter module to work our of the box from scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021918 (https://phabricator.wikimedia.org/T355237) [13:00:53] (03PS1) 10JMeybohm: modules: Add restrictedSecurityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021919 (https://phabricator.wikimedia.org/T362978) [13:01:36] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021403 [13:01:55] (03CR) 10Klausman: [C:03+1] Revert "ml-services: update keras version in logo detection" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021789 (owner: 10Ilias Sarantopoulos) [13:02:30] (ProbeDown) firing: (14) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:03:20] (03CR) 10Ilias Sarantopoulos: [C:03+2] Revert "ml-services: update keras version in logo detection" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021789 (owner: 10Ilias Sarantopoulos) [13:03:25] (SystemdUnitFailed) resolved: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:22] (03Merged) 10jenkins-bot: Revert "ml-services: update keras version in logo detection" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021789 (owner: 10Ilias Sarantopoulos) [13:04:25] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9729450 (10cmooney) >>! In T362421#9710346, @ayounsi wrote: > Prefixes assigned in Netbox: https://netbox.wikimedia.org/ipam/prefixes/?site_id=11 Thanks! > Next step is to c... [13:04:59] (03PS4) 10Kevin Bazira: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) [13:05:09] (03PS1) 10Cathal Mooney: Add dummy IPs and uncomment vars for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1021920 (https://phabricator.wikimedia.org/T362421) [13:06:20] (03CR) 10Ilias Sarantopoulos: ml-services: upgrade OS in logo-detection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [13:09:59] (03CR) 10Cathal Mooney: [C:03+2] Add dummy IPs and uncomment vars for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1021920 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [13:10:34] (03Merged) 10jenkins-bot: Add dummy IPs and uncomment vars for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1021920 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [13:12:10] (03PS1) 10Jforrester: [WIP] Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) [13:13:23] (03PS5) 10Kevin Bazira: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) [13:14:04] (03CR) 10Kevin 
Bazira: ml-services: upgrade OS in logo-detection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [13:18:35] (03PS1) 10Elukey: ml-services: add request payload logging to all revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021923 (https://phabricator.wikimedia.org/T362663) [13:20:55] (03CR) 10Majavah: [WIP] Switch php7.4-cli to bullseye and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: 10Jforrester) [13:21:16] (03PS2) 10Elukey: ml-services: add request payload logging to all revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021923 (https://phabricator.wikimedia.org/T362663) [13:22:18] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [13:22:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [13:22:34] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [13:22:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [13:22:56] (03CR) 10Jforrester: [WIP] Switch php7.4-cli to bullseye and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: 10Jforrester) [13:24:25] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021404 [13:24:28] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: upgrade OS in logo-detection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin 
Bazira) [13:24:50] (03CR) 10Kevin Bazira: [C:03+2] ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [13:24:50] (03CR) 10Klausman: [C:03+1] ml-services: add request payload logging to all revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021923 (https://phabricator.wikimedia.org/T362663) (owner: 10Elukey) [13:25:33] (03CR) 10Klausman: [C:03+1] ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [13:25:40] (03Merged) 10jenkins-bot: ml-services: upgrade OS in logo-detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021401 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira) [13:27:03] (03CR) 10Majavah: [WIP] Switch php7.4-cli to bullseye and cascade (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: 10Jforrester) [13:27:34] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [13:27:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1178.eqiad.wmnet onto db1178.eqiad.wmnet [13:28:11] (03Abandoned) 10Majavah: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021404 (owner: 10PipelineBot) [13:28:15] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:31:40] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] "Merging this here, will hold on updating the submodule in the deploy repo until after testing." 
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery) [13:36:58] (CR) Jforrester: [WIP] Switch php7.4-cli to bullseye and cascade (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) (owner: Jforrester) [13:37:49] (PS2) Jforrester: [WIP] Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T362981) [13:41:04] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1021405 [13:42:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T352010)', diff saved to https://phabricator.wikimedia.org/P61020 and previous config saved to /var/cache/conftool/dbconfig/20240419-134204-ladsgroup.json [13:42:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:49:32] (Abandoned) Bking: elastic: assign prod role to elastic2088 [puppet] - https://gerrit.wikimedia.org/r/988735 (https://phabricator.wikimedia.org/T353392) (owner: Bking) [13:51:48] (CR) Vgutierrez: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2062/console" [puppet] - https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [13:51:50] (CR) Brennen Bearnes: [V:+2 C:+2] "Confirmed those langs show up in the dropdown, and nothing else seems to break.
We will most likely deploy the current state of this repo " [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery) [13:52:19] (CR) Vgutierrez: [V:+1 C:+1] benthos/haproxy: using hiera aliases for benthos socket address [puppet] - https://gerrit.wikimedia.org/r/1021505 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [13:57:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P61021 and previous config saved to /var/cache/conftool/dbconfig/20240419-135711-ladsgroup.json [13:57:27] (CR) Jcrespo: "Should be merged and deployed before Monday to avoid dump errors." [puppet] - https://gerrit.wikimedia.org/r/1021903 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo) [13:58:35] (PS1) EoghanGaffney: [apt-staging] Package puller updates [puppet] - https://gerrit.wikimedia.org/r/1021948 [13:58:57] (CR) CI reject: [V:-1] [apt-staging] Package puller updates [puppet] - https://gerrit.wikimedia.org/r/1021948 (owner: EoghanGaffney) [14:12:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P61022 and previous config saved to /var/cache/conftool/dbconfig/20240419-141218-ladsgroup.json [14:13:51] (SystemdUnitFailed) firing: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:09] SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985 (cmooney) NEW p:Triage→Low [14:15:51] (PS5) Elukey: admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] -
https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) [14:16:17] (CR) Elukey: "Updated the change to be able to support multiple use cases, like explicit vs implicit sidecar proxy." [deployment-charts] - https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) (owner: Elukey) [14:17:00] SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#9729639 (cmooney) [14:20:00] (PS8) Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363) [14:20:12] (PS10) Vgutierrez: ncredir,benthos: Provide benthos support on ncredir [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) [14:20:29] (PS7) Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) [14:20:53] (CR) Vgutierrez: ncredir,benthos: Provide benthos support on ncredir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1021485 (https://phabricator.wikimedia.org/T362776) (owner: Vgutierrez) [14:21:08] (PS1) AikoChou: ml-services: update batch revertrisk LA image in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1021966 (https://phabricator.wikimedia.org/T358744) [14:21:16] (CR) Pppery: [C:+1] Replace a strlen(null) call for PHP 8.1 [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1020170 (https://phabricator.wikimedia.org/T342244) (owner: Aklapper) [14:22:01] (CR) AikoChou: [C:+1] ml-services: add request payload logging to all revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021923 (https://phabricator.wikimedia.org/T362663) (owner: Elukey) [14:22:55] (CR) FNegri: [C:-1] "I looked at the
graph for the past month, and it seems that this change would not make a difference to the number of alerts we receive: wh" [alerts] - https://gerrit.wikimedia.org/r/1021909 (owner: Arturo Borrero Gonzalez) [14:23:22] (CR) Andrew Bogott: "This seems OK. Usually this alert seems to fire after one or more of the services flaps but we should be getting an alert for that anyway." [alerts] - https://gerrit.wikimedia.org/r/1021909 (owner: Arturo Borrero Gonzalez) [14:24:16] (CR) Andrew Bogott: "...ok, I rescind this having read fnegri's comment. If this doesn't actually prevent the alert then it's probably a step backwards." [alerts] - https://gerrit.wikimedia.org/r/1021909 (owner: Arturo Borrero Gonzalez) [14:26:55] (PS1) Cathal Mooney: Set magru DHCP relay server to install1004 [homer/public] - https://gerrit.wikimedia.org/r/1021967 (https://phabricator.wikimedia.org/T362421) [14:27:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T352010)', diff saved to https://phabricator.wikimedia.org/P61023 and previous config saved to /var/cache/conftool/dbconfig/20240419-142726-ladsgroup.json [14:27:31] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:28:05] (CR) Cathal Mooney: [C:+2] Set magru DHCP relay server to install1004 [homer/public] - https://gerrit.wikimedia.org/r/1021967 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [14:28:42] (PS1) Majavah: openstack: neutron: Connect OVS agents to provider networks [puppet] - https://gerrit.wikimedia.org/r/1021968 (https://phabricator.wikimedia.org/T358761) [14:28:44] (Merged) jenkins-bot: Set magru DHCP relay server to install1004 [homer/public] - https://gerrit.wikimedia.org/r/1021967 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [14:29:11] (CR) CI reject: [V:-1] openstack: neutron: Connect OVS agents to provider networks [puppet] -
https://gerrit.wikimedia.org/r/1021968 (https://phabricator.wikimedia.org/T358761) (owner: Majavah) [14:29:46] (CR) Ilias Sarantopoulos: [C:+1] ml-services: add request payload logging to all revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021923 (https://phabricator.wikimedia.org/T362663) (owner: Elukey) [14:30:33] (PS2) Majavah: openstack: neutron: Connect OVS agents to provider networks [puppet] - https://gerrit.wikimedia.org/r/1021968 (https://phabricator.wikimedia.org/T358761) [14:30:47] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update batch revertrisk LA image in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1021966 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou) [14:31:54] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1021968 (https://phabricator.wikimedia.org/T358761) (owner: Majavah) [14:32:04] (CR) Brouberol: [C:+1] "I checked the PCC output, and had a conversation with Tobias, in which I checked the output of various nodetool status commands and compar" [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman) [14:32:37] (CR) Cathal Mooney: [C:+2] Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) (owner: Cathal Mooney) [14:33:31] (Merged) jenkins-bot: Netbox custom script to add additional IPv4 addresses to host [software/netbox-extras] - https://gerrit.wikimedia.org/r/1017064 (https://phabricator.wikimedia.org/T358096) (owner: Cathal Mooney) [14:37:13] (CR) Majavah: [V:+1 C:+2] openstack: neutron: Connect OVS agents to provider networks [puppet] - https://gerrit.wikimedia.org/r/1021968
(https://phabricator.wikimedia.org/T358761) (owner: Majavah) [14:38:51] (JobUnavailable) firing: (3) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:45] (CR) AikoChou: [C:+2] ml-services: update batch revertrisk LA image in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1021966 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou) [14:46:41] (Merged) jenkins-bot: ml-services: update batch revertrisk LA image in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1021966 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou) [14:59:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T352010)', diff saved to https://phabricator.wikimedia.org/P61025 and previous config saved to /var/cache/conftool/dbconfig/20240419-145907-ladsgroup.json [14:59:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:00:24] (JobUnavailable) firing: (3) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:43] SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#9729706 (cmooney) [15:09:01] (PS2) EoghanGaffney: [apt-staging] Package puller updates [puppet] - https://gerrit.wikimedia.org/r/1021948 [15:11:25] !log repool ncredir2001 [15:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:41] (PS1) Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] -
https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [15:14:02] (CR) Elukey: [C:+2] ml-services: add request payload logging to all revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021923 (https://phabricator.wikimedia.org/T362663) (owner: Elukey) [15:14:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P61026 and previous config saved to /var/cache/conftool/dbconfig/20240419-151415-ladsgroup.json [15:14:47] (Abandoned) Elukey: ml-services: force HTTP in revert-risk agnostic staging [deployment-charts] - https://gerrit.wikimedia.org/r/984215 (https://phabricator.wikimedia.org/T353622) (owner: Elukey) [15:17:55] (PS2) Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [15:18:41] (CR) CI reject: [V:-1] kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) (owner: Elukey) [15:18:51] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P61027 and previous config saved to /var/cache/conftool/dbconfig/20240419-152922-ladsgroup.json [15:35:43] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:35:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:35:52]
!log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:35:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:36:16] (CR) JMeybohm: [C:-1] "> One addendum: currently, this picks up what looks like test instances (at the bottom of the PCC diff). I am not sure whether those shoul" [puppet] - https://gerrit.wikimedia.org/r/1020194 (https://phabricator.wikimedia.org/T360428) (owner: Klausman) [15:39:12] ops-eqiad, DC-Ops, SRE Observability: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990 (colewhite) NEW p:Triage→High [15:39:14] (PS3) Elukey: kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) [15:39:21] ops-eqiad, DC-Ops, SRE Observability: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990#9729838 (colewhite) [15:41:01] ops-eqiad, DC-Ops, SRE Observability: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 - https://phabricator.wikimedia.org/T362990#9729846 (colewhite) [15:42:14] (PS1) Cwhite: promote prometheus1006 as pushgateway primary [dns] - https://gerrit.wikimedia.org/r/1022027 (https://phabricator.wikimedia.org/T362989) [15:43:28] (PS1) Cwhite: prometheus: promote prometheus1006 to pushgateway duty [puppet] - https://gerrit.wikimedia.org/r/1022028 (https://phabricator.wikimedia.org/T362989) [15:44:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T352010)', diff saved to https://phabricator.wikimedia.org/P61028 and previous config saved to /var/cache/conftool/dbconfig/20240419-154430-ladsgroup.json [15:44:32] !log ladsgroup@cumin1002 START - Cookbook
sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [15:44:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:44:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [15:48:59] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:52:34] (CR) Herron: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1022028 (https://phabricator.wikimedia.org/T362989) (owner: Cwhite) [15:52:52] (CR) Cwhite: [C:+2] prometheus: promote prometheus1006 to pushgateway duty [puppet] - https://gerrit.wikimedia.org/r/1022028 (https://phabricator.wikimedia.org/T362989) (owner: Cwhite) [15:55:02] (CR) Cwhite: [C:+2] promote prometheus1006 as pushgateway primary [dns] - https://gerrit.wikimedia.org/r/1022027 (https://phabricator.wikimedia.org/T362989) (owner: Cwhite) [15:55:59] (PS1) Pppery: Phabricator: Delete chatlog group [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1022053 [15:56:36] (PS2) Pppery: Phabricator: Delete chatlog group [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) [15:58:27] (CR) Clément Goubert: [C:+2] mw-web, mw-api-ext: Raise replicas for 75% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1021904 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert) [15:59:17] (Merged) jenkins-bot: mw-web, mw-api-ext: Raise replicas for 75% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1021904 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert) [16:00:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [16:01:01] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [16:01:08] !log cgoubert@deploy1002 helmfile [eqiad] START
helmfile.d/services/mw-web: apply [16:01:15] ops-eqiad, SRE, Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9729922 (herron) Resolved→Open Reopening -- today we experienced a memory issue on prometheus1005 which presumably relates to this maintenance. Could we arrange to... [16:01:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:01:31] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [16:01:49] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [16:01:54] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:02:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:08:02] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1022029 [16:08:33] (PS1) Clément Goubert: mw-api-ext: Add 20 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1022062 [16:10:07] (PS2) Clément Goubert: mw-api-ext: Add 20 replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1022062 [16:10:15] (CR) Elukey: [C:+1] "typo in the commit msg :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1021871 (owner: JMeybohm) [16:11:52] SRE, DNS, Traffic: Authenticating wikimedia.org domain with MailChimp - https://phabricator.wikimedia.org/T362921#9729980 (ssingh) Update is that we will need to add a DKIM record for MailChimp so a patch will follow. Rest everything seems to be in order.
[16:12:30] (ProbeDown) firing: (14) Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:59] (CR) Elukey: [C:+1] "Checked the versions against the current ones in modules" [deployment-charts] - https://gerrit.wikimedia.org/r/1021917 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [16:17:24] (Abandoned) Arlolra: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1019780 (owner: PipelineBot) [16:17:30] (ProbeDown) firing: (14) Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:26] (CR) Elukey: Fix mcrouter module to work our of the box from scaffold (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1021918 (https://phabricator.wikimedia.org/T355237) (owner: JMeybohm) [16:23:04] (PS1) CDanis: Update comments on Enterprise IPs in wikimedia_nets [puppet] - https://gerrit.wikimedia.org/r/1022069 [16:24:16] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9730020 (ssingh) Hi @KFrancis: @Lina_Farid_WMDE will require an NDA as well as I don't see their name on the spreadsheet. Thank you as always!
[16:24:23] (CR) Ssingh: [C:+1] Update comments on Enterprise IPs in wikimedia_nets [puppet] - https://gerrit.wikimedia.org/r/1022069 (owner: CDanis) [16:27:33] (CR) Clément Goubert: [C:+1] Update comments on Enterprise IPs in wikimedia_nets [puppet] - https://gerrit.wikimedia.org/r/1022069 (owner: CDanis) [16:36:50] (CR) CDanis: [C:+2] Update comments on Enterprise IPs in wikimedia_nets [puppet] - https://gerrit.wikimedia.org/r/1022069 (owner: CDanis) [16:40:21] (PS1) Ssingh: wikimedia.org: add DKIM records for Mailchimp [dns] - https://gerrit.wikimedia.org/r/1022075 (https://phabricator.wikimedia.org/T362921) [16:46:50] (CR) Ssingh: "The records are from the dashboard but you can find them at https://mailchimp.com/help/set-up-email-domain-authentication/ for confirmatio" [dns] - https://gerrit.wikimedia.org/r/1022075 (https://phabricator.wikimedia.org/T362921) (owner: Ssingh) [16:51:01] (PS13) Pppery: Update the PHP files Phabricator reads to show the latest translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) [16:52:46] (PS14) Pppery: Update the PHP files Phabricator reads to show the latest translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) [16:54:35] (CR) Pppery: "(sorry that this ended up so huge - the reason is that https://gerrit.wikimedia.org/r/c/phabricator/translations/+/1015960 / T360861 meant" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763) (owner: Pppery) [16:56:02] (PS1) CDanis: Disable Enterprise bypassing CDN rate limits [puppet] - https://gerrit.wikimedia.org/r/1022092 [16:57:36] (CR) Ssingh: [C:+1] "Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1022062 (owner: Clément Goubert) [16:58:23] (CR) Ssingh: [C:+1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1022092 (owner: CDanis) [17:00:55] (PS1) Cathal Mooney: Reverses for 3 new network connections in magru [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) [17:01:50] (CR) CI reject: [V:-1] Reverses for 3 new network connections in magru [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [17:04:41] (PS11) CDobbins: purged: add PKI cert handling [puppet] - https://gerrit.wikimedia.org/r/1019866 [17:05:13] (PS12) CDobbins: purged: add PKI cert handling [puppet] - https://gerrit.wikimedia.org/r/1019866 [17:07:10] (CR) CDobbins: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1019866 (owner: CDobbins) [17:08:37] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9730187 (ssingh) @Lina_Farid_WMDE: to speed up things, you can also send an email to @KFrancis ( https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) ; kfrancis@wikimedia.org) fr... [17:48:45] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:49:19] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:55:20] ops-magru, DC-Ops, Infrastructure-Foundations, netops, Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9730356 (ssingh) Thanks @cmooney, looks good! One small update to the above since we will most likely transpose these to `hieradata/common/lvs/i...
[17:58:57] !log sudo cookbook -d sre.dns.netbox "test" [17:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:40] (CR) Ssingh: [C:+1] "--- /dev/null" [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [18:01:58] SRE, serviceops: Refactor memcached modules - https://phabricator.wikimedia.org/T284454#9730381 (Dzahn) p:Triage→Low setting priority to low - to get it out of "untriaged incoming SRE tickets list". Just guessing based on the way the ticket is phrased and the age of it. Of course not trying to te... [18:06:11] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9730390 (KFrancis) @Lina_Farid_WMDE Hello Lina! Please send me your MWDE email address to kfrancis@wikimedia.org and I'll get the agreement out to you to sign. Thanks! [18:07:45] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on mr1-ulsfo,mr1-ulsfo IPv6,mr1-ulsfo.oob,mr1-ulsfo.oob IPv6 with reason: disabling oob link on mr1-ulsfo to stop the SSH attempts long enough to get a homer run in [18:08:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on mr1-ulsfo,mr1-ulsfo IPv6,mr1-ulsfo.oob,mr1-ulsfo.oob IPv6 with reason: disabling oob link on mr1-ulsfo to stop the SSH attempts long enough to get a homer run in [18:08:11] SRE, Infrastructure-Foundations, netops, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9730419 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2c797c95-485f-45b4-85c7-e8514173ae11) set by cmooney@cumin1002 for 0:20:00 on 4 host(s) and their se...
[18:18:34] (PS4) Ryan Kemper: wdqs.data-transfer: fix netbox object not callable [cookbooks] - https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) [18:24:03] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:24:10] (CR) CDobbins: [V:+1] "Thank you for the feedback!" [puppet] - https://gerrit.wikimedia.org/r/1019866 (owner: CDobbins) [18:26:08] (PS3) JMeybohm: _scaffold: Don't include tag in image_name preset responses [deployment-charts] - https://gerrit.wikimedia.org/r/1021871 [18:26:08] (PS2) JMeybohm: New module versions [deployment-charts] - https://gerrit.wikimedia.org/r/1021917 (https://phabricator.wikimedia.org/T362978) [18:26:08] (PS2) JMeybohm: Fix mcrouter module to work out of the box from scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/1021918 (https://phabricator.wikimedia.org/T355237) [18:26:09] (PS2) JMeybohm: modules: Add restrictedSecurityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1021919 (https://phabricator.wikimedia.org/T362978) [18:28:12] (CR) Bking: [C:+1] wdqs.data-transfer: fix netbox object not callable [cookbooks] - https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper) [18:29:08] (CR) Ryan Kemper: [C:+2] wdqs.data-transfer: fix netbox object not callable [cookbooks] - https://gerrit.wikimedia.org/r/1021588 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper) [18:32:14] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding more reverse v6 INCLUDES into dns for magru transport links - cmooney@cumin1002" [18:32:28] (CR) Cathal Mooney: "recheck" [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [18:33:18] !log cmooney@cumin1002 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding more reverse v6 INCLUDES into dns for magru transport links - cmooney@cumin1002" [18:33:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:33:41] (CR) Cathal Mooney: [C:+2] Reverses for 3 new network connections in magru [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [18:33:53] (PS2) Cathal Mooney: Reverses for 3 new network connections in magru [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) [18:34:36] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T362508, journal in uncertain state) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards [18:34:53] (CR) Cathal Mooney: [V:+2 C:+2] Reverses for 3 new network connections in magru [dns] - https://gerrit.wikimedia.org/r/1022098 (https://phabricator.wikimedia.org/T362421) (owner: Cathal Mooney) [18:34:54] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [18:37:30] (ProbeDown) firing: (12) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:38:51] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:47:30] (ProbeDown) firing: (12) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:34] !log [WDQS] T363004 Restarted wdqs2010 and wdqs2024 to clear out their in-application-memory ban lists [18:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:45] T363004: Investigate WDQS ProbeDown alerts - https://phabricator.wikimedia.org/T363004 [18:52:30] (ProbeDown) firing: (12) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:52:50] (PS1) Dwisehaupt: crm: Shift http web_root and site_name to hiera lookups [puppet] - https://gerrit.wikimedia.org/r/1022145 (https://phabricator.wikimedia.org/T343486) [18:54:56] (CR) Dwisehaupt: "Ran across an issue moving from wmcloud to prod VPS where something was hardcoded.
This shifts it to a lookup so we can override in cloud " [puppet] - https://gerrit.wikimedia.org/r/1022145 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt) [18:57:30] (ProbeDown) resolved: (6) Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:03:51] (JobUnavailable) firing: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:07:10] (PS1) JHathaway: WIP: ci-test [puppet] - https://gerrit.wikimedia.org/r/1022154 [19:07:30] (ProbeDown) firing: (6) Service wdqs2018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:08:02] (CR) Dzahn: [C:+1] "yea, very good to turn those into lookups. and nowadays the style guide is also fine with default values.
I think a long time ago the gol" [puppet] - 10https://gerrit.wikimedia.org/r/1022145 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:09:19] (03PS2) 10JHathaway: WIP: ci-test [puppet] - 10https://gerrit.wikimedia.org/r/1022154 [19:09:52] (03CR) 10CI reject: [V:04-1] WIP: ci-test [puppet] - 10https://gerrit.wikimedia.org/r/1022154 (owner: 10JHathaway) [19:12:30] (ProbeDown) resolved: (4) Service wdqs2018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:14:42] (03CR) 10Dzahn: [C:03+1] "@Dwisehaupt You may be already aware but wanted to mention in cloud you can set the Hiera values either in the repo (puppet/hieradata/clou" [puppet] - 10https://gerrit.wikimedia.org/r/1022145 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:15:03] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1022145/2066/crm2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1022145 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:15:59] !log [WDQS] T363004 Restarted wdqs2012 to clear out its in-application-memory ban lists (it had pybal's twisted user agent banned) [19:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:04] T363004: Investigate WDQS ProbeDown alerts - https://phabricator.wikimedia.org/T363004 [19:18:51] (SystemdUnitFailed) firing: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:19] (03PS1) 10JMeybohm: modules: Add restrictedSecurityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022161 
(https://phabricator.wikimedia.org/T362978) [19:21:20] (03Abandoned) 10JMeybohm: modules: Add restrictedSecurityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021919 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [19:24:03] (03PS3) 10JMeybohm: eventgate: Update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) [19:24:03] (03PS2) 10JMeybohm: eventgate-*: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019018 (https://phabricator.wikimedia.org/T359423) [19:24:03] (03PS1) 10JMeybohm: eventgate: Add securityContext for all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022164 (https://phabricator.wikimedia.org/T362978) [19:24:57] (03CR) 10CI reject: [V:04-1] eventgate: Add securityContext for all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022164 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [19:25:33] (03PS2) 10JMeybohm: eventgate: Add securityContext for all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022164 (https://phabricator.wikimedia.org/T362978) [19:26:05] (03CR) 10JMeybohm: _scaffold: Don't include tag in image_name preset responses (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021871 (owner: 10JMeybohm) [19:26:14] (03CR) 10JMeybohm: Fix mcrouter module to work out of the box from scaffold (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021918 (https://phabricator.wikimedia.org/T355237) (owner: 10JMeybohm) [19:35:24] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:53] (03CR) 10Btullis: [C:03+1] "Thanks" [puppet] 
- 10https://gerrit.wikimedia.org/r/1021899 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [19:41:17] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:49:45] !log jforrester@deploy1002 Started deploy [integration/docroot@c090350]: I1c1c2564d5e78483c766f77ae4c4c74b14578493 trivial CI fix [19:49:51] !log jforrester@deploy1002 Finished deploy [integration/docroot@c090350]: I1c1c2564d5e78483c766f77ae4c4c74b14578493 trivial CI fix (duration: 00m 06s) [19:50:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:51:38] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:52:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 36.16% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:55:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:56:36] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T362508, journal in uncertain state) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards [19:56:42] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [19:57:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.79% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:58:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 37.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:03:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 35.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:10:19] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:12:20] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:12:47] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:15:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:34] 06SRE, 10Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once 
by recipient - https://phabricator.wikimedia.org/T345691#9730713 (10hashar) I still have the issue when I post MediaWiki train related messages to both `ops-l` and `wikitech-l`. I then end up wondering whether the ema... [20:21:36] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:22:07] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:39:11] 06SRE, 10Wikimedia-Mailing-lists: Cross post to multiple mailling lists is only received once by recipient - https://phabricator.wikimedia.org/T345691#9730725 (10Dzahn) Sending a single mail to multiple lists at once have always been discouraged though, to be honest. [20:44:04] (03CR) 10Dzahn: [V:03+1 C:03+2] "[crm2001:/etc/apache2/sites-enabled] $ grep ServerName 50-community-crm.conf" [puppet] - 10https://gerrit.wikimedia.org/r/1022145 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [20:53:18] Is there anyone around who can help with stuck global renames? [20:55:04] hi [20:55:16] T362941 and T362942 [20:55:17] T362941: Unblock stuck global rename of Gzsimonfbi to Renamed user 2409354752759 - https://phabricator.wikimedia.org/T362941 [20:55:17] T362942: Unblock stuck global rename of Kou.i5h to Renamed user 8356771833137 - https://phabricator.wikimedia.org/T362942 [20:55:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:57:22] * taavi looks [21:00:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:02:35] !log taavi@mwmaint1002 ~ $ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=eowiki --logwiki=metawiki 
'Gzsimonfbi' 'Renamed user 2409354752759' # T362941 [21:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:39] T362941: Unblock stuck global rename of Gzsimonfbi to Renamed user 2409354752759 - https://phabricator.wikimedia.org/T362941 [21:02:44] (03PS1) 10Dzahn: stewards: create a local git repo for user db data [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) [21:03:30] !log taavi@mwmaint1002 ~ $ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki Kou.i5h 'Renamed user 8356771833137' # T362942 [21:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:36] T362942: Unblock stuck global rename of Kou.i5h to Renamed user 8356771833137 - https://phabricator.wikimedia.org/T362942 [21:04:06] JJMC89: unstucked both. not sure what happened there, no errors that I could see [21:04:22] (03CR) 10Dzahn: "code taken mostly from modules/puppetmaster/manifests/gitclone.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1022170 (https://phabricator.wikimedia.org/T361547) (owner: 10Dzahn) [21:04:42] thanks taavi - not sure either - they were the only two of 100+ I did yesterday [21:08:11] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:11:01] (03PS1) 10Dzahn: stewards: move patches to parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 [21:11:23] (03CR) 10CI reject: [V:04-1] stewards: move patches to parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 (owner: 10Dzahn) [21:11:27] (03PS2) 10Dzahn: stewards: move pathes to parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 [21:11:48] (03CR) 10CI reject: [V:04-1] stewards: move pathes to 
parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 (owner: 10Dzahn) [21:12:56] (RoutinatorRsyncErrors) resolved: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:13:45] (03PS3) 10Dzahn: stewards: move pathes to parameters with lookups [puppet] - 10https://gerrit.wikimedia.org/r/1022177 [21:13:51] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:23:45] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9730823 (10Dzahn) >>! In T351202#968... 
[21:36:18] (03PS3) 10JHathaway: WIP: ci-test [puppet] - 10https://gerrit.wikimedia.org/r/1022154 [21:37:34] (03CR) 10JHathaway: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1022154 (owner: 10JHathaway) [21:42:08] (03PS4) 10JHathaway: puppetmaster: fix failing specs [puppet] - 10https://gerrit.wikimedia.org/r/1022154 [21:42:24] (03PS1) 10Dzahn: lists: start a class for automating certain subscriptions [puppet] - 10https://gerrit.wikimedia.org/r/1022193 (https://phabricator.wikimedia.org/T351202) [21:43:08] (03CR) 10JHathaway: [C:03+2] puppetmaster: fix failing specs [puppet] - 10https://gerrit.wikimedia.org/r/1022154 (owner: 10JHathaway) [21:44:56] (03PS4) 10Btullis: Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) [21:45:12] (03PS2) 10Dzahn: lists: start a class for automating certain subscriptions [puppet] - 10https://gerrit.wikimedia.org/r/1022193 (https://phabricator.wikimedia.org/T351202) [21:45:59] (03CR) 10JHathaway: "rebased, after fixing the puppetmaster spec, I8e5d475e99a64db22a03e0f7c02905f34caa73d4" [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [21:48:26] (03CR) 10JHathaway: "Why create a separate script, rather than querying puppetdb in puppet and generating the file during the puppet run?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [21:49:14] (03CR) 10Dzahn: [C:03+1] Add the verbose flag to the geoipupdate command [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [21:49:45] (03CR) 10JHathaway: [C:03+1] "looks good, thanks" [dns] - 10https://gerrit.wikimedia.org/r/1022075 (https://phabricator.wikimedia.org/T362921) (owner: 10Ssingh) [22:11:16] !oncall [22:11:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:20:24] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:54] (03CR) 10Dzahn: [C:03+1] "seems good - also https://phabricator.wikimedia.org/T358268#9730932" [puppet] - 10https://gerrit.wikimedia.org/r/1021901 (https://phabricator.wikimedia.org/T358268) (owner: 10Btullis) [22:30:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [22:30:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [22:31:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:31:26] btullis: I think the geoipupdate command knows when there was no change and then doesn't overwrite the file [22:31:45] the files from Apr 17 and Apr 19 have 
identical checksums [22:32:36] my guess is the timestamp only means when was the last time there was an actual new release of the DBs, not last time it checked for updates [22:36:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:41:39] (03PS1) 10Dzahn: codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 [22:42:00] (03CR) 10CI reject: [V:04-1] codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (owner: 10Dzahn) [22:45:55] (03PS2) 10Dzahn: codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 [22:46:18] (03CR) 10CI reject: [V:04-1] codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (owner: 10Dzahn) [22:46:26] (RoutinatorRsyncErrors) resolved: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:47:08] (03PS3) 10Dzahn: codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 [22:47:34] (03CR) 10CI reject: [V:04-1] codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (owner: 10Dzahn) [22:48:45] (03PS4) 10Dzahn: codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 [22:48:51] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:25] (03PS5) 10Dzahn: codsearch: use thirdparty-ci repo to get docker-ce on buster [puppet] - 10https://gerrit.wikimedia.org/r/1022215 (https://phabricator.wikimedia.org/T362518) [23:03:51] (JobUnavailable) firing: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:37:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1022031 [23:38:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1022031 (owner: 10TrainBranchBot) [23:54:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T352010)', diff saved to https://phabricator.wikimedia.org/P61029 and previous config saved to /var/cache/conftool/dbconfig/20240419-235405-ladsgroup.json [23:54:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010