[00:01:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:09:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T342617)', diff saved to https://phabricator.wikimedia.org/P50169 and previous config saved to /var/cache/conftool/dbconfig/20230808-000859-ladsgroup.json
[00:09:04] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:24:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P50170 and previous config saved to /var/cache/conftool/dbconfig/20230808-002405-ladsgroup.json
[00:38:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945839
[00:38:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945839 (owner: TrainBranchBot)
[00:39:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P50171 and previous config saved to /var/cache/conftool/dbconfig/20230808-003911-ladsgroup.json
[00:45:43] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T342617)', diff saved to https://phabricator.wikimedia.org/P50172 and previous config saved to /var/cache/conftool/dbconfig/20230808-005418-ladsgroup.json
[00:54:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[00:54:22] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:54:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[00:54:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T342617)', diff saved to https://phabricator.wikimedia.org/P50173 and previous config saved to /var/cache/conftool/dbconfig/20230808-005439-ladsgroup.json
[00:58:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945839 (owner: TrainBranchBot)
[01:03:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T343774 (phaultfinder)
[01:40:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T342617)', diff saved to https://phabricator.wikimedia.org/P50174 and previous config saved to /var/cache/conftool/dbconfig/20230808-014007-ladsgroup.json
[01:40:12] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:53:29] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclients: add allvolumes() shortcut [puppet] - https://gerrit.wikimedia.org/r/946642
[01:53:31] <wikibugs>	 (PS1) Andrew Bogott: WIP: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[01:54:44] <wikibugs>	 (CR) Andrew Bogott: [C: +2] mwopenstackclients: add allvolumes() shortcut [puppet] - https://gerrit.wikimedia.org/r/946642 (owner: Andrew Bogott)
[01:55:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P50175 and previous config saved to /var/cache/conftool/dbconfig/20230808-015513-ladsgroup.json
[01:56:10] <wikibugs>	 (CR) CI reject: [V: -1] WIP: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643 (owner: Andrew Bogott)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0200)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P50176 and previous config saved to /var/cache/conftool/dbconfig/20230808-021020-ladsgroup.json
[02:18:41] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:51] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:25:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T342617)', diff saved to https://phabricator.wikimedia.org/P50177 and previous config saved to /var/cache/conftool/dbconfig/20230808-022526-ladsgroup.json
[02:25:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[02:25:30] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:25:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[02:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T342617)', diff saved to https://phabricator.wikimedia.org/P50178 and previous config saved to /var/cache/conftool/dbconfig/20230808-022547-ladsgroup.json
[02:30:57] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:07] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:33] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0300)
[03:02:15] <wikibugs>	 (PS2) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[03:02:17] <wikibugs>	 (PS1) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644
[03:05:29] <wikibugs>	 (CR) CI reject: [V: -1] add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643 (owner: Andrew Bogott)
[03:05:33] <wikibugs>	 (CR) CI reject: [V: -1] wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644 (owner: Andrew Bogott)
[03:09:26] <wikibugs>	 (PS2) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644
[03:09:28] <wikibugs>	 (PS3) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[03:10:24] <wikibugs>	 (PS3) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644
[03:10:26] <wikibugs>	 (PS4) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[03:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:25:54] <wikibugs>	 (PS5) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[03:27:56] <wikibugs>	 (PS1) Anzx: Update piwiki legacy vector logo [mediawiki-config] - https://gerrit.wikimedia.org/r/946645 (https://phabricator.wikimedia.org/T305950)
[03:32:21] <wikibugs>	 (PS1) Anzx: Update idwiktionary old vector logo [mediawiki-config] - https://gerrit.wikimedia.org/r/946666 (https://phabricator.wikimedia.org/T341175)
[03:35:48] <wikibugs>	 (CR) Andrew Bogott: "I've only tested this with --noop but it seems like it should work..." [puppet] - https://gerrit.wikimedia.org/r/946643 (owner: Andrew Bogott)
[04:23:43] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:19] <icinga-wm>	 PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:43:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:45:54] <wikibugs>	 SRE, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Kelson)
[05:47:47] <wikibugs>	 SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (ayounsi)
[05:48:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:50:19] <wikibugs>	 SRE, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Kelson) @akosiaris @MSantos May I underline Vadim's request: clarifying if we (at Kiwix) can still benefit from the `mobile-sect...
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0600)
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0600).
[06:08:25] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:17] <wikibugs>	 (PS2) Elukey: admin_ng: allow host headers for base domain in istio mesh configs [deployment-charts] - https://gerrit.wikimedia.org/r/946593 (https://phabricator.wikimedia.org/T343740)
[06:22:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:44] <wikibugs>	 (CR) Giuseppe Lavagetto: "Overall lgtm, I have one general usability doubt." [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[06:36:58] <wikibugs>	 (PS1) Stevemunene: Add datahub_staging cname [dns] - https://gerrit.wikimedia.org/r/946851 (https://phabricator.wikimedia.org/T343236)
[06:38:27] <wikibugs>	 (PS1) KartikMistry: testwiki: Enable Section Translation for 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211)
[06:51:49] <wikibugs>	 (CR) Vgutierrez: "change makes sense but now the puppetization is inconsistent. ip_reputation can be enabled for an upload node but it would be a NOOP in te" [puppet] - https://gerrit.wikimedia.org/r/946566 (owner: Giuseppe Lavagetto)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0700)
[07:00:04] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:47] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: allow host headers for base domain in istio mesh configs [deployment-charts] - https://gerrit.wikimedia.org/r/946593 (https://phabricator.wikimedia.org/T343740) (owner: Elukey)
[07:03:03] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:06:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:06:35] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:07:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:07:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:07:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:07:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:25:35] <wikibugs>	 (PS3) Giuseppe Lavagetto: cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - https://gerrit.wikimedia.org/r/946566
[07:28:20] <wikibugs>	 (PS1) Ayounsi: BGPalerter: mute software-update notifications [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600)
[07:29:54] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42796/console" [puppet] - https://gerrit.wikimedia.org/r/946566 (owner: Giuseppe Lavagetto)
[07:30:20] <wikibugs>	 (PS2) Ayounsi: BGPalerter: mute software-update notifications [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600)
[07:31:07] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: Ayounsi)
[07:31:37] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +1] cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - https://gerrit.wikimedia.org/r/946566 (owner: Giuseppe Lavagetto)
[07:32:00] <wikibugs>	 (PS3) Ayounsi: BGPalerter: mute software-update notifications [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600)
[07:32:13] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: Ayounsi)
[07:34:34] <wikibugs>	 (PS4) Giuseppe Lavagetto: cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - https://gerrit.wikimedia.org/r/946566
[07:34:45] <wikibugs>	 (PS4) Ayounsi: BGPalerter: mute software-update notifications [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600)
[07:35:02] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: Ayounsi)
[07:52:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:25] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:11] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:05:35] <icinga-wm>	 PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:51] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:39] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:10:48] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:14:12] <jynus>	 cannot ssh to ml-serve2004, either a host or a network issue
[08:18:09] <jynus>	 probably host, as I see no cmd line on the mgmt
[08:21:20] <jynus>	 14 minutes is too long to wait for a reboot, and elukey is not around, so I am going to force a soft power restart
[08:21:46] <jynus>	 oh, he is
[08:21:57] <jynus>	 so waiting for his ok, maybe it is just maintenance
[08:22:00] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Maybe it makes sense to create a dedicated task to discuss the general usage and policies for developer account naming? The...
[08:24:25] <wikibugs>	 (CR) Btullis: Add datahub_staging cname (2 comments) [dns] - https://gerrit.wikimedia.org/r/946851 (https://phabricator.wikimedia.org/T343236) (owner: Stevemunene)
[08:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T342617)', diff saved to https://phabricator.wikimedia.org/P50179 and previous config saved to /var/cache/conftool/dbconfig/20230808-082539-ladsgroup.json
[08:25:44] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[08:30:16] <elukey>	 jynus: checking thanks :)
[08:30:21] <elukey>	 (I was afk)
[08:31:28] <jynus>	 no problem, in fact only because you were around I didn't take further action
[08:31:44] <jynus>	 let me know if a reboot is needed
[08:33:04] <elukey>	 !log powercycle ml-serve2004 - mgmt console without tty available, DIMM errors in getsel
[08:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:18] <elukey>	 jynus: kicked off one, there are some DIMM errors in getsel though, not great
[08:33:26] <jynus>	 :-(
[08:33:36] <jynus>	 yeah, it looked like it was stuck
[08:36:19] <icinga-wm>	 RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 34.33 ms
[08:37:43] <jynus>	 my fear was that you were doing a reimage or some other maintenance, so I chose to wait for your feedback, rather than a small chance of ruining something
[08:37:57] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:38:18] <elukey>	 jynus: thanks!
[08:38:34] <elukey>	 for the moment let's see if it was a transient issue or not, in case of another freeze I'll involve dcops
[08:39:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:40:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P50180 and previous config saved to /var/cache/conftool/dbconfig/20230808-084045-ladsgroup.json
[08:40:49] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:41:58] <wikibugs>	 (CR) Giuseppe Lavagetto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/946566 (owner: Giuseppe Lavagetto)
[08:44:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:45:32] <jynus>	 !log restart debmonitor2003 services
[08:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:47] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:03] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:46:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:19] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:46:43] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:47:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - https://gerrit.wikimedia.org/r/946566 (owner: Giuseppe Lavagetto)
[08:47:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:47:51] <wikibugs>	 (PS1) Majavah: base: set Precedence: Bulk header in notify_maintainers [puppet] - https://gerrit.wikimedia.org/r/946924
[08:50:51] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:51:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: Revert "cache: move vendor proxy lookup to cluster_fe_ratelimit" [puppet] - https://gerrit.wikimedia.org/r/946647
[08:51:11] <wikibugs>	 (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/946924 (owner: Majavah)
[08:51:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Revert "cache: move vendor proxy lookup to cluster_fe_ratelimit" [puppet] - https://gerrit.wikimedia.org/r/946647 (owner: Giuseppe Lavagetto)
[08:51:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:51:53] <_joe_>	 cdanis: can I merge your change?
[08:51:57] <_joe_>	 err dcaro 
[08:52:02] <_joe_>	 sorry cdanis :)
[08:52:05] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:52:13] <dcaro>	 _joe_: yep :)
[08:52:17] <dcaro>	 thanks
[08:52:17] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:52:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:52:41] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:52:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T342617)', diff saved to https://phabricator.wikimedia.org/P50181 and previous config saved to /var/cache/conftool/dbconfig/20230808-085255-ladsgroup.json
[08:53:01] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[08:55:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P50182 and previous config saved to /var/cache/conftool/dbconfig/20230808-085551-ladsgroup.json
[08:57:23] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache: expand ip reputation lookup cases [puppet] - https://gerrit.wikimedia.org/r/946925
[09:01:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] cache: expand ip reputation lookup cases [puppet] - https://gerrit.wikimedia.org/r/946925 (owner: Giuseppe Lavagetto)
[09:08:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P50183 and previous config saved to /var/cache/conftool/dbconfig/20230808-090801-ladsgroup.json
[09:10:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T342617)', diff saved to https://phabricator.wikimedia.org/P50184 and previous config saved to /var/cache/conftool/dbconfig/20230808-091058-ladsgroup.json
[09:11:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[09:11:02] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[09:11:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[09:11:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T342617)', diff saved to https://phabricator.wikimedia.org/P50185 and previous config saved to /var/cache/conftool/dbconfig/20230808-091119-ladsgroup.json
[09:15:52] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: fix pathing for knowledge-gap [deployment-charts] - https://gerrit.wikimedia.org/r/945806 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[09:15:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:16:13] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:16:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:16:31] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:16:51] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:18:19] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: fix pathing for knowledge-gap [deployment-charts] - https://gerrit.wikimedia.org/r/945806 (https://phabricator.wikimedia.org/T342213) (owner: Hnowlan)
[09:22:03] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:22:35] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:23:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P50186 and previous config saved to /var/cache/conftool/dbconfig/20230808-092308-ladsgroup.json
[09:23:47] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:24:05] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:24:25] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:38:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T342617)', diff saved to https://phabricator.wikimedia.org/P50187 and previous config saved to /var/cache/conftool/dbconfig/20230808-093814-ladsgroup.json
[09:38:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[09:38:18] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[09:38:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[09:38:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T342617)', diff saved to https://phabricator.wikimedia.org/P50188 and previous config saved to /var/cache/conftool/dbconfig/20230808-093835-ladsgroup.json
[09:43:58] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[09:44:23] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[09:44:39] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:49] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:45:09] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:45:33] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:47:13] <wikibugs>	 (03PS1) 10Volans: admin: add email to user maryana [puppet] - 10https://gerrit.wikimedia.org/r/946927 (https://phabricator.wikimedia.org/T342797)
[09:48:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10darthmon_wmde) a:03roti_WMDE
[09:50:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: add email to user maryana [puppet] - 10https://gerrit.wikimedia.org/r/946927 (https://phabricator.wikimedia.org/T342797) (owner: 10Volans)
[09:51:43] <wikibugs>	 (03CR) 10Volans: [C: 03+2] admin: add email to user maryana [puppet] - 10https://gerrit.wikimedia.org/r/946927 (https://phabricator.wikimedia.org/T342797) (owner: 10Volans)
[09:52:11] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[09:52:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[09:55:15] <wikibugs>	 (03PS2) 10Hnowlan: trafficserver: route wikifeeds requests via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119)
[09:59:16] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route knowledge-gap path via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/946928 (https://phabricator.wikimedia.org/T342213)
[10:00:00] <volans>	 !log restart ferm on mirror1001 to pick new IP address for debian syncproxy2
[10:00:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1000)
[10:01:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:02:44] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: enable ldap group sync on active GitLab server [puppet] - 10https://gerrit.wikimedia.org/r/945612 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto)
[10:04:15] <wikibugs>	 (03PS2) 10Filippo Giunchedi: admin: add adri to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946515 (https://phabricator.wikimedia.org/T342969)
[10:06:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:08:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add adri to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946515 (https://phabricator.wikimedia.org/T342969) (owner: 10Filippo Giunchedi)
[10:10:09] <wikibugs>	 (03PS2) 10Hnowlan: rest-gateway: add availability route [deployment-charts] - 10https://gerrit.wikimedia.org/r/945784 (https://phabricator.wikimedia.org/T339119)
[10:11:17] <wikibugs>	 (03Abandoned) 10Hnowlan: rest-gateway: add availability route [deployment-charts] - 10https://gerrit.wikimedia.org/r/945784 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan)
[10:13:42] <wikibugs>	 (03PS1) 10Elukey: admin_ng: increase cpu limits for knative-serving pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/946929
[10:15:27] <icinga-wm>	 RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[10:15:47] <wikibugs>	 (03PS2) 10Elukey: admin_ng: change cpu limits for knative-serving pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/946929
[10:16:47] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Access will be live in 30min, I'm optimistically resolving the task though please reopen if sth is amiss!
[10:21:20] <taavi>	 !log update T343294 mitigations
[10:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "Untested and I don't think it'll work as-is, though definitely +1 on the idea" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (owner: 10Herron)
[10:33:19] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 (owner: 10Clément Goubert)
[10:33:33] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 (owner: 10Clément Goubert)
[10:34:35] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 (owner: 10Clément Goubert)
[10:36:17] <claime>	 !log deploying mw-on-k8s - https://gerrit.wikimedia.org/r/945798
[10:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:48] <wikibugs>	 (03PS11) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748)
[10:42:24] <wikibugs>	 (03CR) 10Clément Goubert: mediawiki: set requests based on php.workers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert)
[10:50:21] <wikibugs>	 (03PS1) 10Samtar: IS: Ensure edit recovery is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858)
[10:55:24] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+1] IS: Ensure edit recovery is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar)
[10:56:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10MatthewVernon) This needs approval by @mark or @joanna_borun (per `data.yaml`), I think. So I've tagged them to approve (or otherwise) this request :)
[11:00:53] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:05:23] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10joanna_borun) Apologies for the delay in completing this task. Our Infrastructure Foundations team is currently in the process of evaluating the global root access policy and po...
[11:16:37] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:59] <icinga-wm>	 PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[11:37:07] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Allow HTTP return traffic from apt to network devices on TCP 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942639 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney)
[11:38:01] <wikibugs>	 (03Merged) 10jenkins-bot: Allow HTTP return traffic from apt to network devices on TCP 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942639 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney)
[11:41:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:41:37] <claime>	 ^checking
[11:42:55] <claime>	 Sharp fall in rps coinciding
[11:43:10] <claime>	 It's not worker saturation
[11:45:47] <TheresNoTime>	 jouncebot: nowandnext
[11:45:47] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 14 minute(s)
[11:45:47] <jouncebot>	 In 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1200)
[11:46:37] <TheresNoTime>	 claime: was going to deploy a prod no-op (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946932), want me to hold off?
[11:47:26] <claime>	 TheresNoTime: Yeah, please
[11:47:52] <TheresNoTime>	 okay :)
[11:47:56] <claime>	 Great I love when my firefox crashes in the middle of debugging something
[11:51:14] <claime>	 I'm not seeing anything particularly flagrant rn
[11:51:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:51:21] <claime>	 It's recovering
[11:52:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:05] <claime>	 My firefox keeps crashing on grafana >_>
[11:59:20] <claime>	 I'm not finding anything, TheresNoTime go ahead with your deployment, I'll keep digging in the logs
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1200)
[12:01:42] * TheresNoTime will wait for that window ^ to start/finish
[12:01:51] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH)
[12:06:27] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:04] <wikibugs>	 (03PS9) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[12:14:37] <wikibugs>	 (03PS5) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033)
[12:15:39] <wikibugs>	 (03CR) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:17:26] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH)
[12:18:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: change cpu limits for knative-serving pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/946929 (owner: 10Elukey)
[12:23:00] <wikibugs>	 (03PS10) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[12:24:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:25:06] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:25:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[12:26:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:26:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[12:28:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:28:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:30:01] <wikibugs>	 (03PS1) 10JMeybohm: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033)
[12:30:19] <TheresNoTime>	 jouncebot: nowandnext
[12:30:19] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1200)
[12:30:19] <jouncebot>	 In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1300)
[12:30:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:30:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar)
[12:30:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:31:28] <wikibugs>	 (03PS11) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[12:31:37] <wikibugs>	 (03Merged) 10jenkins-bot: IS: Ensure edit recovery is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar)
[12:32:01] <wikibugs>	 (03PS2) 10Stang: newiki: Fix templateeditor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257)
[12:32:02] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:946932|IS: Ensure edit recovery is disabled (T342858)]]
[12:32:05] <stashbot>	 T342858: Enable edit recovery on en.wikipedia.beta - https://phabricator.wikimedia.org/T342858
[12:33:19] <wikibugs>	 (03CR) 10David Caro: "Now tested in toolsbeta 😊" [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:34:22] <logmsgbot>	 !log samtar@deploy1002 samtar: Backport for [[gerrit:946932|IS: Ensure edit recovery is disabled (T342858)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[12:34:28] <logmsgbot>	 !log samtar@deploy1002 samtar: Continuing with sync
[12:34:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:35:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:35:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:36:43] <wikibugs>	 (03PS2) 10JMeybohm: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033)
[12:37:32] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] vrts: send /var/log/{clamav,freshclam}.log to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/945781 (owner: 10AOkoth)
[12:40:20] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:946932|IS: Ensure edit recovery is disabled (T342858)]] (duration: 08m 18s)
[12:40:24] <stashbot>	 T342858: Enable edit recovery on en.wikipedia.beta - https://phabricator.wikimedia.org/T342858
[12:44:09] <wikibugs>	 (03CR) 10Samtar: IS-labs: Enable edit recovery on en.wikipedia.beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942419 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar)
[12:53:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE)
[12:56:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Release 0.9.1-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh)
[12:57:06] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124
[12:57:53] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 (duration: 00m 46s)
[12:58:07] <icinga-wm>	 RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs2003 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[12:58:35] <icinga-wm>	 RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[12:58:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE)
[12:59:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1300)
[13:00:06] <jouncebot>	 koi and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:29] <Lucas_WMDE>	 I’ll know in a few minutes whether I can deploy or not
[13:00:55] <taavi>	 I can deploy
[13:00:59] <aanzx>	 o/
[13:01:01] <koi>	 o/
[13:01:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) Signed L3 and published my ssh key on https://meta.wikimedia.org/wiki/User:Robert_Timm_(WMDE)  Note: I updated the key in this ticket. The key now listed above and on m...
[13:02:40] <sukhe>	 !log reprepro -C main include bookworm-wikimedia anycast-healthchecker_0.9.1-1+wmf12u1_amd64.changes: T342154
[13:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:59] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[13:03:50] <taavi>	 koi: I feel somewhat uneasy changing the group that was added 5(!) years ago without any community discussion
[13:05:19] <koi>	 I think this is a simple bug fix: they added the group per community consensus, but it seems to have been configured incorrectly
[13:05:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wcqs2003:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:06:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946645 (https://phabricator.wikimedia.org/T305950) (owner: 10Anzx)
[13:06:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946666 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx)
[13:07:13] <wikibugs>	 (03Merged) 10jenkins-bot: Update piwiki legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946645 (https://phabricator.wikimedia.org/T305950) (owner: 10Anzx)
[13:07:16] <wikibugs>	 (03Merged) 10jenkins-bot: Update idwiktionary old vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946666 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx)
[13:07:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm
[13:08:10] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:946645|Update piwiki legacy vector logo (T305950)]], [[gerrit:946666|Update idwiktionary old vector logo (T341175)]]
[13:08:14] <stashbot>	 T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175
[13:08:15] <stashbot>	 T305950: Change logo for pi.wikipedia.org back to default - https://phabricator.wikimedia.org/T305950
[13:09:17] <taavi>	 Lucas_WMDE, urbanecm: any thoughts re templateeditor above?
[13:09:42] <logmsgbot>	 !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:946645|Update piwiki legacy vector logo (T305950)]], [[gerrit:946666|Update idwiktionary old vector logo (T341175)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:09:48] <aanzx>	 Testing 
[13:10:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) a:05roti_WMDE→03None
[13:11:31] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:12:08] <aanzx>	 taavi: tested, looks good
[13:12:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans)
[13:12:23] <sukhe>	 BFD/BGP alerts expected in drmrs
[13:12:40] <logmsgbot>	 !log taavi@deploy1002 anzx and taavi: Continuing with sync
[13:12:51] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:12:58] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Release 3.99.0~alpha2-2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/945637 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh)
[13:14:28] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Install hosts: fallback to drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans)
[13:16:00] <wikibugs>	 (03PS1) 10Volans: ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623)
[13:17:31] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623) (owner: 10Volans)
[13:17:47] <wikibugs>	 (03CR) 10Volans: [C: 03+2] ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623) (owner: 10Volans)
[13:17:54] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host durum6001.drmrs.wmnet with OS bookworm
[13:18:21] <wikibugs>	 (03Merged) 10jenkins-bot: ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623) (owner: 10Volans)
[13:18:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm
[13:18:34] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[13:18:40] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[13:18:41] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[13:18:59] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:946645|Update piwiki legacy vector logo (T305950)]], [[gerrit:946666|Update idwiktionary old vector logo (T341175)]] (duration: 10m 48s)
[13:19:03] <stashbot>	 T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175
[13:19:03] <stashbot>	 T305950: Change logo for pi.wikipedia.org back to default - https://phabricator.wikimedia.org/T305950
[13:19:30] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T343774 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact
[13:19:40] <taavi>	 or TheresNoTime, around? I could use a 2O on a config patch
[13:19:48] <TheresNoTime>	 taavi: hi, yes
[13:19:54] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Volans)
[13:19:55] <TheresNoTime>	 which?
[13:20:10] <taavi>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/944983/
[13:20:21] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Add support for knams as PoP in tooling and automation - https://phabricator.wikimedia.org/T340465 (10Volans) 05Open→03Resolved All changes required have been merged, if anything else come up later we can re-open this...
[13:20:44] <taavi>	 it's updating what looks like a mistake in the original patch. but that original patch was 5 years ago, and as far as I can tell this is the first time anyone noticed it, so I'm a bit uneasy deploying that without any further on-wiki discussions
[13:20:44] <TheresNoTime>	 looking, anything in particular you're concerned about?
[13:20:53] <TheresNoTime>	 ah
[13:21:11] <sukhe>	 !log reprepro -C main include bookworm-wikimedia gdnsd_3.99.0~alpha2-2_amd64.changes: T342154
[13:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:14] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[13:21:48] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "LGTM" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[13:22:38] <TheresNoTime>	 taavi: I'd personally be okay with it, seeing as it was a misconfiguration; the patch author should probably make sure it's announced on-wiki though
[13:22:50] <wikibugs>	 (03PS2) 10Ladsgroup: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683)
[13:22:57] <Amir1>	 jouncebot: nowandnext
[13:22:58] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1300)
[13:22:58] <jouncebot>	 In 2 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1600)
[13:23:07] <Amir1>	 please let me know once you're done!
[13:23:16] <taavi>	 will do
[13:23:17] <Amir1>	 (or if I can squeeze a patch :D)
[13:23:35] <taavi>	 TheresNoTime: sounds good, ^ koi: see TNT above
[13:23:51] <wikibugs>	 (03PS3) 10Majavah: newiki: Fix templateeditor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: 10Stang)
[13:24:15] <TheresNoTime>	 just to ensure that if there *are* any objections, they're noticed and can be discussed promptly
[13:24:15] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: Stang)
[13:24:19] <wikibugs>	 (CR) TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: Stang)
[13:24:26] <koi>	 taavi, TheresNoTime, thanks for the msg, will do
[13:25:24] <wikibugs>	 (Merged) jenkins-bot: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:26:07] <wikibugs>	 (Merged) jenkins-bot: newiki: Fix templateeditor config [mediawiki-config] - https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: Stang)
[13:26:13] <wikibugs>	 (PS3) JMeybohm: Update apertium to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033)
[13:26:23] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:944983|newiki: Fix templateeditor config (T343257)]]
[13:26:26] <stashbot>	 T343257: Page protection not showing in Nepali Wiki - https://phabricator.wikimedia.org/T343257
[13:27:51] <logmsgbot>	 !log taavi@deploy1002 taavi and stang: Backport for [[gerrit:944983|newiki: Fix templateeditor config (T343257)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:27:58] <taavi>	 koi: please test
[13:28:06] <koi>	 looking
[13:28:35] <Lucas_WMDE>	 (it turned out I couldn’t deploy after all, sorry)
[13:29:00] <koi>	 taavi, LGTM
[13:29:46] <taavi>	 syncing
[13:29:47] <logmsgbot>	 !log taavi@deploy1002 taavi and stang: Continuing with sync
[13:36:08] <volans>	 !log set platform to null on all devices and VMs in Netbox - T336623
[13:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:12] <stashbot>	 T336623: Netbox device's platform field inconsistency - https://phabricator.wikimedia.org/T336623
[13:36:12] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:944983|newiki: Fix templateeditor config (T343257)]] (duration: 09m 49s)
[13:36:15] <stashbot>	 T343257: Page protection not showing in Nepali Wiki - https://phabricator.wikimedia.org/T343257
[13:36:18] <taavi>	 ok, done
[13:36:21] <taavi>	 Amir1: your turn
[13:36:28] <Amir1>	 awesome
[13:36:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:36:39] <wikibugs>	 (PS3) Ladsgroup: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683)
[13:36:48] <wikibugs>	 (CR) Ladsgroup: [C: +2] Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup)
[13:37:23] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup)
[13:37:32] <wikibugs>	 (Merged) jenkins-bot: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup)
[13:37:46] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:946597|Stop writing to old columns of externallinks in ruwikinews (T342683)]]
[13:37:49] <stashbot>	 T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683
[13:37:58] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[13:39:03] <wikibugs>	 SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (Jhancock.wm) a:Papaul→Jhancock.wm
[13:39:10] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:946597|Stop writing to old columns of externallinks in ruwikinews (T342683)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:41:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:41:49] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[13:43:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage
[13:46:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage
[13:47:47] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:946597|Stop writing to old columns of externallinks in ruwikinews (T342683)]] (duration: 10m 00s)
[13:47:50] <stashbot>	 T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683
[13:49:14] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:50:14] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:52:30] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (hnowlan)
[13:52:39] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[13:53:28] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (hnowlan) Open→Resolved a:jijiki
[13:56:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[13:56:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[13:56:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50189 and previous config saved to /var/cache/conftool/dbconfig/20230808-135636-ladsgroup.json
[13:56:41] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:58:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50190 and previous config saved to /var/cache/conftool/dbconfig/20230808-135847-ladsgroup.json
[14:03:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1082.eqiad.wmnet with OS bullseye
[14:03:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[14:03:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[14:03:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50192 and previous config saved to /var/cache/conftool/dbconfig/20230808-140331-ladsgroup.json
[14:03:38] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:54] <_joe_>	 !log updated conftool, requestctl on puppetmasters to 2.3.1 to fix bugs with requestctl log
[14:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:26] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (hnowlan)
[14:16:33] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:43] <wikibugs>	 (CR) Herron: [V: +1] thanos-fe: switch to cfssl (3 comments) [puppet] - https://gerrit.wikimedia.org/r/946559 (owner: Herron)
[14:17:55] <wikibugs>	 (PS1) Hnowlan: thumbor: remove thumbor server configuration [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488)
[14:20:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wcqs2003:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:31:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50194 and previous config saved to /var/cache/conftool/dbconfig/20230808-143119-ladsgroup.json
[14:31:26] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:32:59] <wikibugs>	 (PS1) Ssingh: bird::anycast_hc: temporarily remove validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946952
[14:35:15] <wikibugs>	 SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (Jhancock.wm)
[14:35:21] <wikibugs>	 (PS2) Eevans: admin: add darthmon to releasers-wikibase [puppet] - https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968)
[14:37:24] <wikibugs>	 (CR) Bartosz Dziewoński: Revert "logspam.pl: Filter out some persistent noise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński)
[14:37:59] <wikibugs>	 SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (Jhancock.wm) @Papaul here are the ports on the switches Fasw-c8a: 16 Fasw-c8b: 16 mgmt: 18
[14:38:06] <wikibugs>	 (PS3) Bartosz Dziewoński: Revert "logspam.pl: Filter out some persistent noise" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254)
[14:38:28] <wikibugs>	 SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (Jhancock.wm)
[14:39:37] <wikibugs>	 (CR) Bartosz Dziewoński: "I see you've already scheduled this for deployment, thanks!" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński)
[14:39:49] <wikibugs>	 SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (Jhancock.wm) @Papaul ports are as follows Fasw-c8a: 15 Fasw-c8b: 15 mgmt: 16
[14:40:37] <wikibugs>	 (PS12) David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[14:43:33] <wikibugs>	 (CR) Ssingh: [C: +2] bird::anycast_hc: temporarily remove validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946952 (owner: Ssingh)
[14:46:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50195 and previous config saved to /var/cache/conftool/dbconfig/20230808-144625-ladsgroup.json
[14:47:01] <wikibugs>	 (CR) Filippo Giunchedi: "I'll let John comment on the envoy+cfssl bits and multiple ports bits, rest looks good" [puppet] - https://gerrit.wikimedia.org/r/946559 (owner: Herron)
[14:49:58] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:50:10] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:51:14] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage pc2016 [puppet] - https://gerrit.wikimedia.org/r/946954
[14:52:34] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage pc2016 [puppet] - https://gerrit.wikimedia.org/r/946954 (owner: Marostegui)
[14:52:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:54:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bookworm
[14:57:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50196 and previous config saved to /var/cache/conftool/dbconfig/20230808-150131-ladsgroup.json
[15:11:10] <wikibugs>	 (PS1) Ssingh: bird::anycast_hc: re-add validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946958
[15:13:03] <wikibugs>	 (CR) Ayounsi: [C: +1] bird::anycast_hc: re-add validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946958 (owner: Ssingh)
[15:13:25] <wikibugs>	 (CR) Ssingh: "https://puppet-compiler.wmflabs.org/output/946958/42798/durum6002.drmrs.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/946958 (owner: Ssingh)
[15:13:27] <wikibugs>	 (CR) Ssingh: [C: +2] bird::anycast_hc: re-add validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946958 (owner: Ssingh)
[15:14:44] <wikibugs>	 (CR) David Caro: [C: +2] "Tested in tools too, merging:" [puppet] - https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[15:14:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm
[15:15:25] <wikibugs>	 (CR) David Caro: [C: +2] replica_cnf_api: refactor to use multiple backends (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[15:16:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50197 and previous config saved to /var/cache/conftool/dbconfig/20230808-151637-ladsgroup.json
[15:16:41] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[15:18:38] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:19:20] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1082.eqiad.wmnet with reason: host reimage
[15:19:24] <sukhe>	 expected
[15:19:28] <sukhe>	 BGP/BFD
[15:19:29] <sukhe>	 drmrs
[15:19:46] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:22:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1082.eqiad.wmnet with reason: host reimage
[15:23:26] <wikibugs>	 (PS7) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[15:26:23] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:57] <wikibugs>	 (CR) Cathal Mooney: Policy and definition updates for post-migration esams ranges (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: Cathal Mooney)
[15:34:07] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:01] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (jijiki)
[15:36:13] <wikibugs>	 (CR) Effie Mouzeli: "oh wow" [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[15:37:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[15:41:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[15:44:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1082.eqiad.wmnet with OS bullseye
[15:48:47] <wikibugs>	 (PS4) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644
[15:48:49] <wikibugs>	 (PS6) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[15:48:51] <wikibugs>	 (PS1) Andrew Bogott: Add backy2 class to codfw1dev cinder-backup nodes [puppet] - https://gerrit.wikimedia.org/r/946962
[15:48:53] <wikibugs>	 (PS1) Andrew Bogott: wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963
[15:49:40] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add backy2 class to codfw1dev cinder-backup nodes [puppet] - https://gerrit.wikimedia.org/r/946962 (owner: Andrew Bogott)
[15:49:49] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808) >>! In T320390#9076268, @Jelto wrote: > As far as I understand login and registration of new accounts works fine and the co...
[15:51:07] <Lucas_WMDE>	 urbanecm: in order to test how IP Masking affects Wikidata-related extensions, it would be useful to have the feature enabled on Beta Wikidata (add it to wmgEnableIPMasking)
[15:51:19] <Lucas_WMDE>	 do you have any concerns or reservations about that? is there a phab task where we should track this?
[15:51:35] <Lucas_WMDE>	 (I found https://phabricator.wikimedia.org/T327420 but that sounds like it’s mainly for testing your team’s own features)
[15:52:26] <Lucas_WMDE>	 also, we don’t necessarily need it enabled permanently – can temporary accounts be turned off again later, or is it better to keep them enabled?
[15:52:47] <wikibugs>	 (CR) CI reject: [V: -1] wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 (owner: Andrew Bogott)
[15:53:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1083.eqiad.wmnet with OS bullseye
[15:56:29] <wikibugs>	 (PS5) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644
[15:56:31] <wikibugs>	 (PS7) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[15:56:33] <wikibugs>	 (PS2) Andrew Bogott: wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963
[15:56:35] <wikibugs>	 (PS1) Andrew Bogott: Add eqiad backy2 config to cloudbackup200[12] [puppet] - https://gerrit.wikimedia.org/r/946964
[15:57:55] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Add eqiad backy2 config to cloudbackup200[12] [puppet] - https://gerrit.wikimedia.org/r/946964 (owner: Andrew Bogott)
[16:00:04] <jouncebot>	 jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1600).
[16:00:04] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:12] <dancy>	 o/
[16:00:34] <wikibugs>	 (CR) CI reject: [V: -1] wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 (owner: Andrew Bogott)
[16:02:06] <wikibugs>	 (PS6) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644
[16:02:08] <wikibugs>	 (PS8) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643
[16:02:10] <wikibugs>	 (PS3) Andrew Bogott: wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963
[16:02:12] <wikibugs>	 (PS1) Andrew Bogott: cloudbackup200[12]: remove some spurious config from the last patch [puppet] - https://gerrit.wikimedia.org/r/946965
[16:06:09] <wikibugs>	 (CR) CI reject: [V: -1] wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 (owner: Andrew Bogott)
[16:08:40] <wikibugs>	 (PS5) Volans: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[16:10:32] <wikibugs>	 (CR) CI reject: [V: -1] WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[16:12:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T342617)', diff saved to https://phabricator.wikimedia.org/P50200 and previous config saved to /var/cache/conftool/dbconfig/20230808-161244-ladsgroup.json
[16:12:52] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[16:13:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm
[16:13:36] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:13:54] <wikibugs>	 (PS6) Volans: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[16:14:04] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:15:46] <wikibugs>	 (CR) CI reject: [V: -1] WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[16:27:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P50201 and previous config saved to /var/cache/conftool/dbconfig/20230808-162750-ladsgroup.json
[16:28:46] <dancy>	 jbond/rzl: Is the puppet window happening today?
[16:31:12] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:36:58] <wikibugs>	 SRE, SRE-OnFire, Incident Tooling: Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (lmata)
[16:37:14] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:42:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P50202 and previous config saved to /var/cache/conftool/dbconfig/20230808-164256-ladsgroup.json
[16:42:58] <wikibugs>	 (PS4) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820)
[16:43:06] <wikibugs>	 (CR) CI reject: [V: -1] elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) (owner: Ryan Kemper)
[16:58:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1083.eqiad.wmnet with reason: host reimage
[16:58:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T342617)', diff saved to https://phabricator.wikimedia.org/P50203 and previous config saved to /var/cache/conftool/dbconfig/20230808-165803-ladsgroup.json
[16:58:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[16:58:06] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[16:58:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[16:58:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50204 and previous config saved to /var/cache/conftool/dbconfig/20230808-165824-ladsgroup.json
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1700)
[17:01:51] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1083.eqiad.wmnet with reason: host reimage
[17:02:43] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:18] <wikibugs>	 (PS1) Btullis: Correct the role for the new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/946978 (https://phabricator.wikimedia.org/T343762)
[17:05:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T342617)', diff saved to https://phabricator.wikimedia.org/P50205 and previous config saved to /var/cache/conftool/dbconfig/20230808-170521-ladsgroup.json
[17:05:26] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[17:06:56] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42799/console" [puppet] - https://gerrit.wikimedia.org/r/946978 (https://phabricator.wikimedia.org/T343762) (owner: Btullis)
[17:07:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:30] <icinga-wm>	 RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:19:12] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Correct the role for the new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/946978 (https://phabricator.wikimedia.org/T343762) (owner: Btullis)
[17:20:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P50206 and previous config saved to /var/cache/conftool/dbconfig/20230808-172027-ladsgroup.json
[17:21:44] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:22:04] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:22:04] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs1001.eqiad.wmnet with OS bullseye
[17:24:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs1002.eqiad.wmnet with OS bullseye
[17:24:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1083.eqiad.wmnet with OS bullseye
[17:28:54] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:14] <wikibugs>	 (PS1) JMeybohm: deployment_server::general: Globally enable mesh.certmanager [puppet] - https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033)
[17:30:09] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:23] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:31:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm
[17:31:57] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[17:33:07] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:33:31] <sukhe>	 ^ expected, BGP
[17:33:53] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:34:37] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:34:39] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:34:53] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:35:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P50207 and previous config saved to /var/cache/conftool/dbconfig/20230808-173534-ladsgroup.json
[17:35:36] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1001.eqiad.wmnet with reason: host reimage
[17:36:29] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1002.eqiad.wmnet with reason: host reimage
[17:37:49] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:38:44] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1001.eqiad.wmnet with reason: host reimage
[17:38:48] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:39:04] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:40:10] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:41:14] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1002.eqiad.wmnet with reason: host reimage
[17:46:37] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:48:25] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:50:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T342617)', diff saved to https://phabricator.wikimedia.org/P50208 and previous config saved to /var/cache/conftool/dbconfig/20230808-175040-ladsgroup.json
[17:50:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[17:50:44] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[17:50:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[17:51:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T342617)', diff saved to https://phabricator.wikimedia.org/P50209 and previous config saved to /var/cache/conftool/dbconfig/20230808-175101-ladsgroup.json
[17:52:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[17:55:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[17:56:29] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wcqs1001.eqiad.wmnet with OS bullseye
[18:00:04] <jouncebot>	 brennen and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1800).
[18:00:46] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:01:12] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.916 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:02:51] <brennen>	 (no train this week.)
[18:03:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudbackup200[12]: remove some spurious config from the last patch [puppet] - https://gerrit.wikimedia.org/r/946965 (owner: Andrew Bogott)
[18:04:52] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:05:44] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:10:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.421 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:12:10] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host durum4001.ulsfo.wmnet with OS bookworm
[18:12:27] <wikibugs>	 (CR) Ahmon Dancy: "Today's puppet window didn't happen so I moved this to Thursday's" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński)
[18:12:28] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bullseye
[18:12:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:21:14] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:21:18] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:22:12] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:27:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[18:29:44] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:31:39] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[18:31:48] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.853 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:31:54] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.933 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:33:25] <wikibugs>	 (CR) Brennen Bearnes: Revert "logspam.pl: Filter out some persistent noise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński)
[18:38:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:40:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:40:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:40:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:43:12] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:43:16] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal
[18:43:37] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[18:43:42] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.941 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:43:46] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.405 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:45:27] <logmsgbot>	 !log ryankemper@deploy1002 deploy aborted: 0.3.124 (duration: 01m 50s)
[18:45:50] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: whitelist new qlever endpoints
[18:46:10] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal
[18:46:58] <RhinosF1>	 ryankemper: are you doing something with wcqs ? That pybal alert is for it
[18:47:45] <inflatador>	 RhinosF1 I am reimaging some wcqs servers. They should be depooled though
[18:48:00] <inflatador>	 there's only 3 hosts so maybe I tripped a threshold
[18:48:58] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: whitelist new qlever endpoints (duration: 03m 08s)
[18:49:16] <RhinosF1>	 inflatador: probably
[18:49:16] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wcqs,name=eqiad
[18:51:01] <inflatador>	 RhinosF1 eqiad is depooled now, should be OK. Thanks for the heads up!
[18:51:30] <RhinosF1>	 Np :)
[18:52:50] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:52:56] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:52:56] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 90, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:54:02] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 2
[18:54:28] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:54:32] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:54:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bullseye
[18:55:22] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:59:58] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:02:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:04:52] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:04:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[19:04:58] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:05:36] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 2 (duration: 11m 34s)
[19:06:30] <ryankemper>	 !log [WDQS] Depooled `wdqs1006` while it catches up on 7 hours of lag
[19:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:40] <gehel>	 inflatador: ^
[19:07:35] <inflatador>	 ACK
[19:11:25] <wikibugs>	 (PS1) Kimberly Sarabia: Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886)
[19:12:03] <wikibugs>	 (CR) CI reject: [V: -1] Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: Kimberly Sarabia)
[19:15:00] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:15:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:15:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:17:58] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:19:14] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:19:40] <wikibugs>	 (PS2) Kimberly Sarabia: Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886)
[19:20:08] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:20:16] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.211 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:21:43] <wikibugs>	 (CR) Physikerwelt: "... now that restbase does not use png images from mathoid anymore, we can also deploy the "new" mathoid version" [deployment-charts] - https://gerrit.wikimedia.org/r/906694 (owner: PipelineBot)
[19:21:48] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:22:45] <wikibugs>	 (PS1) Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353)
[19:23:14] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[19:24:20] <icinga-wm>	 PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:24:20] <icinga-wm>	 PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:24:22] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:28:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs[1001-1003].eqiad.wmnet with reason: T331300
[19:28:06] <stashbot>	 T331300: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300
[19:28:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs[1001-1003].eqiad.wmnet with reason: T331300
[19:29:16] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:29:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:31:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:33:44] <wikibugs>	 (PS2) Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353)
[19:35:20] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.994 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:36:04] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:38:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.957 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:45:20] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:46:02] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:46:08] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:49:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:52:00] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:52:02] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.510 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:59:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T2000). nyaa~
[20:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:13] <urbanecm>	 i can deploy today
[20:00:20] <MatmaRex>	 hi
[20:00:32] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:00:40] * urbanecm presumes he needs to do the config change first
[20:00:56] <MatmaRex>	 urbanecm: can you double-check if i got the set operations right in my ugly config change? :P
[20:00:57] <MatmaRex>	 yes
[20:01:16] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:01:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:02:02] <MatmaRex>	 oh, i guess i can look at https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/3841/consoleFull
[20:02:28] <MatmaRex>	 there are no wikis where `"wgDiscussionToolsEnablePermalinksBackend": true,` got removed, so it's probably correct
[20:03:26] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:03:58] <bd808>	 Is https://lists.wikimedia.org/ wicked slow for everyone or just me?
[20:04:12] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.778 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:04:16] <urbanecm>	 i got myself confused by commit message saying "enable", and the patch adding a bunch of "=> false"  rows
[20:04:22] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 6.795 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:04:27] <urbanecm>	 bd808: given the recovery...try now? :))
[20:05:23] <MatmaRex>	 urbanecm: yes, i had to swap the defaults, that's intended (but ugly)
[20:05:32] <urbanecm>	 MatmaRex: just to ensure i understand the goal: you want to set the variable to true on all wikis, except for group2 wikis that are in s1 or s7?
[20:05:42] <MatmaRex>	 yes
[20:06:54] * urbanecm does a few double checks
[20:08:46] <urbanecm>	 MatmaRex: most of s7 wikis are finished by now (only viwiki is still running). do we want to enable the variable on all of s7 at once?
[20:09:50] <MatmaRex>	 urbanecm: if you want to fiddle with running a bunch more maintenance scripts that seem to fail silently all the time, then sure :P
[20:10:07] <MatmaRex>	 i think i'd rather have fewer crappy scripts and wait longer
[20:10:49] <urbanecm>	 makes sense; i thought the persistRevisionThreadItems.php step for those wikis is already completed, but i might be wrong on that :)
[20:11:10] <MatmaRex>	 yeah, you're right that it's completed on s7 except viwiki
[20:11:17] <bd808>	 no joy in hoping that mailman got faster when the icinga check resolved. 14 to 25 seconds for `time curl https://lists.wikimedia.org/postorius/lists/` from my local. Less than a second for [[mw:Main Page]] (with some cache busting to keep the CDN from being the difference)
[20:11:25] <MatmaRex>	 but i don't want to run scripts separately on every wiki
[20:11:35] <MatmaRex>	 and i also don't want to have two runs on s7 in progress at once
[20:11:46] <urbanecm>	 makes sense. let's go as is then.
[20:11:49] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński)
[20:11:54] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:11:56] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:12:01] <MatmaRex>	 (it would work, i just don't want to document what is going on when we do that)
[20:12:14] <MatmaRex>	 (the task is already messed up enough)
[20:12:28] <wikibugs>	 (Merged) jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński)
[20:12:36] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:12:37] <MatmaRex>	 thank you!
[20:13:01] <urbanecm>	 not disputing, just was trying to understand the reasoning :)
[20:13:08] <wikibugs>	 SRE, AQS2.0, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman)
[20:13:18] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 3 T339347
[20:13:23] <stashbot>	 T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347
[20:13:26] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:946998|Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 (T315353)]]
[20:13:29] <stashbot>	 T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353
[20:13:39] <urbanecm>	 MatmaRex: there's nothing that can be meaningfully tested on mwdebug with this change, right?
[20:14:28] <ryankemper>	 !log [WDQS] Lag caught up on `wdqs1006`; repooled -> `ryankemper@wdqs1006:~$ sudo pool`
[20:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:44] <MatmaRex>	 urbanecm: i could check that Special:FindComment works, one second
[20:14:57] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:946998|Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 (T315353)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:15:04] <urbanecm>	 just in time :)
[20:15:12] <urbanecm>	 MatmaRex: okay, pausing, let me know how it looks
[20:16:12] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 3 T339347 (duration: 02m 54s)
[20:16:34] <wikibugs>	 (CR) Krinkle: [C: +1] "Feel free to schedule for backport deploy any time." [mediawiki-config] - https://gerrit.wikimedia.org/r/930798 (owner: D3r1ck01)
[20:17:05] <MatmaRex>	 so, this page: https://pl.wikipedia.org/wiki/Specjalna:Znajdź_komentarz?idorname=c-Jcubic-20230530085900-Matma_Rex-20230529230500
[20:17:16] <MatmaRex>	 shows no results in production
[20:17:21] <MatmaRex>	 but it has results on mwdebug
[20:17:26] <MatmaRex>	 which looks perfect :)
[20:17:42] <urbanecm>	 awesome
[20:17:46] <urbanecm>	 proceeding
[20:17:46] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and matmarex: Continuing with sync
[20:17:51] <MatmaRex>	 (the results are outdated, and that will be fixed by the second script run)
[20:20:42] <urbanecm>	 in the meantime: i saw T332738 linked from the config patch, and i'm not sure it would actually resolve the issue. even if group2 was useable in the configuration, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/901697 probably wouldn't do what you intended (group0+group1+group2 has all wikis, so that patch would be functionally equivalent to `'default' => true`)
[20:20:43] <stashbot>	 T332738: 'group2' doesn't work to specify configuration in operations/mediawiki-config InitialiseSettings.php etc. - https://phabricator.wikimedia.org/T332738
[20:21:42] <urbanecm>	 what _would_ help is the ability to combine dblists, and the ability to do `'group2 - s7' => false`, for example, but that doesn't seem to be requested in that task
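The set algebra in the messages above can be sketched roughly as follows. This is only an illustration, not how operations/mediawiki-config evaluates dblists: the wiki names and list memberships are invented placeholders, and the two helper functions are hypothetical. It shows that the union of group0, group1, and group2 covers every wiki (so keying a setting on all three equals `'default' => true`), while an expression like `group2 - s7` picks out a strict subset.

```python
# Hypothetical dblists; names and memberships are placeholders, not production data.
ALL_WIKIS = {"wiki_a", "wiki_b", "wiki_c", "wiki_d", "wiki_e"}
group0 = {"wiki_a"}
group1 = {"wiki_b"}
group2 = ALL_WIKIS - group0 - group1  # assumption: the groups partition all wikis
s7 = {"wiki_b", "wiki_d"}             # hypothetical database section membership


def union_covers_everything():
    """True when group0 + group1 + group2 is the same set as all wikis."""
    return (group0 | group1 | group2) == ALL_WIKIS


def group2_minus_s7():
    """Wikis selected by the proposed 'group2 - s7' expression."""
    return group2 - s7


if __name__ == "__main__":
    print(union_covers_everything())
    print(sorted(group2_minus_s7()))
```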
[20:22:20] <wikibugs>	 SRE, Traffic: Add a reboot action to the Wikimedia DNS restart cookbook - https://phabricator.wikimedia.org/T342182 (BCornwall) In progress→Resolved
[20:22:21] <urbanecm>	 MatmaRex: if i'm missing something obvious, happy to hear it, otherwise we can leave it as an async thing for the task :)
[20:22:38] <MatmaRex>	 urbanecm: so my first approach to today's patch was default=>false, group0=>true, group1=>true, s2=>true, s3=>true, etc.
[20:22:44] <wikibugs>	 (CR) Jdlrobson: [C: +1] Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: Kimberly Sarabia)
[20:22:55] <MatmaRex>	 i think that if that worked, it would be the same as what i ended up doing, and it would be a more obvious diff
[20:23:03] <wikibugs>	 SRE, Traffic: dnsbox: Add gdnsd to bird's BindsTo systemd service - https://phabricator.wikimedia.org/T336973 (BCornwall) In progress→Resolved
[20:23:43] <MatmaRex>	 and 's2' etc. also looks to be not usable in the config files, like 'group2'
[20:24:22] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:946998|Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 (T315353)]] (duration: 10m 55s)
[20:24:25] <stashbot>	 T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353
[20:24:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:24:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:06] <MatmaRex>	 i hope that's not my fault, but if it is, just revert
[20:25:18] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:25:20] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:25:24] <urbanecm>	 very likely isn't
[20:26:27] <urbanecm>	 MatmaRex: gotcha. the issue with that approach would actually be the combination of groupX and sX lists. to give an example: if that was allowed, what would we want `default => false, group1 => true, s7 => false` to do for e.g. `frwiktionary`? should it be `false`, because it is an s7 wiki? or should it be true, because it is a group1 wiki?
[20:26:42] <urbanecm>	 anyway, we're synced, so starting the scripts now
[20:27:15] <wikibugs>	 (CR) BCornwall: [C: +2] Rebuild against Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[20:27:16] <MatmaRex>	 yeah, that would be ambiguous
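The ambiguity discussed above can be sketched in a few lines. This is only an illustration, not how operations/mediawiki-config resolves settings: the dblist membership below is hypothetical (it simply follows the example in the chat), and `matching_keys` and `resolve` are made-up helper names.

```python
# Hypothetical dblist membership, following the example in the chat.
# The real lists live in operations/mediawiki-config and may differ.
DBLISTS = {
    "group1": {"frwiktionary", "cawiki"},
    "s7": {"frwiktionary", "eswiki"},
}


def matching_keys(setting, wiki):
    """Return every non-default key in `setting` whose dblist contains `wiki`."""
    return [
        key
        for key in setting
        if key != "default" and wiki in DBLISTS.get(key, set())
    ]


def resolve(setting, wiki):
    """Resolve a setting for `wiki`, refusing to guess when keys disagree."""
    keys = matching_keys(setting, wiki)
    values = {setting[key] for key in keys}
    if len(values) > 1:
        # Two overlapping lists give different answers and nothing says which wins.
        raise ValueError(f"ambiguous value for {wiki}: keys {keys} disagree")
    if values:
        return values.pop()
    return setting["default"]


setting = {"default": False, "group1": True, "s7": False}
```

With this setting, a wiki that is only in `group1` or only in `s7` resolves cleanly, while `resolve(setting, "frwiktionary")` raises, because both keys match and they disagree.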
[20:27:58] <kimberly_sarabia>	 Hello. I have a last-minute patch I just added to the late backport window. If it's too late, no worries. Let me know
[20:28:36] <urbanecm>	 kimberly_sarabia: can you add it to the Wikitech calendar please? i think we can make it too :)
[20:28:40] <urbanecm>	 (and hello!)
[20:29:13] <urbanecm>	 !log mwmaint1002: `foreachwikiindblist 'group2 & s2' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315353)
[20:29:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:30:01] <kimberly_sarabia>	 urbanecm: Thanks! I think I added it: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T2000
[20:30:04] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:30:06] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:30:26] <urbanecm>	 !log mwmaint1002: `foreachwikiindblist 'group2 & s3' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315353)
[20:30:27] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: whitelist new qlever endpoints take 4 (forgot git pull) T339347
[20:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:31] <urbanecm>	 !log mwmaint1002: `foreachwikiindblist 'group2 & s5' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315353)
[20:30:32] <stashbot>	 T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353
[20:30:37] <stashbot>	 T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347
[20:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:05] <urbanecm>	 !log mwmaint1002: `foreachwikiindblist 'group2 & s6' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315510)
[20:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:09] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[20:31:24] <urbanecm>	 MatmaRex: scripts started, my apologies for accidentally pasting the wrong task id to the log entries
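The `'group2 & s2'` style expressions in the log entries above, and the `'group2 - s7'` idea mentioned earlier, read like set operations on dblists. A minimal sketch under that assumption: `eval_dblist_expr` is a hypothetical helper, not the real `foreachwikiindblist` implementation, the wiki names are made up, and left-to-right evaluation is assumed.

```python
def eval_dblist_expr(expr, dblists):
    """Evaluate 'a & b - c' style expressions left to right over sets of wiki names.

    Assumes '&' means intersection and '-' means difference.
    """
    tokens = expr.split()
    if len(tokens) % 2 == 0:
        # Covers the empty string and a dangling operator.
        raise ValueError(f"malformed expression: {expr!r}")
    result = set(dblists[tokens[0]])
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "&":
            result &= dblists[name]
        elif op == "-":
            result -= dblists[name]
        else:
            raise ValueError(f"unsupported operator: {op!r}")
    return result


# Hypothetical membership, for illustration only.
example_dblists = {
    "group2": {"enwiki", "frwiki", "eswiki"},
    "s2": {"eswiki", "bgwiki"},
    "s7": {"frwiki"},
}
```

Here `eval_dblist_expr("group2 & s2", example_dblists)` keeps only wikis present in both lists, and `"group2 - s7"` drops the s7 wikis from group2.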
[20:31:57] <wikibugs>	 (PS3) Urbanecm: Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: Kimberly Sarabia)
[20:32:01] <wikibugs>	 (CR) Urbanecm: [C: +2] Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: Kimberly Sarabia)
[20:32:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:32:30] <jinxer-wm>	 (Traffic bill over quota) firing: (2) Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[20:32:41] <wikibugs>	 (Merged) jenkins-bot: Deploy to CN language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: Kimberly Sarabia)
[20:34:11] <MatmaRex>	 urbanecm: thanks, no problem
[20:34:40] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:946997|Deploy to CN language wikis (T335886)]]
[20:34:44] <stashbot>	 T335886: Deploy Vector 2022 as the default desktop skin to next set of wikis - https://phabricator.wikimedia.org/T335886
[20:34:49] <urbanecm>	 kimberly_sarabia: starting with your patch now! :)
[20:35:05] <kimberly_sarabia>	 urbanecm: thanks
[20:36:08] <logmsgbot>	 !log urbanecm@deploy1002 ksarabia and urbanecm: Backport for [[gerrit:946997|Deploy to CN language wikis (T335886)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:36:24] <urbanecm>	 kimberly_sarabia: can you test your patch at mwdebug1001 and let me know how it looks please?
[20:36:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:30] <jinxer-wm>	 (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[20:37:50] <kimberly_sarabia>	 urbanecm: LGTM
[20:37:55] <urbanecm>	 thanks, proceeding
[20:37:56] <logmsgbot>	 !log urbanecm@deploy1002 ksarabia and urbanecm: Continuing with sync
[20:41:12] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: whitelist new qlever endpoints take 4 (forgot git pull) T339347 (duration: 10m 44s)
[20:41:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:41:20] <stashbot>	 T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347
[20:43:49] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:946997|Deploy to CN language wikis (T335886)]] (duration: 09m 08s)
[20:43:52] <stashbot>	 T335886: Deploy Vector 2022 as the default desktop skin to next set of wikis - https://phabricator.wikimedia.org/T335886
[20:43:56] <urbanecm>	 kimberly_sarabia: should be deployed now
[20:43:57] <urbanecm>	 anything else?
[20:44:17] <kimberly_sarabia>	 urbanecm: tysm! all set
[20:44:21] <urbanecm>	 awesome
[20:44:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:45:44] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:30] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:32] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:52:30] <jinxer-wm>	 (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[20:52:41] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177
[20:52:59] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 (duration: 00m 18s)
[20:53:02] <icinga-wm>	 RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:53:08] <icinga-wm>	 RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs1001 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[20:54:08] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:54:10] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:56:12] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:20] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:56:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:57:20] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs1002.eqiad.wmnet with OS bullseye
[20:57:30] <jinxer-wm>	 (Traffic bill over quota) resolved: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[20:58:12] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177
[20:58:29] <urbanecm>	 MatmaRex: are you still here by any chance?
[20:58:30] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 (duration: 00m 17s)
[20:58:37] <urbanecm>	 i got this error with the script https://www.irccloud.com/pastebin/J1r6BZCe/
[20:58:45] <urbanecm>	 and i don't recall seeing it during the previous runs
[20:58:47] <urbanecm>	 any idea?
[20:59:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans)
[21:00:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans) p:Triage→Medium a:Eevans
[21:02:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans) >>! In T343596#9074468, @Seddon wrote: > Approved (direct manager)  @thcipriani you are the full list of approvers for group //restricted//; Ok to proceed?
[21:02:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs1003.eqiad.wmnet with OS bullseye
[21:02:35] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs,name=eqiad
[21:04:30] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=no; selector: name=wcqs1003.eqiad.wmnet,service=wcqs
[21:06:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108
[21:06:23] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108
[21:06:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:09:12] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:09:16] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:26] <wikibugs>	 (PS1) BCornwall: Rebuild against Bookworm, not Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/947009
[21:12:42] <wikibugs>	 (CR) Ssingh: [C: +1] Rebuild against Bookworm, not Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/947009 (owner: BCornwall)
[21:12:57] <urbanecm>	 filed my ping as T343859
[21:12:57] <stashbot>	 T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859
[21:13:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans)
[21:13:50] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:46] <wikibugs>	 (CR) BCornwall: [C: +2] Rebuild against Bookworm, not Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/947009 (owner: BCornwall)
[21:15:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage
[21:15:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans)
[21:16:00] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:17:24] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:18:02] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage
[21:21:18] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans) Hi @Tsevener, we'll need to verify your key out-of-band (of Phabricator).  It might be easiest to add it to a user page (for example, https://www.mediawiki.org/wiki/User:Adee_Ritman...
[21:22:10] <brett>	 !log Exported varnish-modules 0.15.0-4 for bookworm-wikimedia (T342154)
[21:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:15] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[21:24:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (thcipriani) >>! In T343596#9078742, @Eevans wrote: >>>! In T343596#9074468, @Seddon wrote: >> Approved (direct manager) >  > @thcipriani you are the full list of approvers for group //restr...
[21:24:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (thcipriani)
[21:24:57] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (Eevans) a:Eevans
[21:28:02] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:30:32] <MatmaRex>	 urbanecm: sorry, was away for a bit. hopefully it's not very common?
[21:31:07] <MatmaRex>	 it's a new error from new code. it should be impossible :) so i'll ask amir to look at it
[21:32:31] <MatmaRex>	 looks like it's from the last deployment, not from the maintenance script. a few hundred since yesterday
[21:32:34] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:33:30] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.567 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:58] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.936 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (Eevans)
[21:38:20] <MatmaRex>	 (i replied on the task)
[21:43:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (Eevans)
[21:44:34] <wikibugs>	 (PS3) Bartosz Dziewoński: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696)
[21:44:48] <wikibugs>	 (CR) Bartosz Dziewoński: [C: -1] "(waiting for deployment train)" [mediawiki-config] - https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) (owner: Bartosz Dziewoński)
[21:45:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:45:38] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:46:08] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:46:11] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs1003.eqiad.wmnet with OS bullseye
[21:46:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to  releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (Eevans) p:Triage→Medium a:Eevans
[21:46:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:47:30] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:48:26] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:48:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:38] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:57:40] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:58:10] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:58:38] <icinga-wm>	 PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:58:50] <icinga-wm>	 PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs1003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:59:42] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:59:53] <wikibugs>	 (PS1) Bartosz Dziewoński: Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056)
[22:00:45] <wikibugs>	 (CR) Bartosz Dziewoński: [C: -1] "(not scheduled yet)" [mediawiki-config] - https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) (owner: Bartosz Dziewoński)
[22:03:38] <wikibugs>	 (PS3) Eevans: admin: add darthmon to releasers-wikibase [puppet] - https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968)
[22:03:53] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177
[22:04:11] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 (duration: 00m 17s)
[22:04:40] <icinga-wm>	 RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs1003 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:04:50] <icinga-wm>	 RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:07:08] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:08:08] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:08:08] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 1.425 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:13:14] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:18:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:18:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:19:20] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:23:26] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.905 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:23:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:24:42] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.457 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:33:07] <urbanecm>	 !log mwmaint1002: stop persistRevisionThreadItems.php frwiki instance because of T343859 (cc T315510)
[22:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:13] <stashbot>	 T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859
[22:33:14] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[22:33:58] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:34:06] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:34:30] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:38:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:39:50] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:40:06] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.825 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:34] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:51:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:38] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:59:36] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.529 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:02:02] <wikibugs>	 (PS1) Eevans: admin: add roti to releasers-wikibase [puppet] - https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972)
[23:04:10] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:04:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:05:36] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.408 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:05:54] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Izno)
[23:06:08] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Izno)
[23:14:52] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:15:00] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:15:20] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:18:33] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (dr0ptp4kt) I'm going to open a separate task for the `wmcs-admin` membership, but leave this task here open so we can continue to explore the additional matter of being...
[23:19:44] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:20:48] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:21:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.831 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:28:40] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:30:04] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:30:30] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:31:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:33:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (dr0ptp4kt)
[23:34:28] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.973 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:34:34] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:39:10] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:39:47] <wikibugs>	 (PS1) Dr0ptp4kt: Add dr0ptp4kt to wmcs-admin [puppet] - https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862)
[23:40:50] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:44:08] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:48:30] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:49:36] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.597 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:49:44] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:52:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50211 and previous config saved to /var/cache/conftool/dbconfig/20230808-235258-ladsgroup.json
[23:53:02] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617