[00:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T342617)', diff saved to https://phabricator.wikimedia.org/P50169 and previous config saved to /var/cache/conftool/dbconfig/20230808-000859-ladsgroup.json [00:09:04] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:24:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P50170 and previous config saved to /var/cache/conftool/dbconfig/20230808-002405-ladsgroup.json [00:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945839 [00:38:40] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945839 (owner: 10TrainBranchBot) [00:39:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P50171 and previous config saved to /var/cache/conftool/dbconfig/20230808-003911-ladsgroup.json [00:45:43] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T342617)', diff saved to https://phabricator.wikimedia.org/P50172 and previous config saved to /var/cache/conftool/dbconfig/20230808-005418-ladsgroup.json [00:54:20] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [00:54:22] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:54:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [00:54:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T342617)', diff saved to https://phabricator.wikimedia.org/P50173 and previous config saved to /var/cache/conftool/dbconfig/20230808-005439-ladsgroup.json [00:58:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945839 (owner: 10TrainBranchBot) [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T343774 (10phaultfinder) [01:40:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T342617)', diff saved to https://phabricator.wikimedia.org/P50174 and previous config saved to /var/cache/conftool/dbconfig/20230808-014007-ladsgroup.json [01:40:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:53:29] (03PS1) 10Andrew Bogott: mwopenstackclients: add allvolumes() shortcut [puppet] - 10https://gerrit.wikimedia.org/r/946642 [01:53:31] (03PS1) 10Andrew Bogott: WIP: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [01:54:44] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: add allvolumes() shortcut [puppet] - 10https://gerrit.wikimedia.org/r/946642 (owner: 10Andrew Bogott) [01:55:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P50175 and previous config saved to /var/cache/conftool/dbconfig/20230808-015513-ladsgroup.json [01:56:10] (03CR) 10CI reject: [V: 04-1] WIP: add volumes functionality to 
wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0200) [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P50176 and previous config saved to /var/cache/conftool/dbconfig/20230808-021020-ladsgroup.json [02:18:41] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:51] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:42] (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T342617)', diff saved to https://phabricator.wikimedia.org/P50177 and previous config saved to /var/cache/conftool/dbconfig/20230808-022526-ladsgroup.json [02:25:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [02:25:30] T342617: Make 
old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [02:25:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [02:25:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T342617)', diff saved to https://phabricator.wikimedia.org/P50178 and previous config saved to /var/cache/conftool/dbconfig/20230808-022547-ladsgroup.json [02:30:57] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:07] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0300) [03:02:15] (03PS2) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [03:02:17] (03PS1) 10Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - 10https://gerrit.wikimedia.org/r/946644 [03:05:29] (03CR) 10CI reject: [V: 04-1] add volumes functionality to wmcs-backup [puppet] - 
10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [03:05:33] (03CR) 10CI reject: [V: 04-1] wmcs-backup: move image_info into the ImageBackupsState class [puppet] - 10https://gerrit.wikimedia.org/r/946644 (owner: 10Andrew Bogott) [03:09:26] (03PS2) 10Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - 10https://gerrit.wikimedia.org/r/946644 [03:09:28] (03PS3) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [03:10:24] (03PS3) 10Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - 10https://gerrit.wikimedia.org/r/946644 [03:10:26] (03PS4) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:25:54] (03PS5) 10Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - 10https://gerrit.wikimedia.org/r/946643 [03:27:56] (03PS1) 10Anzx: Update piwiki legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946645 (https://phabricator.wikimedia.org/T305950) [03:32:21] (03PS1) 10Anzx: Update idwiktionary old vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946666 (https://phabricator.wikimedia.org/T341175) [03:35:48] (03CR) 10Andrew Bogott: "I've only tested this with --noop but it seems like it should work..." 
[puppet] - 10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [04:23:43] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:19] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:43:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:45:54] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Kelson) [05:47:47] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi) [05:48:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:50:19] 
10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Kelson) @akosiaris @MSantos May I underline Vadim's request: clarifying if we (at Kiwix) can still benefit from the `mobile-sect... [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0600) [06:00:04] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0600). [06:08:25] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:17] (03PS2) 10Elukey: admin_ng: allow host headers for base domain in istio mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946593 (https://phabricator.wikimedia.org/T343740) [06:22:43] (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:44] (03CR) 10Giuseppe Lavagetto: "Overall lgtm, I have one general usability doubt." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [06:36:58] (03PS1) 10Stevemunene: Add datahub_staging cname [dns] - 10https://gerrit.wikimedia.org/r/946851 (https://phabricator.wikimedia.org/T343236) [06:38:27] (03PS1) 10KartikMistry: testwiki: Enable Section Translation for 7 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211) [06:51:49] (03CR) 10Vgutierrez: "change makes sense but now the puppetization is inconsistent. ip_reputation can be enabled for an upload node but it would be a NOOP in te" [puppet] - 10https://gerrit.wikimedia.org/r/946566 (owner: 10Giuseppe Lavagetto) [07:00:04] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T0700) [07:00:04] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:47] (03CR) 10Elukey: [C: 03+2] admin_ng: allow host headers for base domain in istio mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946593 (https://phabricator.wikimedia.org/T343740) (owner: 10Elukey) [07:03:03] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:06:35] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[07:06:35] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:07:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [07:07:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [07:07:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:07:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:25:35] (03PS3) 10Giuseppe Lavagetto: cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/946566 [07:28:20] (03PS1) 10Ayounsi: BGPalerter: mute software-update notifications [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) [07:29:54] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42796/console" [puppet] - 10https://gerrit.wikimedia.org/r/946566 (owner: 10Giuseppe Lavagetto) [07:30:20] (03PS2) 10Ayounsi: BGPalerter: mute software-update notifications [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) [07:31:07] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [07:31:37] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/946566 (owner: 10Giuseppe Lavagetto) [07:32:00] (03PS3) 10Ayounsi: BGPalerter: mute software-update notifications [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) [07:32:13] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 
10Ayounsi) [07:34:34] (03PS4) 10Giuseppe Lavagetto: cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/946566 [07:34:45] (03PS4) 10Ayounsi: BGPalerter: mute software-update notifications [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) [07:35:02] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [07:52:43] (SystemdUnitFailed) firing: (3) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:25] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:05:35] PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:07:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:09:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:10:48] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:14:12] cannot ssh to ml-serve2004, either a host or a network issue [08:18:09] probably host, as I see no cmd line on the mgmt [08:21:20] 14 minutes is too long to wait for a reboot, and elukey is not around, so I am going to force a soft power restart [08:21:46] oh, he is [08:21:57] so waiting for his ok, maybe it is just maintenance [08:22:00] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Maybe it makes sense to create a dedicated task to discuss the general usage and policies for developer account naming? The... [08:24:25] (03CR) 10Btullis: Add datahub_staging cname (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/946851 (https://phabricator.wikimedia.org/T343236) (owner: 10Stevemunene) [08:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T342617)', diff saved to https://phabricator.wikimedia.org/P50179 and previous config saved to /var/cache/conftool/dbconfig/20230808-082539-ladsgroup.json [08:25:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:30:16] jynus: checking thanks :) [08:30:21] (I was afk) [08:31:28] no problem, in fact only because you were around I didn't take further action [08:31:44] let me know if a reboot is needed [08:33:04] !log powercycle ml-serve2004 - mgmt console without tty available, DIMM errors in getsel [08:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:18] jynus: kicked off one, there are some DIMM errors in getsel though, not great [08:33:26] :-( [08:33:36] yeah, it looked like it was stuck [08:36:19] RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, 
RTA = 34.33 ms [08:37:43] my fear was that you were doing a reimage or some other maintenance, so I chose to wait for your feedback, rather than a small chance of ruining something [08:37:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:38:18] jynus: thanks! [08:38:34] for the moment let's see if it was a transient issue or not, in case of another freeze I'll involve dcops [08:39:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:40:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P50180 and previous config saved to /var/cache/conftool/dbconfig/20230808-084045-ladsgroup.json [08:40:49] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:41:58] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/946566 (owner: 10Giuseppe Lavagetto) [08:44:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:45:32] !log restart debmonitor2003 services [08:45:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:47] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:46:03] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:46:17] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:46:19] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:46:43] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/946566 (owner: 10Giuseppe Lavagetto) [08:47:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:47:51] (03PS1) 10Majavah: base: set Precedence: Bulk header in notify_maintainers [puppet] - 10https://gerrit.wikimedia.org/r/946924 [08:50:51] RECOVERY - BFD status on cr2-drmrs is OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:51:09] (03PS1) 10Giuseppe Lavagetto: Revert "cache: move vendor proxy lookup to cluster_fe_ratelimit" [puppet] - 10https://gerrit.wikimedia.org/r/946647 [08:51:11] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/946924 (owner: 10Majavah) [08:51:18] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "cache: move vendor proxy lookup to cluster_fe_ratelimit" [puppet] - 
10https://gerrit.wikimedia.org/r/946647 (owner: 10Giuseppe Lavagetto) [08:51:47] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:51:53] <_joe_> cdanis: can I merge your change? [08:51:57] <_joe_> err dcaro [08:52:02] <_joe_> sorry cdanis :) [08:52:05] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:52:13] _joe_: yep :) [08:52:17] thanks [08:52:17] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:52:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:52:41] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T342617)', diff saved to https://phabricator.wikimedia.org/P50181 and previous config saved to /var/cache/conftool/dbconfig/20230808-085255-ladsgroup.json [08:53:01] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:55:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P50182 and previous config saved to /var/cache/conftool/dbconfig/20230808-085551-ladsgroup.json [08:57:23] (03PS1) 10Giuseppe Lavagetto: cache: expand ip reputation lookup cases [puppet] - 10https://gerrit.wikimedia.org/r/946925 [09:01:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache: expand ip reputation lookup cases [puppet] - 
10https://gerrit.wikimedia.org/r/946925 (owner: 10Giuseppe Lavagetto) [09:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P50183 and previous config saved to /var/cache/conftool/dbconfig/20230808-090801-ladsgroup.json [09:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T342617)', diff saved to https://phabricator.wikimedia.org/P50184 and previous config saved to /var/cache/conftool/dbconfig/20230808-091058-ladsgroup.json [09:11:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [09:11:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:11:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [09:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T342617)', diff saved to https://phabricator.wikimedia.org/P50185 and previous config saved to /var/cache/conftool/dbconfig/20230808-091119-ladsgroup.json [09:15:52] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: fix pathing for knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/945806 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [09:15:59] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:16:13] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:16:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:16:31] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[09:16:51] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:18:19] (03Merged) 10jenkins-bot: rest-gateway: fix pathing for knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/945806 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [09:22:03] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:22:35] RECOVERY - BFD status on cr2-drmrs is OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P50186 and previous config saved to /var/cache/conftool/dbconfig/20230808-092308-ladsgroup.json [09:23:47] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:24:05] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:24:25] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:38:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T342617)', diff saved to https://phabricator.wikimedia.org/P50187 and previous config saved to /var/cache/conftool/dbconfig/20230808-093814-ladsgroup.json [09:38:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:38:18] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:38:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet 
with reason: Maintenance [09:38:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T342617)', diff saved to https://phabricator.wikimedia.org/P50188 and previous config saved to /var/cache/conftool/dbconfig/20230808-093835-ladsgroup.json [09:43:58] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:44:23] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:44:39] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:44:49] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:45:09] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:45:33] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:13] (03PS1) 10Volans: admin: add email to user maryana [puppet] - 10https://gerrit.wikimedia.org/r/946927 (https://phabricator.wikimedia.org/T342797) [09:48:02] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10darthmon_wmde) a:03roti_WMDE [09:50:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin: add email to user maryana [puppet] - 10https://gerrit.wikimedia.org/r/946927 (https://phabricator.wikimedia.org/T342797) (owner: 10Volans) [09:51:43] (03CR) 10Volans: [C: 03+2] admin: add email to user maryana [puppet] - 10https://gerrit.wikimedia.org/r/946927 (https://phabricator.wikimedia.org/T342797) (owner: 10Volans) [09:52:11] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:52:29] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:55:15] (03PS2) 10Hnowlan: trafficserver: route 
wikifeeds requests via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) [09:59:16] (03PS1) 10Hnowlan: trafficserver: route knowledge-gap path via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/946928 (https://phabricator.wikimedia.org/T342213) [10:00:00] !log restart ferm on mirror1001 to pick new IP address for debian syncproxy2 [10:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1000) [10:01:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:02:44] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: enable ldap group sync on active GitLab server [puppet] - 10https://gerrit.wikimedia.org/r/945612 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [10:04:15] (03PS2) 10Filippo Giunchedi: admin: add adri to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946515 (https://phabricator.wikimedia.org/T342969) [10:06:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:08:01] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add adri to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946515 (https://phabricator.wikimedia.org/T342969) (owner: 10Filippo Giunchedi) [10:10:09] (03PS2) 10Hnowlan: rest-gateway: add availability route [deployment-charts] - 10https://gerrit.wikimedia.org/r/945784 
(https://phabricator.wikimedia.org/T339119) [10:11:17] (03Abandoned) 10Hnowlan: rest-gateway: add availability route [deployment-charts] - 10https://gerrit.wikimedia.org/r/945784 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [10:13:42] (03PS1) 10Elukey: admin_ng: increase cpu limits for knative-serving pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/946929 [10:15:27] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [10:15:47] (03PS2) 10Elukey: admin_ng: change cpu limits for knative-serving pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/946929 [10:16:47] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Access will be live in 30min, I'm optimistically resolving the task though please reopen if sth is amiss! [10:21:20] !log update T343294 mitigations [10:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:04] (03CR) 10Filippo Giunchedi: [C: 04-1] "Untested and I don't think it'll work as-is, though definitely +1 on the idea" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (owner: 10Herron) [10:33:19] (03CR) 10JMeybohm: [C: 03+1] Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 (owner: 10Clément Goubert) [10:33:33] (03CR) 10Clément Goubert: [C: 03+2] Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 (owner: 10Clément Goubert) [10:34:35] (03Merged) 10jenkins-bot: Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 (owner: 10Clément Goubert) [10:36:17] !log deploying mw-on-k8s - https://gerrit.wikimedia.org/r/945798 [10:36:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:48] (03PS11) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [10:42:24] (03CR) 10Clément Goubert: mediawiki: set requests based on php.workers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [10:50:21] (03PS1) 10Samtar: IS: Ensure edit recovery is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) [10:55:24] (03CR) 10Tim Starling: [C: 03+1] IS: Ensure edit recovery is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [10:56:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10MatthewVernon) This needs approval by @mark or @joanna_borun (per `data.yaml`), I think. So I've tagged them to approve (or otherwise) this request :) [11:00:53] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:05:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10joanna_borun) Apologies for the delay in completing this task. Our Infrastructure Foundations team is currently in the process of evaluating the global root access policy and po... 
[11:16:37] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:59] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [11:37:07] (03CR) 10Cathal Mooney: [C: 03+2] Allow HTTP return traffic from apt to network devices on TCP 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942639 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [11:38:01] (03Merged) 10jenkins-bot: Allow HTTP return traffic from apt to network devices on TCP 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942639 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [11:41:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:41:37] ^checking [11:42:55] Sharp fall in rps coinciding [11:43:10] It's not worker saturation [11:45:47] jouncebot: nowandnext [11:45:47] No deployments scheduled for the next 0 hour(s) and 14 minute(s) [11:45:47] In 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1200) [11:46:37] claime: was going to deploy a prod no-op (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946932), want me to hold off? 
[11:47:26] TheresNoTime: Yeah, please [11:47:52] okay :) [11:47:56] Great I love when my firefox crashes in the middle of debugging something [11:51:14] I'm not seeing anything particularly flagrant rn [11:51:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:51:21] It's recovering [11:52:43] (SystemdUnitFailed) firing: (3) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:05] My firefox keeps crashing on grafana >_> [11:59:20] I'm not finding anything, TheresNoTime go ahead with your deployment, I'll keep digging in the logs [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1200) [12:01:42] * TheresNoTime will wait for that window ^ to start/finish [12:01:51] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [12:06:27] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:04] (03PS9) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [12:14:37] (03PS5) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 
(https://phabricator.wikimedia.org/T300033) [12:15:39] (03CR) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:17:26] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [12:18:39] (03CR) 10Elukey: [C: 03+2] admin_ng: change cpu limits for knative-serving pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/946929 (owner: 10Elukey) [12:23:00] (03PS10) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [12:24:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:25:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:25:59] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:26:16] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:26:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:28:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:28:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[12:30:01] (03PS1) 10JMeybohm: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) [12:30:19] jouncebot: nowandnext [12:30:19] For the next 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1200) [12:30:19] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1300) [12:30:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:30:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [12:30:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:28] (03PS11) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [12:31:37] (03Merged) 10jenkins-bot: IS: Ensure edit recovery is disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946932 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [12:32:01] (03PS2) 10Stang: newiki: Fix templateeditor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) [12:32:02] !log samtar@deploy1002 Started scap: Backport for 
[[gerrit:946932|IS: Ensure edit recovery is disabled (T342858)]] [12:32:05] T342858: Enable edit recovery on en.wikipedia.beta - https://phabricator.wikimedia.org/T342858 [12:33:19] (03CR) 10David Caro: "Now tested in toolsbeta 😊" [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:34:22] !log samtar@deploy1002 samtar: Backport for [[gerrit:946932|IS: Ensure edit recovery is disabled (T342858)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:34:28] !log samtar@deploy1002 samtar: Continuing with sync [12:34:40] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:35:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:35:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:36:43] (03PS2) 10JMeybohm: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) [12:37:32] (03CR) 10EoghanGaffney: [C: 03+1] vrts: send /var/log/{clamav,freshclam}.log to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/945781 (owner: 10AOkoth) [12:40:20] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:946932|IS: Ensure edit recovery is disabled (T342858)]] (duration: 08m 18s) [12:40:24] T342858: 
Enable edit recovery on en.wikipedia.beta - https://phabricator.wikimedia.org/T342858 [12:44:09] (03CR) 10Samtar: IS-labs: Enable edit recovery on en.wikipedia.beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942419 (https://phabricator.wikimedia.org/T342858) (owner: 10Samtar) [12:53:59] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) [12:56:01] (03CR) 10Ssingh: [C: 03+2] Release 0.9.1-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [12:57:06] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 [12:57:53] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7] (wcqs): 0.3.124 (duration: 00m 46s) [12:58:07] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs2003 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:58:35] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:58:55] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) [12:59:09] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1300) [13:00:06] koi and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:29] I’ll know in a few minutes whether I can deploy or not [13:00:55] I can deploy [13:00:59] o/ [13:01:01] o/ [13:01:52] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) Signed L3 and published my ssh key on https://meta.wikimedia.org/wiki/User:Robert_Timm_(WMDE) Note: I updated the key in this ticket. The key now listed above and on m... [13:02:40] !log reprepro -C main include bookworm-wikimedia anycast-healthchecker_0.9.1-1+wmf12u1_amd64.changes: T342154 [13:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:43] (SystemdUnitFailed) firing: (3) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:59] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [13:03:50] koi: I feel somewhat uneasy changing the group that was added 5(!) 
years ago without any community discussion [13:05:19] I think this is a simple bug fix; they added the group per community consensus, but it seems to have been configured wrongly [13:05:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wcqs2003:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:06:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946645 (https://phabricator.wikimedia.org/T305950) (owner: 10Anzx) [13:06:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946666 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:07:13] (03Merged) 10jenkins-bot: Update piwiki legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946645 (https://phabricator.wikimedia.org/T305950) (owner: 10Anzx) [13:07:16] (03Merged) 10jenkins-bot: Update idwiktionary old vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946666 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:07:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm [13:08:10] !log taavi@deploy1002 Started scap: Backport for [[gerrit:946645|Update piwiki legacy vector logo (T305950)]], [[gerrit:946666|Update idwiktionary old vector logo (T341175)]] [13:08:14] T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175 [13:08:15] T305950: Change logo for pi.wikipedia.org back to default - https://phabricator.wikimedia.org/T305950 [13:09:17] Lucas_WMDE, urbanecm: any thoughts re templateeditor above? 
[13:09:42] !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:946645|Update piwiki legacy vector logo (T305950)]], [[gerrit:946666|Update idwiktionary old vector logo (T341175)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:09:48] Testing [13:10:56] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) a:05roti_WMDE→03None [13:11:31] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:12:08] Taavi tested looks good [13:12:10] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [13:12:23] BFD/BGP alerts expected in drmrs [13:12:40] !log taavi@deploy1002 anzx and taavi: Continuing with sync [13:12:51] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:58] (03CR) 10Ssingh: [C: 03+2] Release 3.99.0~alpha2-2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/945637 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [13:14:28] (03CR) 10Volans: [C: 03+2] Install hosts: fallback to drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/933434 (https://phabricator.wikimedia.org/T340465) (owner: 10Volans) [13:16:00] (03PS1) 10Volans: ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623) [13:17:31] (03CR) 10Ayounsi: [C: 03+1] ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 
(https://phabricator.wikimedia.org/T336623) (owner: 10Volans) [13:17:47] (03CR) 10Volans: [C: 03+2] ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623) (owner: 10Volans) [13:17:54] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host durum6001.drmrs.wmnet with OS bookworm [13:18:21] (03Merged) 10jenkins-bot: ganeti-netbox-sync: do not set platform anymore [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/946948 (https://phabricator.wikimedia.org/T336623) (owner: 10Volans) [13:18:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm [13:18:34] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:18:40] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:18:41] (03CR) 10JMeybohm: [C: 03+2] CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:18:59] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:946645|Update piwiki legacy vector logo (T305950)]], [[gerrit:946666|Update idwiktionary old vector logo (T341175)]] (duration: 10m 48s) [13:19:03] T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175 [13:19:03] T305950: Change logo for pi.wikipedia.org back to default - https://phabricator.wikimedia.org/T305950 [13:19:30] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T343774 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [13:19:40] or TheresNoTime, around? 
I could use a 2O on a config patch [13:19:48] taavi: hi, yes [13:19:54] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Volans) [13:19:55] which? [13:20:10] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/944983/ [13:20:21] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Add support for knams as PoP in tooling and automation - https://phabricator.wikimedia.org/T340465 (10Volans) 05Open→03Resolved All changes required have been merged, if anything else comes up later we can re-open this... [13:20:44] it's updating what looks like a mistake in the original patch. but that original patch was 5 years ago, and as far as I can tell this is the first time anyone noticed it, so I'm a bit uneasy deploying that without any further on-wiki discussions [13:20:44] looking, anything in particular you're concerned about? [13:20:53] ah [13:21:11] !log reprepro -C main include bookworm-wikimedia gdnsd_3.99.0~alpha2-2_amd64.changes: T342154 [13:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:14] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [13:21:48] (03CR) 10Fabfur: [C: 03+1] "LGTM" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [13:22:38] taavi: I'd personally be okay with it, seeing as it was a misconfiguration; the patch author should probably ensure it's announced on-wiki though [13:22:50] (03PS2) 10Ladsgroup: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) [13:22:57] jouncebot: nowandnext [13:22:58] For the next 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1300) [13:22:58] In 2 hour(s) and 
37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1600) [13:23:07] please let me know once you're done! [13:23:16] will do [13:23:17] (or if I can squeeze in a patch :D) [13:23:35] TheresNoTime: sounds good, ^ koi: see TNT above [13:23:51] (03PS3) 10Majavah: newiki: Fix templateeditor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: 10Stang) [13:24:15] just to ensure that if there *are* any objections, they're noticed and can be discussed promptly [13:24:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: 10Stang) [13:24:19] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: 10Stang) [13:24:26] taavi, TheresNoTime, thanks for the msg, will do [13:25:24] (03Merged) 10jenkins-bot: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:26:07] (03Merged) 10jenkins-bot: newiki: Fix templateeditor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) (owner: 10Stang) [13:26:13] (03PS3) 10JMeybohm: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) [13:26:23] !log taavi@deploy1002 Started scap: Backport for [[gerrit:944983|newiki: Fix templateeditor config (T343257)]] [13:26:26] T343257: Page protection not showing in Nepali Wiki - https://phabricator.wikimedia.org/T343257 [13:27:51] !log taavi@deploy1002 taavi and stang: Backport for [[gerrit:944983|newiki: Fix templateeditor config 
(T343257)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:27:58] koi: please test [13:28:06] looking [13:28:35] (it turned out I couldn’t deploy after all, sorry) [13:29:00] taavi, LGTM [13:29:46] syncing [13:29:47] !log taavi@deploy1002 taavi and stang: Continuing with sync [13:36:08] !log set platform to null on all devices and VMs in Netbox - T336623 [13:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] T336623: Netbox device's platform field inconsistency - https://phabricator.wikimedia.org/T336623 [13:36:12] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:944983|newiki: Fix templateeditor config (T343257)]] (duration: 09m 49s) [13:36:15] T343257: Page protection not showing in Nepali Wiki - https://phabricator.wikimedia.org/T343257 [13:36:18] ok, done [13:36:21] Amir1: your turn [13:36:28] awesome [13:36:34] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:36:39] (03PS3) 10Ladsgroup: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) [13:36:48] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [13:37:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [13:37:32] (03Merged) 
10jenkins-bot: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [13:37:46] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:946597|Stop writing to old columns of externallinks in ruwikinews (T342683)]] [13:37:49] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [13:37:58] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [13:39:03] 10SRE, 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (10Jhancock.wm) a:05Papaul→03Jhancock.wm [13:39:10] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:946597|Stop writing to old columns of externallinks in ruwikinews (T342683)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:41:34] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:49] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [13:43:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [13:46:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [13:47:47] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:946597|Stop writing to old columns of 
externallinks in ruwikinews (T342683)]] (duration: 10m 00s) [13:47:50] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [13:49:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:50:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:52:30] SRE, Thumbor, Thumbor Migration, serviceops, Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (hnowlan) [13:52:39] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan) [13:53:28] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (hnowlan) Open→Resolved a:jijiki [13:56:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:56:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [13:56:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50189 and previous config saved to /var/cache/conftool/dbconfig/20230808-135636-ladsgroup.json [13:56:41] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:58:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50190 and previous config saved to
/var/cache/conftool/dbconfig/20230808-135847-ladsgroup.json [14:03:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1082.eqiad.wmnet with OS bullseye [14:03:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:03:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:03:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50192 and previous config saved to /var/cache/conftool/dbconfig/20230808-140331-ladsgroup.json [14:03:38] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:54] <_joe_> !log updated conftool, requestctl on puppetmasters to 2.3.1 to fix bugs with requestctl log [14:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:26] SRE, Thumbor, Thumbor Migration, serviceops, Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (hnowlan) [14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:43] (CR) Herron: [V: +1] thanos-fe: switch to cfssl (3 comments) [puppet] - https://gerrit.wikimedia.org/r/946559 (owner: Herron) [14:17:55] (PS1) Hnowlan: thumbor:
remove thumbor server configuration [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) [14:20:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wcqs2003:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50194 and previous config saved to /var/cache/conftool/dbconfig/20230808-143119-ladsgroup.json [14:31:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:32:59] (PS1) Ssingh: bird::anycast_hc: temporarily remove validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946952 [14:35:15] SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (Jhancock.wm) [14:35:21] (PS2) Eevans: admin: add darthmon to releasers-wikibase [puppet] - https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) [14:37:24] (CR) Bartosz Dziewoński: Revert "logspam.pl: Filter out some persistent noise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński) [14:37:59] SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (Jhancock.wm) @Papaul here are the ports on the switches Fasw-c8a: 16 Fasw-c8b: 16 mgmt: 18 [14:38:06] (PS3) Bartosz Dziewoński: Revert "logspam.pl: Filter out some persistent noise" [puppet] - https://gerrit.wikimedia.org/r/945792
(https://phabricator.wikimedia.org/T323254) [14:38:28] SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (Jhancock.wm) [14:39:37] (CR) Bartosz Dziewoński: "I see you've already scheduled this for deployment, thanks!" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński) [14:39:49] SRE, ops-codfw, decommission-hardware, fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (Jhancock.wm) @Papaul ports are as follows Fasw-c8a: 15 Fasw-c8b: 15 mgmt: 16 [14:40:37] (PS12) David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) [14:43:33] (CR) Ssingh: [C: +2] bird::anycast_hc: temporarily remove validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946952 (owner: Ssingh) [14:46:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50195 and previous config saved to /var/cache/conftool/dbconfig/20230808-144625-ladsgroup.json [14:47:01] (CR) Filippo Giunchedi: "I'll let John comment on the envoy+cfssl bits and multiple ports bits, rest looks good" [puppet] - https://gerrit.wikimedia.org/r/946559 (owner: Herron) [14:49:58] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:10] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:51:14] (PS1) Marostegui: install_server: Do not reimage pc2016 [puppet] - https://gerrit.wikimedia.org/r/946954 [14:52:34] (CR) Marostegui: [C: +2] install_server: Do not reimage
pc2016 [puppet] - https://gerrit.wikimedia.org/r/946954 (owner: Marostegui) [14:52:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bookworm [14:57:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50196 and previous config saved to /var/cache/conftool/dbconfig/20230808-150131-ladsgroup.json [15:11:10] (PS1) Ssingh: bird::anycast_hc: re-add validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946958 [15:13:03] (CR) Ayounsi: [C: +1] bird::anycast_hc: re-add validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946958 (owner: Ssingh) [15:13:25] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/output/946958/42798/durum6002.drmrs.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/946958 (owner: Ssingh) [15:13:27] (CR) Ssingh: [C: +2] bird::anycast_hc: re-add validate_cmd [puppet] - https://gerrit.wikimedia.org/r/946958 (owner: Ssingh) [15:14:44] (CR) David Caro: [C: +2] "Tested in tools too, merging:" [puppet] - https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [15:14:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm [15:15:25] (CR) David Caro:
[C: +2] replica_cnf_api: refactor to use multiple backends (1 comment) [puppet] - https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [15:16:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50197 and previous config saved to /var/cache/conftool/dbconfig/20230808-151637-ladsgroup.json [15:16:41] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:18:38] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:20] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1082.eqiad.wmnet with reason: host reimage [15:19:24] expected [15:19:28] BGP/BFD [15:19:29] drmrs [15:19:46] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:22:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1082.eqiad.wmnet with reason: host reimage [15:23:26] (PS7) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [15:26:23] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:57] (CR) Cathal Mooney: Policy and definition updates for post-migration esams ranges (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: Cathal Mooney) [15:34:07] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:01] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (jijiki) [15:36:13] (CR) Effie Mouzeli: "oh wow" [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan) [15:37:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [15:41:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [15:44:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1082.eqiad.wmnet with OS bullseye [15:48:47] (PS4) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644 [15:48:49] (PS6) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643 [15:48:51] (PS1) Andrew Bogott: Add backy2 class to codfw1dev cinder-backup nodes [puppet] - https://gerrit.wikimedia.org/r/946962 [15:48:53] (PS1) Andrew Bogott: wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 [15:49:40] (CR) Andrew Bogott: [C: +2] Add backy2 class to codfw1dev cinder-backup nodes [puppet] - https://gerrit.wikimedia.org/r/946962 (owner: Andrew Bogott) [15:49:49] SRE, CAS-SSO, Infrastructure-Foundations, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808) >>! In T320390#9076268, @Jelto wrote: > As far as I understand login and registration of new accounts works fine and the co...
[15:51:07] urbanecm: in order to test how IP Masking affects Wikidata-related extensions, it would be useful to have the feature enabled on Beta Wikidata (add it to wmgEnableIPMasking) [15:51:19] do you have any concerns or reservations about that? is there a phab task where we should track this? [15:51:35] (I found https://phabricator.wikimedia.org/T327420 but that sounds like it’s mainly for testing your team’s own features) [15:52:26] also, we don’t necessarily need it enabled permanently – can temporary accounts be turned off again later, or is it better to keep them enabled? [15:52:47] (CR) CI reject: [V: -1] wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 (owner: Andrew Bogott) [15:53:49] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1083.eqiad.wmnet with OS bullseye [15:56:29] (PS5) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644 [15:56:31] (PS7) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643 [15:56:33] (PS2) Andrew Bogott: wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 [15:56:35] (PS1) Andrew Bogott: Add eqiad backy2 config to cloudbackup200[12] [puppet] - https://gerrit.wikimedia.org/r/946964 [15:57:55] (CR) Andrew Bogott: [C: +2] Add eqiad backy2 config to cloudbackup200[12] [puppet] - https://gerrit.wikimedia.org/r/946964 (owner: Andrew Bogott) [16:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1600). [16:00:04] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:12] o/ [16:00:34] (CR) CI reject: [V: -1] wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 (owner: Andrew Bogott) [16:02:06] (PS6) Andrew Bogott: wmcs-backup: move image_info into the ImageBackupsState class [puppet] - https://gerrit.wikimedia.org/r/946644 [16:02:08] (PS8) Andrew Bogott: add volumes functionality to wmcs-backup [puppet] - https://gerrit.wikimedia.org/r/946643 [16:02:10] (PS3) Andrew Bogott: wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 [16:02:12] (PS1) Andrew Bogott: cloudbackup200[12]: remove some spurious config from the last patch [puppet] - https://gerrit.wikimedia.org/r/946965 [16:06:09] (CR) CI reject: [V: -1] wip deploy [puppet] - https://gerrit.wikimedia.org/r/946963 (owner: Andrew Bogott) [16:08:40] (PS5) Volans: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi) [16:10:32] (CR) CI reject: [V: -1] WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi) [16:12:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T342617)', diff saved to https://phabricator.wikimedia.org/P50200 and previous config saved to /var/cache/conftool/dbconfig/20230808-161244-ladsgroup.json [16:12:52] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:13:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm [16:13:36] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:13:54] (PS6) Volans: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681
(https://phabricator.wikimedia.org/T320638) (owner: Ayounsi) [16:14:04] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:15:46] (CR) CI reject: [V: -1] WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi) [16:27:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P50201 and previous config saved to /var/cache/conftool/dbconfig/20230808-162750-ladsgroup.json [16:28:46] jbond/rzl: Is the puppet window happening today? [16:31:12] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:58] SRE, SRE-OnFire, Incident Tooling: Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (lmata) [16:37:14] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P50202 and previous config saved to /var/cache/conftool/dbconfig/20230808-164256-ladsgroup.json [16:42:58] (PS4) Ryan Kemper: elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) [16:43:06] (CR) CI reject: [V: -1] elastic: allow only 1 enwiki_content per host [mediawiki-config] - https://gerrit.wikimedia.org/r/833861 (https://phabricator.wikimedia.org/T343820) (owner: Ryan Kemper) [16:58:00] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on
an-worker1083.eqiad.wmnet with reason: host reimage [16:58:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T342617)', diff saved to https://phabricator.wikimedia.org/P50203 and previous config saved to /var/cache/conftool/dbconfig/20230808-165803-ladsgroup.json [16:58:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:58:06] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:58:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:58:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50204 and previous config saved to /var/cache/conftool/dbconfig/20230808-165824-ladsgroup.json [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1700) [17:01:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1083.eqiad.wmnet with reason: host reimage [17:02:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:18] (PS1) Btullis: Correct the role for the new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/946978 (https://phabricator.wikimedia.org/T343762) [17:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T342617)', diff saved to https://phabricator.wikimedia.org/P50205 and previous config saved to /var/cache/conftool/dbconfig/20230808-170521-ladsgroup.json [17:05:26]
T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:06:56] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42799/console" [puppet] - https://gerrit.wikimedia.org/r/946978 (https://phabricator.wikimedia.org/T343762) (owner: Btullis) [17:07:43] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:30] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:12] (CR) Btullis: [V: +1 C: +2] Correct the role for the new hadoop workers [puppet] - https://gerrit.wikimedia.org/r/946978 (https://phabricator.wikimedia.org/T343762) (owner: Btullis) [17:20:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P50206 and previous config saved to /var/cache/conftool/dbconfig/20230808-172027-ladsgroup.json [17:21:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:22:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:22:04] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs1001.eqiad.wmnet with OS bullseye [17:24:21] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs1002.eqiad.wmnet with OS bullseye [17:24:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host
an-worker1083.eqiad.wmnet with OS bullseye [17:28:54] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:14] (PS1) JMeybohm: deployment_server::general: Globally enable mesh.certmanager [puppet] - https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) [17:30:09] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:23] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:33] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:31:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm [17:31:57] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [17:33:07] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:31] ^ expected, BGP [17:33:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:34:39] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:53] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:35:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P50207 and previous config saved to /var/cache/conftool/dbconfig/20230808-173534-ladsgroup.json [17:35:36] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1001.eqiad.wmnet with reason: host reimage [17:36:29] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:40] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1002.eqiad.wmnet with reason: host reimage [17:37:49] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:38:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1001.eqiad.wmnet with reason: host reimage [17:38:48] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:04] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:41:14] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1002.eqiad.wmnet with reason: host reimage [17:46:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:48:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T342617)', diff saved to https://phabricator.wikimedia.org/P50208 and previous config saved to /var/cache/conftool/dbconfig/20230808-175040-ladsgroup.json [17:50:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [17:50:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:50:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [17:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T342617)', diff saved to 
https://phabricator.wikimedia.org/P50209 and previous config saved to /var/cache/conftool/dbconfig/20230808-175101-ladsgroup.json [17:52:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [17:55:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [17:56:29] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wcqs1001.eqiad.wmnet with OS bullseye [18:00:04] brennen and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T1800). [18:00:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:01:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.916 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:02:51] (no train this week.) [18:03:02] (CR) Andrew Bogott: [C: +2] cloudbackup200[12]: remove some spurious config from the last patch [puppet] - https://gerrit.wikimedia.org/r/946965 (owner: Andrew Bogott) [18:04:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:05:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.421 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:12:10] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host durum4001.ulsfo.wmnet with OS bookworm [18:12:27] (CR) Ahmon Dancy: "Today's puppet window didn't happen so I moved this to Thursday's" [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński) [18:12:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bullseye [18:12:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:18] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:22:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [18:29:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:31:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [18:31:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.853 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:31:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.933 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:33:25] (03CR) 10Brennen Bearnes: Revert "logspam.pl: Filter out some persistent noise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz Dziewoński) [18:38:28] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [18:40:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:40:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:40:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:12] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:16] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [18:43:37] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [18:43:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.941 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.405 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:45:27] !log ryankemper@deploy1002 deploy aborted: 0.3.124 (duration: 01m 50s) [18:45:50] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: whitelist new qlever endpoints [18:46:10] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [18:46:58] ryankemper: are you doing something with wcqs ? That pybal alert is for it [18:47:45] RhinosF1 I am reimaging some wcqs servers. They should be depooled though [18:48:00] there's only 3 hosts so maybe I tripped a threshold [18:48:58] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: whitelist new qlever endpoints (duration: 03m 08s) [18:49:16] inflatador: probably [18:49:16] !log bking@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wcqs,name=eqiad [18:51:01] RhinosF1 eqiad is depooled now, should be OK. Thanks for the heads up! 
[18:51:30] Np :) [18:52:50] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:52:56] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:52:56] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 90, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:02] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 2 [18:54:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:54:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:54:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bullseye [18:55:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:59:58] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:02:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:04:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:04:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:05:36] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 2 (duration: 11m 34s) [19:06:30] !log [WDQS] Depooled `wdqs1006` while it catches up on 7 hours of lag [19:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:40] inflatador: ^ [19:07:35] ACK [19:11:25] (03PS1) 10Kimberly Sarabia: Deploy to CN language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) [19:12:03] (03CR) 10CI reject: [V: 04-1] Deploy to CN language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: 10Kimberly Sarabia) [19:15:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:17:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:19:14] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:19:40] (03PS2) 10Kimberly Sarabia: Deploy to CN language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) [19:20:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.211 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:43] (03CR) 10Physikerwelt: "... now that restbase does not use png images from mathoid anymore, we can also deploy the "new" mathoid version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [19:21:48] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:22:45] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353) [19:23:14] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [19:24:20] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:24:20] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 
(blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:24:22] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:03] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs[1001-1003].eqiad.wmnet with reason: T331300 [19:28:06] T331300: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 [19:28:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs[1001-1003].eqiad.wmnet with reason: T331300 [19:29:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:29:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:31:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:33:44] (03PS2) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353) [19:35:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.994 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:36:04] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:38:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.957 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:45:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:46:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:46:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:49:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:52:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:52:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.510 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T2000). 
nyaa~ [20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] i can deploy today [20:00:20] hi [20:00:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:40] * urbanecm presumes he needs to do the config change first [20:00:56] urbanecm: can you double-check if i got the set operations correctly in my ugly config change? :P [20:00:57] yes [20:01:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:02:02] oh, i guess i can look at https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/3841/consoleFull [20:02:28] there are no wikis where `"wgDiscussionToolsEnablePermalinksBackend": true,` got removed, so it's probably correct [20:03:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:03:58] Is https://lists.wikimedia.org/ wicked slow for everyone or just me? 
[20:04:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.778 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:16] i got myself confused by commit message saying "enable", and the patch adding a bunch of "=> false" rows [20:04:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 6.795 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:27] bd808: given the recovery...try now? :)) [20:05:23] urbanecm: yes, i had to swap the defaults, that's intended (but ugly) [20:05:32] MatmaRex: just to ensure i understand the goal: you want to set the variable to true on all wikis, except for group2 wikis that are in s1 or s7? [20:05:42] yes [20:06:54] * urbanecm does a few of double checks [20:08:46] MatmaRex: most of s7 wikis are finished by now (only viwiki is still running). do we want to enable the variable on all of s7 at once? [20:09:50] urbanecm: if you want to fiddle with running a bunch more maintenance scripts that seem to fail silently all the time, then sure :P [20:10:07] i think i'd rather have fewer crappy scripts and wait longer [20:10:49] makes sense; i thought the persistRevisionThreadItems.php step for those wikis is already completed, but i might be wrong on that :) [20:11:10] yeah, you're right that it's completed on s7 except viwiki [20:11:17] no joy in hoping that mailman got faster when the icinga check resolved. 14 to 25 seconds for `time curl https://lists.wikimedia.org/postorius/lists/` from my local. Less than a second for [[mw:Main Page]] (with some cache busting to keep the CDN from being the difference) [20:11:25] but i don't want to run scripts separately on every wiki [20:11:35] and i also don't want to have two runs on s7 in progress at once [20:11:46] makes sense. let's go as is then. 
[20:11:49] (03CR) 10Urbanecm: [C: 03+2] Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [20:11:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:11:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:01] (it would work, i just don't want to document what is going on when we do that) [20:12:14] (the task is already messed up enough) [20:12:28] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946998 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [20:12:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:37] thank you! 
[20:13:01] not disputing, just was trying to understand the reasoning :) [20:13:08] 10SRE, 10AQS2.0, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) [20:13:18] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 3 T339347 [20:13:23] T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 [20:13:26] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:946998|Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 (T315353)]] [20:13:29] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:13:39] MatmaRex: there's nothing that can be meaningfully tested on mwdebug with this change, right? [20:14:28] !log [WDQS] Lag caught up on `wdqs1006`; repooled -> `ryankemper@wdqs1006:~$ sudo pool` [20:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:44] urbanecm: i could check that Special:FindComment works, one second [20:14:57] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:946998|Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 (T315353)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:15:04] just in time :) [20:15:12] MatmaRex: okay, pausing, let me know how it looks like [20:16:12] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@aa5f5b7]: whitelist new qlever endpoints take 3 T339347 (duration: 02m 54s) [20:16:34] (03CR) 10Krinkle: [C: 03+1] "Feel free to schedule for backport deploy any time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 (owner: 10D3r1ck01) [20:17:05] so, this page: https://pl.wikipedia.org/wiki/Specjalna:Znajdź_komentarz?idorname=c-Jcubic-20230530085900-Matma_Rex-20230529230500 [20:17:16] shows no results in production [20:17:21] but it has results on mwdebug [20:17:26] which looks perfect :) [20:17:42] awesome [20:17:46] proceeding [20:17:46] !log urbanecm@deploy1002 urbanecm and matmarex: Continuing with sync [20:17:51] (the results are outdated, and that will be fixed by the second script run) [20:20:42] in the meantime: i saw T332738 linked from the config patch, and i'm not sure it would actually resolve the issue. even if group2 was useable in the configuration, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/901697 probably wouldn't do what you intended to (group0+group1+group2 has all wikis, so that patch would be functionally equivalent to `'default' => true`) [20:20:43] T332738: 'group2' doesn't work to specify configuration in operations/mediawiki-config InitialiseSettings.php etc. - https://phabricator.wikimedia.org/T332738 [20:21:42] what _would_ help is the ability to combine dblists, and the ability to do `'group2 - s7' => false`, for example, but that doesn't seem to be requested in that task [20:22:20] 10SRE, 10Traffic: Add a reboot action to the Wikimedia DNS restart cookbook - https://phabricator.wikimedia.org/T342182 (10BCornwall) 05In progress→03Resolved [20:22:21] MatmaRex: if i'm missing something obvious, happy to hear 'em, otherwise we can leave it as an async thing for the task :) [20:22:38] urbanecm: so my first approach to today's patch was default=>false, group0=>true, group1=>true, s2=>true, s3=>true, etc. 
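The dblist reasoning in the exchange above can be sketched with plain sets. This is a minimal illustration, not how operations/mediawiki-config actually resolves settings: the wiki names, the list memberships, the `evaluate` and `resolve` helpers, and the "first matching expression wins" rule are all assumptions made up for the example. It shows why group0 + group1 + group2 together behave like `'default' => true`, and what an expression such as `'group2 - s7' => false` would select if combining lists were supported.

```python
# Toy data: every wiki name and list membership here is invented for illustration.
DBLISTS = {
    "group0": {"alphawiki"},
    "group1": {"betawiki", "gammawiki"},
    "group2": {"deltawiki", "epsilonwiki", "zetawiki"},
    "s2": {"betawiki", "deltawiki"},
    "s3": {"alphawiki", "zetawiki"},
    "s7": {"gammawiki", "epsilonwiki"},
}
ALL_WIKIS = set().union(*DBLISTS.values())


def evaluate(expression: str) -> set[str]:
    """Evaluate 'name', 'a & b' (intersection) or 'a - b' (difference), left to right."""
    tokens = expression.split()
    result = set(DBLISTS[tokens[0]])
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "&":
            result &= DBLISTS[name]
        elif op == "-":
            result -= DBLISTS[name]
        else:
            raise ValueError(f"unsupported operator: {op}")
    return result


def resolve(setting: dict[str, bool], wiki: str) -> bool:
    """Hypothetical rule: the first expression containing the wiki wins, else 'default'."""
    for expression, value in setting.items():
        if expression != "default" and wiki in evaluate(expression):
            return value
    return setting["default"]


# The three groups together cover every wiki, so enabling all of them
# is the same as flipping the default.
covered = evaluate("group0") | evaluate("group1") | evaluate("group2")
print(covered == ALL_WIKIS)

# Set difference expresses "group2 wikis that are not on s7" directly.
setting = {"default": True, "group2 - s7": False}
for wiki in sorted(ALL_WIKIS):
    print(wiki, resolve(setting, wiki))
```

With this toy data, only the group2 wikis outside s7 (`deltawiki`, `zetawiki`) come out as `False`; everything else falls through to the default.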
[20:22:44] (03CR) 10Jdlrobson: [C: 03+1] Deploy to CN language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: 10Kimberly Sarabia) [20:22:55] i think that if that worked, it would be the same as what i ended up doing, and it would be a more obvious diff [20:23:03] 10SRE, 10Traffic: dnsbox: Add gdnsd to bird's BindsTo systemd service - https://phabricator.wikimedia.org/T336973 (10BCornwall) 05In progress→03Resolved [20:23:43] and 's2' etc. also looks to be not usable in the config files, like 'group2' [20:24:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:946998|Enable wgDiscussionToolsEnablePermalinksBackend on s2/s3/s5/s6 group2 (T315353)]] (duration: 10m 55s) [20:24:25] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:24:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:24:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:06] i hope that's not my fault, but if it is, just revert [20:25:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:25:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:25:24] very likely isn't [20:26:27] MatmaRex: gotcha. the issue with that approach would actually be combination of groupX and sX lists. 
to give an example: if that was allowed, what would we want `default => false, group1 => true, s7 => false` to do for ie. `frwiktionary`? should it be `false`, because it is a s7 wiki? or should it be true, because it is a group1 wiki? [20:26:42] anyway, we're synced, so starting the scripts now [20:27:15] (03CR) 10BCornwall: [C: 03+2] Rebuild against Bullseye [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [20:27:16] yeah, that would be ambiguous [20:27:58] Hello. I have a last-minute patch I just added to the late backport window. If it's too late, no worries. Let me know [20:28:36] kimberly_sarabia: can you add it to the Wikitech calendar please? i think we can make it too :) [20:28:40] (and hello!) [20:29:13] !log mwmaint1002: `foreachwikiindblist 'group2 & s2' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315353) [20:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:30:01] urbanecm: Thanks! 
I think I added it: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230808T2000 [20:30:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:30:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:30:26] !log mwmaint1002: `foreachwikiindblist 'group2 & s3' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315353) [20:30:27] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: whitelist new qlever endpoints take 4 (forgot git pull) T339347 [20:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:31] !log mwmaint1002: `foreachwikiindblist 'group2 & s5' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315353) [20:30:32] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:30:37] T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 [20:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:05] !log mwmaint1002: `foreachwikiindblist 'group2 & s6' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000` (T315510) [20:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:09] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [20:31:24] MatmaRex: scripts started, my apologies for accidentally pasting wrong task id to the log entries [20:31:57] (03PS3) 10Urbanecm: Deploy to CN language wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: 10Kimberly Sarabia) [20:32:01] (03CR) 10Urbanecm: [C: 03+2] Deploy to CN language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: 10Kimberly Sarabia) [20:32:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:30] (Traffic bill over quota) firing: (2) Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:32:41] (03Merged) 10jenkins-bot: Deploy to CN language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946997 (https://phabricator.wikimedia.org/T335886) (owner: 10Kimberly Sarabia) [20:34:11] urbanecm: thanks, no problem [20:34:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:946997|Deploy to CN language wikis (T335886)]] [20:34:44] T335886: Deploy Vector 2022 as the default desktop skin to next set of wikis - https://phabricator.wikimedia.org/T335886 [20:34:49] kimberly_sarabia: starting with your patch now! :) [20:35:05] urbanecm: thanks [20:36:08] !log urbanecm@deploy1002 ksarabia and urbanecm: Backport for [[gerrit:946997|Deploy to CN language wikis (T335886)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:36:24] kimberly_sarabia: can you test your patch at mwdebug1001 and let me know how it looks like please? [20:36:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:37:30] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:37:50] urbanecm: LGTM [20:37:55] thanks, proceeding [20:37:56] !log urbanecm@deploy1002 ksarabia and urbanecm: Continuing with sync [20:41:12] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: whitelist new qlever endpoints take 4 (forgot git pull) T339347 (duration: 10m 44s) [20:41:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:41:20] T339347: qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 [20:43:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:946997|Deploy to CN language wikis (T335886)]] (duration: 09m 08s) [20:43:52] T335886: Deploy Vector 2022 as the default desktop skin to next set of wikis - https://phabricator.wikimedia.org/T335886 [20:43:56] kimberly_sarabia: should be deployed now [20:43:57] anything else? [20:44:17] urbanecm: tysm! all set [20:44:21] awesome [20:44:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:45:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:49:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:49:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:49:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:30] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:52:41] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 [20:52:59] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 (duration: 00m 18s) [20:53:02] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:53:08] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs1001 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:54:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:54:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:12] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:33] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:57:20] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs1002.eqiad.wmnet with OS bullseye [20:57:30] (Traffic bill over quota) resolved: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:58:12] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 [20:58:29] MatmaRex: are you still here by any chance? [20:58:30] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 (duration: 00m 17s) [20:58:37] i got this error with the script https://www.irccloud.com/pastebin/J1r6BZCe/ [20:58:45] and i don't recall seeing it during the previous runs [20:58:47] any idea? [20:59:09] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) [21:00:26] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) p:05Triage→03Medium a:03Eevans [21:02:09] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) >>! In T343596#9074468, @Seddon wrote: > Approved (direct manager) @thcipriani you are the full list of approvers for group //restricted//; Ok to proceed? 
[21:02:19] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs1003.eqiad.wmnet with OS bullseye [21:02:35] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs,name=eqiad [21:04:30] !log bking@cumin1001 conftool action : set/pooled=no; selector: name=wcqs1003.eqiad.wmnet,service=wcqs [21:06:08] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108 [21:06:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [21:06:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:09:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:09:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:10:26] (03PS1) 10BCornwall: Rebuild against Bookworm, not Bullseye [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/947009 [21:12:42] (03CR) 10Ssingh: [C: 03+1] Rebuild against Bookworm, not Bullseye [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/947009 (owner: 10BCornwall) [21:12:57] filed my ping as T343859 [21:12:57] T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859 [21:13:28] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) [21:13:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:13:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:46] (03CR) 10BCornwall: [C: 03+2] Rebuild against Bookworm, not Bullseye [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/947009 (owner: 10BCornwall) [21:15:33] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage [21:15:40] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) [21:16:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:17:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:18:02] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage [21:21:18] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) Hi @Tsevener, we'll need to verify your key out-of-band (of Phabricator). It might be easiest to add it to a user page (for example, https://www.mediawiki.org/wiki/User:Adee_Ritman... [21:22:10] !log Exported varnish-modules 0.15.0-4 for bookworm-wikimedia (T342154) [21:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:15] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [21:24:19] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10thcipriani) >>! In T343596#9078742, @Eevans wrote: >>>! In T343596#9074468, @Seddon wrote: >> Approved (direct manager) > > @thcipriani you are the full list of approvers for group //restr... 
[21:24:33] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10thcipriani) [21:24:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) a:03Eevans [21:28:02] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:30:32] urbanecm: sorry, was away for a bit. hopefully it's not very common? [21:31:07] it's a new error from new code. it should be impossible :) so i'll ask amir to look at it [21:32:31] looks like it's from the last deployment, not from the maintenance script. a few hundred since yesterday [21:32:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:33:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.567 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.936 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:41] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10Eevans) [21:38:20] (i replied on the task) [21:43:40] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10Eevans) [21:44:34] (03PS3) 10Bartosz Dziewoński: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) [21:44:48] (03CR) 10Bartosz Dziewoński: [C: 04-1] "(waiting for 
deployment train)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) (owner: 10Bartosz Dziewoński) [21:45:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:45:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:11] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs1003.eqiad.wmnet with OS bullseye [21:46:23] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10Eevans) p:05Triage→03Medium a:03Eevans [21:46:43] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:47:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:48:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:48:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:38] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:57:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:58:10] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:38] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:58:50] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs1003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:59:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:59:53] (03PS1) 10Bartosz Dziewoński: Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) [22:00:45] (03CR) 10Bartosz 
Dziewoński: [C: 04-1] "(not scheduled yet)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) (owner: 10Bartosz Dziewoński) [22:03:38] (03PS3) 10Eevans: admin: add darthmon to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) [22:03:53] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 [22:04:11] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177] (wcqs): f1a6177 (duration: 00m 17s) [22:04:40] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs1003 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:04:50] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:07:08] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:08:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:08:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 1.425 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:13:14] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:18:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:18:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:19:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:21:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:23:26] RECOVERY - mailman archives on 
lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.905 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:23:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:24:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.457 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:33:07] !log mwmaint1002: stop persistRevisionThreadItems.php frwiki instance because of T343859 (cc T315510) [22:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:13] T343859: DiscussionTools: LogicException: Database can't find our row and won't let us insert it - https://phabricator.wikimedia.org/T343859 [22:33:14] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [22:33:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:34:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:34:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:38:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:39:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:40:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.825 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:51:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:59:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.529 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:02:02] (03PS1) 10Eevans: admin: add roti to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972) [23:04:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:05:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.408 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:05:54] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10Izno) [23:06:08] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10Izno) [23:14:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:15:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:15:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:18:33] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10dr0ptp4kt) I'm going to open a separate task for the `wmcs-admin` membership, but leave this task 
here open so we can continue to explore the additional matter of being... [23:19:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:20:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:21:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.831 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:28:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:30:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:30:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:31:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:33:02] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10dr0ptp4kt) [23:34:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.973 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:34:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:47] (03PS1) 10Dr0ptp4kt: Add dr0ptp4kt to wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) [23:40:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:44:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:48:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:49:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.597 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:49:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50211 and previous config saved to /var/cache/conftool/dbconfig/20230808-235258-ladsgroup.json [23:53:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617