[00:01:25] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db1227.mgmt.eqiad.wmnet with reboot policy FORCED
[00:26:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1227.mgmt.eqiad.wmnet with reboot policy FORCED
[00:38:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/961521
[00:38:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/961521 (owner: TrainBranchBot)
[00:46:51] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:08] (PS12) Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624)
[00:56:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/961521 (owner: TrainBranchBot)
[01:03:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T343198)', diff saved to https://phabricator.wikimedia.org/P52752 and previous config saved to /var/cache/conftool/dbconfig/20230929-010306-arnaudb.json
[01:03:13] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[01:04:15] (PS1) Robertsky: T347622 setup namespace for 2025, 2026, enable subpages for 2023-2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/961963
[01:18:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P52753 and previous config saved to /var/cache/conftool/dbconfig/20230929-011813-arnaudb.json
[01:33:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P52754 and previous config saved to /var/cache/conftool/dbconfig/20230929-013319-arnaudb.json
[01:48:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T343198)', diff saved to https://phabricator.wikimedia.org/P52755 and previous config saved to /var/cache/conftool/dbconfig/20230929-014825-arnaudb.json
[01:48:32] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[02:24:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:35:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[02:38:46] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[02:46:12] (PS1) Tim Starling: Don't ignore imagemagick exit status [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233)
[02:49:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:50:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[02:51:37] (CR) CI reject: [V: -1] Don't ignore imagemagick exit status [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233) (owner: Tim Starling)
[02:54:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[02:57:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:57:26] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudservices2005-dev.codfw.wmnet with OS bookworm
[03:02:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:03:46] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:44:44] (PS2) Tim Starling: Don't ignore imagemagick exit status [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233)
[03:57:50] SRE, Thumbor, Thumbor Migration, serviceops, Platform Team Workboards (Platform Engineering Reliability): Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (tstarling) Per my comment at T344233#9209303, it may help performance if you allow ImageMagick to use mor...
[04:36:43] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T343198)', diff saved to https://phabricator.wikimedia.org/P52756 and previous config saved to /var/cache/conftool/dbconfig/20230929-044617-arnaudb.json
[04:46:25] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:00:53] (Abandoned) Muehlenhoff: On Bookworm ship ppolicy.schema via Puppet [puppet] - https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff)
[05:01:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P52757 and previous config saved to /var/cache/conftool/dbconfig/20230929-050123-arnaudb.json
[05:11:59] (CR) Muehlenhoff: sshd: Disable keyboard-interactive authentication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956983 (owner: Tim Starling)
[05:16:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P52758 and previous config saved to /var/cache/conftool/dbconfig/20230929-051630-arnaudb.json
[05:19:19] (CR) Muehlenhoff: "Looks good in general, couple of nits inline." [software/bitu] - https://gerrit.wikimedia.org/r/961807 (owner: Slyngshede)
[05:23:37] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:26:04] (ProbeDown) firing: (4) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:26:07] (ProbeDown) firing: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:26:44] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[05:26:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[05:31:04] (ProbeDown) resolved: (5) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:31:07] (ProbeDown) resolved: (7) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:31:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T343198)', diff saved to https://phabricator.wikimedia.org/P52759 and previous config saved to /var/cache/conftool/dbconfig/20230929-053136-arnaudb.json
[05:31:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[05:31:46] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:31:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[05:31:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52760 and previous config saved to /var/cache/conftool/dbconfig/20230929-053158-arnaudb.json
[05:36:44] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[05:36:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230929T0600)
[06:06:40] (CR) Ilias Sarantopoulos: [C: +1] ml-services: increase recommendation-api-ng uwsgi workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[06:13:23] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:22:30] (CR) Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[06:27:25] (PS1) KartikMistry: Update MinT to 2023-09-28-043052-production [deployment-charts] - https://gerrit.wikimedia.org/r/961977 (https://phabricator.wikimedia.org/T343450)
[06:30:20] (PS1) KartikMistry: Update cxserver to 2023-09-28-043003-production [deployment-charts] - https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450)
[06:32:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:33:00] (CR) Elukey: ml-services: increase recommendation-api-ng uwsgi workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[06:33:59] (CR) Elukey: "Np John! Added some DE SRE folks :)" [puppet] - https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[06:37:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:43:32] (PS2) Robertsky: T347622 setup namespace for 2025, 2026, enable subpages for 2023-2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/961963
[06:45:22] A bunch of tools toolforge tools appear to be down :(
[06:46:00] Unsure if this has been reported yet
[06:48:10] ex: https://nppbrowser.toolforge.org/ , https://copypatrol.toolforge.org/en?filter=all&filterPage=Palasan%20River&drafts=0&revision=1173086254 and https://copyvios.toolforge.org/?lang=en&project=wikipedia&title=Sanwan_River&oldid=&action=search&use_engine=1&use_links=1
[06:48:30] Seems like tools.db can't be found
[06:49:44] *tools.labsdb
[06:52:07] (PS1) Slyngshede: Automatic key expiry. [software/bitu] - https://gerrit.wikimedia.org/r/961981 (https://phabricator.wikimedia.org/T347572)
[06:57:19] (CR) Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230929T0700)
[07:00:08] Sohom_Datta: https://phabricator.wikimedia.org/T347665
[07:03:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:20:14] (PS1) Majavah: Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961915
[07:23:48] (PS1) David Caro: Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961916
[07:24:18] (Abandoned) Majavah: Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961915 (owner: Majavah)
[07:24:20] (CR) Majavah: [C: +1] Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961916 (owner: David Caro)
[07:24:29] (CR) David Caro: [C: +2] Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961916 (owner: David Caro)
[07:26:11] (CR) David Caro: [V: +2 C: +2] Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961916 (owner: David Caro)
[07:26:20] (CR) Majavah: [V: +2] Revert "standard_packages: Remove more obsolete packages after buster->bullseye update" [puppet] - https://gerrit.wikimedia.org/r/961916 (owner: David Caro)
[07:30:10] (PS1) Jelto: Revert "gitlab: enable local_gems in devtools test instance" [puppet] - https://gerrit.wikimedia.org/r/961917 (https://phabricator.wikimedia.org/T337570)
[07:39:47] (PS1) Muehlenhoff: Reinstate absented package list bullseye without ISC libraries [puppet] - https://gerrit.wikimedia.org/r/961983
[07:57:08] (PS1) Jgiannelos: wikifeeds: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/961984
[07:59:55] (CR) Jgiannelos: "This was missing from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/961696 after rebasing because of a merge conflict." [deployment-charts] - https://gerrit.wikimedia.org/r/961984 (owner: Jgiannelos)
[08:00:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:07:23] (CR) Mabualruz: [C: +1] wikifeeds: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/961984 (owner: Jgiannelos)
[08:09:10] (CR) Jgiannelos: [C: +2] wikifeeds: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/961984 (owner: Jgiannelos)
[08:09:53] (Merged) jenkins-bot: wikifeeds: Bump helm chart version [deployment-charts] - https://gerrit.wikimedia.org/r/961984 (owner: Jgiannelos)
[08:23:47] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:24:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:25:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:26:06] (PS1) DCausse: rdf-streaming-updater: do not produce side outputs to kafka [deployment-charts] - https://gerrit.wikimedia.org/r/961985 (https://phabricator.wikimedia.org/T347515)
[08:26:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:29:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:31:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:34:23] (CR) Elukey: ml-services: increase recommendation-api-ng uwsgi workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[08:40:21] (PS1) Arturo Borrero Gonzalez: wmcs: instance: install isc-dhcp-client [puppet] - https://gerrit.wikimedia.org/r/961986 (https://phabricator.wikimedia.org/T347665)
[08:41:34] (CR) Muehlenhoff: dnsbox: add ntp.anycast.wmnet as the anycasted NTP address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[08:42:50] (CR) David Caro: [C: +1] "No tests yet though" [puppet] - https://gerrit.wikimedia.org/r/961986 (https://phabricator.wikimedia.org/T347665) (owner: Arturo Borrero Gonzalez)
[08:43:05] (CR) David Caro: [C: +1] wmcs: instance: install isc-dhcp-client (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961986 (https://phabricator.wikimedia.org/T347665) (owner: Arturo Borrero Gonzalez)
[08:46:08] (CR) Muehlenhoff: [C: +1] "Looks good to me, I can't see any reason why this would not work in d-i, let's reimage sretest1001 after deploying this to doublecheck." [puppet] - https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[08:46:22] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] "ran the CI on my laptop, the patch looks good." [puppet] - https://gerrit.wikimedia.org/r/961986 (https://phabricator.wikimedia.org/T347665) (owner: Arturo Borrero Gonzalez)
[08:47:39] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] wmcs: instance: install isc-dhcp-client [puppet] - https://gerrit.wikimedia.org/r/961986 (https://phabricator.wikimedia.org/T347665) (owner: Arturo Borrero Gonzalez)
[08:56:30] (CR) Btullis: "I recommend that we do this removal in two steps." [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[08:58:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:59:38] (CR) Btullis: "We seems to have missed the file modules/statistics/manifests/wmde.pp which correlates to the statistics::wmde class itself." [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[09:00:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:00:17] (CR) Muehlenhoff: [C: -1] cloudelastic: new partman recipe (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: Bking)
[09:02:44] (CR) Jbond: [C: +2] "seems harmless will merge now" [puppet] - https://gerrit.wikimedia.org/r/961906 (owner: Ahmon Dancy)
[09:06:31] (PS22) Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[09:06:33] (PS28) Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[09:06:35] (PS6) Jbond: prometheus: switch to wmflib::get_config [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[09:06:37] (PS3) Jbond: get_clusters: rmove legacy functions [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373)
[09:06:39] (CR) Jbond: "fixed thanks" [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:07:29] (CR) Jbond: prometheus: switch to wmflib::get_config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:07:34] (PS23) Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[09:07:35] (PS29) Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[09:07:38] (PS7) Jbond: prometheus: switch to wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[09:07:39] (PS4) Jbond: get_clusters: rmove legacy functions [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373)
[09:07:41] (CR) CI reject: [V: -1] get_clusters: rmove legacy functions [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[09:08:41] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: sync
[09:08:45] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: sync
[09:09:50] (PS5) Jbond: get_clusters: remove legacy functions [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373)
[09:10:12] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:33:50] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[09:34:12] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[09:41:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm
[09:41:44] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply
[09:42:08] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply
[09:42:25] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply
[09:42:33] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply
[09:43:21] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply
[10:06:06] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:10] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply
[10:09:36] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: sync
[10:09:38] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: sync
[10:09:48] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply
[10:13:50] PROBLEM - Host backup1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:15:42] RECOVERY - Host backup1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:18:07] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[10:18:11] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[10:19:06] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[10:19:09] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[10:19:10] (JobUnavailable) firing: (5) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:34] PROBLEM - Host backup1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:24:12] RECOVERY - Host backup1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[10:25:36] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52763 and previous config saved to /var/cache/conftool/dbconfig/20230929-102812-arnaudb.json
[10:28:15] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bookworm
[10:28:19] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[10:35:36] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply
[10:35:39] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply
[10:37:12] PROBLEM - Host an-worker1086 is DOWN: PING CRITICAL - Packet loss = 100%
[10:40:18] ACKNOWLEDGEMENT - SSH on an-worker1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis T347287 - Not booting with failed disk https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:40:18] ACKNOWLEDGEMENT - Host an-worker1086 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T347287 - Not booting with failed disk
[10:43:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P52764 and previous config saved to /var/cache/conftool/dbconfig/20230929-104318-arnaudb.json
[10:43:42] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: GitLab version upgrade
[10:49:49] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: GitLab version upgrade
[10:50:48] RECOVERY - Host an-worker1086 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[10:52:11] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: GitLab version upgrade
[10:55:20] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:58:18] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: GitLab version upgrade
[10:58:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P52765 and previous config saved to /var/cache/conftool/dbconfig/20230929-105825-arnaudb.json
[10:58:36] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on an-worker1085.eqiad.wmnet with reason: Cold booting to see if it helps with RAID BBU
[10:58:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on an-worker1085.eqiad.wmnet with reason: Cold booting to see if it helps with RAID BBU
[11:03:46] (JobUnavailable) firing: (6) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:09:00] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: GitLab version upgrade
[11:13:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52766 and previous config saved to /var/cache/conftool/dbconfig/20230929-111331-arnaudb.json
[11:13:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[11:13:37] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[11:13:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[11:13:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T343198)', diff saved to https://phabricator.wikimedia.org/P52767 and previous config saved to /var/cache/conftool/dbconfig/20230929-111353-arnaudb.json
[11:18:40] PROBLEM - Host an-worker1085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:26] RECOVERY - Host an-worker1085 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[11:34:43] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1086.eqiad.wmnet
[11:34:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:34:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:35:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-worker1086.eqiad.wmnet
[11:36:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:36:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:37:26] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:40:12] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-text_eqsin
[11:40:34] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-upload_eqsin
[11:40:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1085.eqiad.wmnet
[11:46:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1085.eqiad.wmnet
[11:58:30] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:57] !log adjusting evpn_db BGP export filter lsw1-f3-eqiad
[12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:14] PROBLEM - mailman archives on
lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:24] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: partial-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:04] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 8.688 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:10:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:10:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 5.363 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.013 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:56] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm [12:19:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 8.760 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.615 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/962032 (owner: 10Muehlenhoff) [12:32:51] (03PS5) 10Cathal Mooney: Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - 10https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) [12:34:50] !log jynus@cumin1001 
START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [12:37:59] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [12:39:23] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: GitLab version upgrade [12:41:05] (03PS8) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) [12:45:03] (03PS6) 10Cathal Mooney: Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - 10https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) [12:46:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43785/console" [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:48:46] (03CR) 10Jbond: [V: 03+1] "ready for review pcc changes related to sort order" [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:49:13] (03PS7) 10Cathal Mooney: Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - 10https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) [12:49:53] (03PS9) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) [12:54:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10cmooney) We need to apply this configuration before we start moving servers in codfw to the new switches, while connected to existing row-wide 
vlans.... [12:54:15] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1010.eqiad.wmnet with OS bookworm [12:55:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10cmooney) >>! In T329369#8717915, @cmooney wrote: > If we do have stretched L2 segments across multiple LEAFs, we may wish to also export the /32 and... [12:56:11] (03CR) 10Jbond: [C: 03+1] "Lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/938822 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:58:05] (03CR) 10Ammarpad: noc: Fix various PHP errors that prevent db.php from working locally (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) (owner: 10Krinkle) [13:02:31] (03PS8) 10Cathal Mooney: Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - 10https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) [13:05:16] (03PS2) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [13:08:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/962032 (owner: 10Muehlenhoff) [13:10:35] RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43787/console" [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:13:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 6 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43786/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:17:23] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [13:18:14] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [13:18:29] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [13:19:17] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [13:23:18] (03PS3) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [13:23:28] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [13:23:42] (03CR) 10CI reject: [V: 04-1] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:23:48] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [13:26:24] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "this should be ready to merge per PCC. Will do next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [13:27:16] (03PS4) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [13:32:17] (03PS5) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [13:37:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43791/console" [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:37:56] RECOVERY - Host db2109 #page is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [13:41:07] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Jhancock.wm) 05Open→03Resolved @Marostegui sorry for the delay. I got lucky and the server we just decommissioned had the part I needed to get the server running. the system board had died. I was trying to get the idrac firmw... [13:41:25] (03PS6) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [13:44:16] (03Abandoned) 10Btullis: Change the owner:group of the wikidatawiki entities link [puppet] - 10https://gerrit.wikimedia.org/r/961412 (https://phabricator.wikimedia.org/T346165) (owner: 10Btullis) [13:46:05] (03PS7) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [13:46:34] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) @jijiki I haven't seen the idrac/ssh go down on this server in a while. 
is it ok if I close this ticket? if there's still work that needs to be... [13:49:32] 10ops-codfw, 10DBA: db2109 crashed - https://phabricator.wikimedia.org/T347318 (10Marostegui) Excellent news!! Thank you so much [13:51:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43793/console" [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:52:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:54] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: do not produce side outputs to kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/961985 (https://phabricator.wikimedia.org/T347515) (owner: 10DCausse) [14:07:05] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:07:47] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:07:51] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: add cluster label [puppet] - 10https://gerrit.wikimedia.org/r/962041 [14:11:34] (03CR) 10Herron: [C: 03+1] "LGTM hope this helps!" 
[puppet] - 10https://gerrit.wikimedia.org/r/961510 (https://phabricator.wikimedia.org/T345362) (owner: 10Cwhite) [14:11:54] (03CR) 10Bking: [C: 03+2] flink: upgrade to flink 1.17.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/961877 (https://phabricator.wikimedia.org/T346719) (owner: 10DCausse) [14:12:14] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:19] (03PS1) 10Jgiannelos: push-notifications: Change logging level [deployment-charts] - 10https://gerrit.wikimedia.org/r/962042 [14:12:38] (03CR) 10Bking: [V: 03+2 C: 03+2] flink: upgrade to flink 1.17.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/961877 (https://phabricator.wikimedia.org/T346719) (owner: 10DCausse) [14:14:21] (03CR) 10Jbond: [V: 03+1] "ready for review:" [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [14:20:57] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:48] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-text_eqsin [14:23:49] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-upload_eqsin [14:27:22] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-text_eqiad [14:27:24] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-upload_eqiad [14:27:46] 10SRE, 10Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (10Fabfur) eqsin done [14:27:50] (03PS2) 10Jbond: redis::slave: switch to puppetdb_query [puppet] - 
10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) [14:28:56] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10Jclark-ctr) @taavi Moved Server. cable id# 5310 port 46 [14:30:30] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10Jclark-ctr) [14:32:54] PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: load-dcatap-weekly.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:32] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:50] (03PS3) 10Jbond: redis::slave: switch to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) [14:33:58] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10Jclark-ctr) [14:33:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:34:08] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:18] 10SRE, 10ops-eqiad, 
10cloud-services-team (Hardware): cloudservices1004: decomission - https://phabricator.wikimedia.org/T346033 (10Jclark-ctr) 05Open→03Resolved [14:34:20] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10Jclark-ctr) [14:34:27] (SystemdUnitFailed) firing: (3) load-dcatap-weekly.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43798/console" [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [14:37:13] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - 10https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963) [14:37:15] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) [14:37:48] (03CR) 10CI reject: [V: 04-1] cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [14:38:42] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:45] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:40:04] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) 
[14:40:36] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:40:40] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [14:41:02] SRE, ops-eqiad, decommission-hardware: decommission wdqs100[3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T346699 (Jclark-ctr) [14:41:03] (PS2) Effie Mouzeli: push-notifications: Change logging level [deployment-charts] - https://gerrit.wikimedia.org/r/962042 (owner: Jgiannelos) [14:41:12] SRE, ops-eqiad, decommission-hardware: decommission wdqs100[3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T346699 (Jclark-ctr) Open→Resolved [14:41:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/961523 [14:41:57] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/961523 (owner: TrainBranchBot) [14:43:42] (PS1) Effie Mouzeli: push-notifications: reduce replica numbers to 4 [deployment-charts] - https://gerrit.wikimedia.org/r/962046 [14:45:10] (CR) Effie Mouzeli: [C: +2] push-notifications: Change logging level [deployment-charts] - https://gerrit.wikimedia.org/r/962042 (owner: Jgiannelos) [14:45:59] (Merged) jenkins-bot: push-notifications: Change logging level [deployment-charts] - https://gerrit.wikimedia.org/r/962042 (owner: Jgiannelos) [14:47:43] (CR) Jgiannelos: [C: +1] push-notifications: reduce replica numbers to 4 [deployment-charts] - https://gerrit.wikimedia.org/r/962046 (owner: Effie Mouzeli) [14:49:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:05] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install
an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (Jclark-ctr) [14:50:10] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:23] (PS3) Slyngshede: Implement Codex design, from design team. [software/bitu] - https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) [14:51:31] (CR) Effie Mouzeli: [C: +2] push-notifications: reduce replica numbers to 4 [deployment-charts] - https://gerrit.wikimedia.org/r/962046 (owner: Effie Mouzeli) [14:52:19] (Merged) jenkins-bot: push-notifications: reduce replica numbers to 4 [deployment-charts] - https://gerrit.wikimedia.org/r/962046 (owner: Effie Mouzeli) [14:53:20] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase2027.codfw.wmnet [14:53:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase2027.codfw.wmnet [14:53:50] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [14:53:55] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans) [14:54:01] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [14:54:13] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans) [14:54:22] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [14:54:29] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [14:54:47] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [14:54:59] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [14:55:19] !log vriley@cumin1001 START
- Cookbook sre.hosts.provision for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [14:55:24] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [14:55:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/961523 (owner: 10TrainBranchBot) [14:57:57] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10taavi) a:05Jclark-ctr→03taavi [14:58:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST envoyfilters) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST envoyfilters) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:46] (JobUnavailable) firing: (6) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:21] (03PS1) 10Eevans: install_server: utilize reuse recipe for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/962048 (https://phabricator.wikimedia.org/T331713) [15:07:32] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1020.eqiad.wmnet [15:08:31] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1020.eqiad.wmnet [15:11:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on 
k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:05] (03PS1) 10AikoChou: ml-services: update revertrisk-language-agnostic model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) [15:14:08] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1100.mgmt.eqiad.wmnet with reboot policy FORCED [15:16:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:16:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:33] (03PS1) 10DCausse: rdf-streaming-updater: set allowNonRestoredState [deployment-charts] - 10https://gerrit.wikimedia.org/r/962051 (https://phabricator.wikimedia.org/T347515) [15:18:23] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:19:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1020.eqiad.wmnet [15:19:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1020.eqiad.wmnet [15:20:42] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1021.eqiad.wmnet [15:21:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1021.eqiad.wmnet [15:21:44] (03PS1) 10Jgiannelos: push-notifications: Make queueing non-verbose by 
default [deployment-charts] - 10https://gerrit.wikimedia.org/r/962053 [15:22:04] (03CR) 10Ebernhardson: [C: 03+2] rdf-streaming-updater: set allowNonRestoredState [deployment-charts] - 10https://gerrit.wikimedia.org/r/962051 (https://phabricator.wikimedia.org/T347515) (owner: 10DCausse) [15:22:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:22:53] (03Merged) 10jenkins-bot: rdf-streaming-updater: set allowNonRestoredState [deployment-charts] - 10https://gerrit.wikimedia.org/r/962051 (https://phabricator.wikimedia.org/T347515) (owner: 10DCausse) [15:23:43] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:23:51] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:24:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-master1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004.eqiad.wmne'] [15:25:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003.eqiad.wmne'] [15:27:08] (03PS1) 10Ilias Sarantopoulos: ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [15:28:51] (03CR) 10CI reject: [V: 04-1] ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:30:35] (03PS2) 10Effie Mouzeli: push-notifications: Make queueing non-verbose by default [deployment-charts] - 
https://gerrit.wikimedia.org/r/962053 (https://phabricator.wikimedia.org/T347717) (owner: Jgiannelos)
[15:32:56] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:33:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1021.eqiad.wmnet
[15:33:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1021.eqiad.wmnet
[15:33:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:34:04] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1004.eqiad.wmne']
[15:34:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1003.eqiad.wmne']
[15:34:53] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1028.eqiad.wmnet
[15:35:06] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:35:10] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:35:25] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1028.eqiad.wmnet
[15:42:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:42:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:43:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:43:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:47:59] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1028.eqiad.wmnet
[15:48:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1028.eqiad.wmnet
[15:48:50] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[15:49:02] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:49:17] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:49:31] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:49:37] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet
[15:51:52] (PS8) Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T336963)
[15:51:54] (PS5) Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963)
[15:51:56] (PS3) Arturo Borrero Gonzalez: cloudgw: codfw: add cloud-private subnet support [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963)
[15:54:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:54:09] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on restbase1031.eqiad.wmnet with reason: Upgrading BIOS
[15:54:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on restbase1031.eqiad.wmnet with reason: Upgrading BIOS
[15:54:38] (PS4) Arturo Borrero Gonzalez: cloudgw: codfw: add cloud-private subnet support [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963)
[15:54:59] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) (owner: Arturo Borrero Gonzalez)
[15:55:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:56:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:57:28] (PS6) Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963)
[15:57:30] (PS5) Arturo Borrero Gonzalez: cloudgw: codfw: add cloud-private subnet support [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963)
[15:57:40] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T336963) (owner: Arturo Borrero Gonzalez)
[15:57:54] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T336963) (owner: Arturo Borrero Gonzalez)
[15:59:22] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:01:03] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST clusterdomainclaims) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:13] (PS9) Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687)
[16:01:17] (PS7) Arturo Borrero Gonzalez: cloudgw: refactor vlan interfaces to use interface::tagged [puppet] - https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687)
[16:01:20] (PS6) Arturo Borrero Gonzalez: cloudgw: codfw: add cloud-private subnet support [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T347687)
[16:01:29] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922106 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[16:02:00] (PS13) Bking: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[16:03:26] (CR) Effie Mouzeli: [C: +2] Update tegola-vector-tiles to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/961077 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli)
[16:03:41] (CR) CI reject: [V: -1] cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[16:06:04] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST clusterdomainclaims) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:06:10] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[16:07:44] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[16:08:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T343198)', diff saved to https://phabricator.wikimedia.org/P52770 and previous config saved to /var/cache/conftool/dbconfig/20230929-160807-arnaudb.json
[16:08:17] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[16:08:39] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw codfw - aborrero@cumin1001"
[16:08:44] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1031.eqiad.wmnet
[16:08:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1031.eqiad.wmnet
[16:09:08] (CR) Cathal Mooney: "Overall lgtm, some comments/questions below (some just out of curiosity)" [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[16:11:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw codfw - aborrero@cumin1001"
[16:11:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:11:58] (CR) Cathal Mooney: "one mroe comment" [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[16:13:28] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[16:13:49] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[16:14:19] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[16:14:27] (SystemdUnitFailed) firing: (3) load-dcatap-weekly.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:24] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[16:15:50] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[16:16:20] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[16:19:10] (JobUnavailable) firing: (6) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:19:27] (SystemdUnitFailed) firing: (3) load-dcatap-weekly.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:32] (PS1) Hnowlan: thumbor: add imagemagick policy file [deployment-charts] - https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445)
[16:22:10] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:22:46] !log bking@wdqs1016 depooling to compress JNL file T347605
[16:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:55] T347605: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605
[16:23:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P52771 and previous config saved to /var/cache/conftool/dbconfig/20230929-162313-arnaudb.json
[16:26:48] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1016 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:27:08] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:27:14] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:27:20] PROBLEM - Query Service HTTP Port on wdqs1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:27:40] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:27:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1016.eqiad.wmnet with reason: jnl compression
[16:28:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1016.eqiad.wmnet with reason: jnl compression
[16:38:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P52772 and previous config saved to /var/cache/conftool/dbconfig/20230929-163819-arnaudb.json
[16:38:32] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (VRiley-WMF) cp1100 - A 4. U 24. CableID 230304500204. port 44 cp1101 - A 4. U 29. CableID 230304500200. port 46 cp1102 - A 7. U 24 CableID 5025. port 34 cp1103 - A 7. U 29 CableID 20220...
[16:53:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T343198)', diff saved to https://phabricator.wikimedia.org/P52773 and previous config saved to /var/cache/conftool/dbconfig/20230929-165326-arnaudb.json
[16:53:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[16:53:32] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[16:53:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[16:53:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T343198)', diff saved to https://phabricator.wikimedia.org/P52774 and previous config saved to /var/cache/conftool/dbconfig/20230929-165347-arnaudb.json
[17:06:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-text_eqiad
[17:08:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-upload_eqiad
[17:08:35] (CR) Btullis: C:bigtop::hadoop switch to new topology script. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/954911 (owner: Slyngshede)
[17:11:41] SRE, Traffic, GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (LSobanski) As long as the parent ticket is more or less up to date, that works for me, thanks for asking. As for archiving the repositories on Gerrit, this requir...
[17:18:17] (PS4) AOkoth: clamav: disable ConcurrentDatabaseReload [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450)
[17:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:24:32] (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/961689/43799/" [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: AOkoth)
[17:24:40] (CR) AOkoth: [C: +2] clamav: disable ConcurrentDatabaseReload [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: AOkoth)
[17:49:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:04] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1022.eqiad.wmnet
[17:53:14] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[17:53:46] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1022.eqiad.wmnet
[17:54:25] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-text_drmrs
[17:54:32] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-upload_drmrs
[17:54:56] SRE, Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (Fabfur) eqiad done
[18:00:44] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:11:05] (CR) Herron: rsyslog: update code to support cfssl and puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:11:42] (CR) Herron: [C: +1] rsyslog: update code to support cfssl and puppet [puppet] - https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:11:56] (CR) Herron: [C: +1] syslog::centralserver: switch to cfssl [puppet] - https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) (owner: Jbond)
[18:13:54] PROBLEM - Host restbase1022 is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:16] RECOVERY - Host restbase1022 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[18:17:59] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[18:19:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1022.eqiad.wmnet
[18:19:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1022.eqiad.wmnet
[18:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:42:46] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1023.eqiad.wmnet
[18:43:19] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1023.eqiad.wmnet
[18:54:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1023.eqiad.wmnet
[18:55:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1023.eqiad.wmnet
[19:00:14] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[19:36:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004.eqiad.wmne']
[19:36:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003.eqiad.wmne']
[19:36:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004.eqiad.wmne']
[19:36:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003.eqiad.wmne']
[19:37:34] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (Jclark-ctr)
[19:46:47] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: jnl compression
[19:46:49] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: jnl compression
[19:52:32] SRE, Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (Fabfur)
[20:00:30] SRE, observability, SRE Observability (FY2023/2024-Q2): Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (dr0ptp4kt) Thanks @lmata - I don't think I have permissions to the private Puppet repo to do that - I believe this is in modules/secret/secrets/nagios/contacts.cfg in the...
[20:03:28] (PS1) Varnent: Add Endowment namespace and include in Visual Editor and search settings. [mediawiki-config] - https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762)
[20:10:16] (PS3) Ejegg: Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379)
[20:19:44] (CR) Ejegg: "Thanks for the feedback, SBassett" [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: Ejegg)
[20:23:46] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:34:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-upload_drmrs
[20:34:43] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:35:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-text_drmrs
[20:37:50] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:38:10] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:39:22] (CR) SBassett: [C: +1] "Soft +1 for now, as a vendor privacy review was already conducted. It would still be nice to perform some sort of application security re" [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: Ejegg)
[20:59:59] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-upload_esams
[21:00:00] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-text_esams
[21:00:37] SRE, Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (Fabfur)
[21:05:22] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:31:46] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:58:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T343198)', diff saved to https://phabricator.wikimedia.org/P52776 and previous config saved to /var/cache/conftool/dbconfig/20230929-215849-arnaudb.json
[21:58:56] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:02:52] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:13:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P52777 and previous config saved to /var/cache/conftool/dbconfig/20230929-221356-arnaudb.json
[22:29:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P52778 and previous config saved to /var/cache/conftool/dbconfig/20230929-222902-arnaudb.json
[22:44:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T343198)', diff saved to https://phabricator.wikimedia.org/P52779 and previous config saved to /var/cache/conftool/dbconfig/20230929-224409-arnaudb.json
[22:44:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[22:44:16] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:44:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[23:40:47] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-upload_esams
[23:41:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-text_esams
[23:42:11] SRE, Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (Fabfur)
[23:45:47] SRE, Traffic, Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) Stalled→Resolved
[23:47:04] SRE, Traffic: Implement VTC tests for PURGE requests - https://phabricator.wikimedia.org/T347297 (Fabfur) Open→Resolved
[23:47:15] SRE, Traffic: Implement VTC tests for PURGE requests - https://phabricator.wikimedia.org/T347297 (Fabfur) All necessary tests implemented