[00:05:02] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:15:20] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:32] (03CR) 10Eevans: Decommissioning restbase-dev cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) (owner: 10Eevans) [00:22:33] (03CR) 10Cwhite: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [00:23:04] (03CR) 10Cwhite: [C: 03+2] logstash: heavily restrict mediawiki http accesslog during initial onboarding [puppet] - 10https://gerrit.wikimedia.org/r/867630 (https://phabricator.wikimedia.org/T324439) (owner: 10Cwhite) [00:27:12] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [00:30:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:44] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [00:40:21] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [00:57:55] (03CR) 10Cwhite: [C: 03+1] "The drop filter is now in place." 
[puppet] - 10https://gerrit.wikimedia.org/r/867136 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [01:17:31] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:54] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:24] PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100% [01:40:45] (JobUnavailable) firing: (9) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:48] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:48] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [01:55:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:02:36] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [02:10:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:00] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:08] PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2817 MB (3% inode=97%): /tmp 2817 MB (3% inode=97%): /var/tmp 2817 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [02:20:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:20] (03PS2) 10Eevans: Decommissioning restbase-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) [03:10:32] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:16] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors
[03:26:06] PROBLEM - dump of db_inventory in codfw on backupmon1001 is CRITICAL: dump for db_inventory at codfw (db2093, 2022-12-20 03:07:57): 93 KiB is less than 300 KiB https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:26:24] (PS3) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332
[03:33:41] (CR) CI reject: [V: -1] Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332 (owner: Andrew Bogott)
[03:44:24] PROBLEM - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: dump for db_inventory at eqiad (db1115, 2022-12-20 03:19:12): 93 KiB is less than 300 KiB https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:47:30] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) (owner: Eevans)
[03:51:28] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:28] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[04:02:26] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:00] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[04:19:40] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:48] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:11] (PS1) KartikMistry: Content Translation: Move ttwiki out of Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/869347 (https://phabricator.wikimedia.org/T319177)
[05:13:13] SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (Legoktm) >>! In T312067#8478670, @jcrespo wrote: > My take: > {F35876366} Nailed it. Please let me know where I need to sign up for a t-shirt and stickers.
[05:22:27] SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (Legoktm) | Front | Back | | {F35879165} | {F35879167} | (payments accepted in FTT)
[06:19:16] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[06:20:28] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:58] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:24] RECOVERY - High average POST latency 
for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [06:26:35] (03PS1) 10Marostegui: misc_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869352 (https://phabricator.wikimedia.org/T325154) [06:29:08] (03CR) 10Marostegui: [C: 03+2] misc_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869352 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [06:32:14] (03PS1) 10Marostegui: site.pp: Add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/869353 (https://phabricator.wikimedia.org/T325210) [06:33:55] (03CR) 10Marostegui: [C: 03+2] site.pp: Add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/869353 (https://phabricator.wikimedia.org/T325210) (owner: 10Marostegui) [06:45:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow sending mail to the mailservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/869234 (owner: 10Giuseppe Lavagetto) [06:50:26] (03Merged) 10jenkins-bot: mediawiki: allow sending mail to the mailservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/869234 (owner: 10Giuseppe Lavagetto) [07:38:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:43:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:49:38] (03PS1) 10Marostegui: site.pp: Set db1207-db122 to spare [puppet] - 10https://gerrit.wikimedia.org/r/869578 (https://phabricator.wikimedia.org/T325209) [07:50:56] (03CR) 10Marostegui: [C: 03+2] site.pp: Set db1207-db122 to spare [puppet] - 10https://gerrit.wikimedia.org/r/869578 (https://phabricator.wikimedia.org/T325209) (owner: 10Marostegui) [07:53:44] (03PS1) 10Marostegui: dbstore_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869579 (https://phabricator.wikimedia.org/T325154) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221220T0800) [08:09:04] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [08:09:26] (03PS1) 10Gehel: ApiFeatureUsage logstash servers are owned by Observability. [puppet] - 10https://gerrit.wikimedia.org/r/869582 [08:10:38] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [08:11:14] (03PS2) 10Muehlenhoff: Make ganeti4007 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/869220 (https://phabricator.wikimedia.org/T317247) [08:13:38] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10Performance-Team (Radar): Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Anerka) Other samples; From a non-admin account at 15 December around 06.00 UTC; without login : "cachereport":{"origin":"mw1... [08:16:08] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti4007 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/869220 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff) [08:23:58] (03CR) 10Slyngshede: [C: 03+2] C:ldap::management deploy updated modify-mfa tool. 
[puppet] - 10https://gerrit.wikimedia.org/r/867568 (owner: 10Slyngshede) [08:28:45] (03CR) 10Slyngshede: [C: 03+2] C:ldap::management Update add-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/867594 (owner: 10Slyngshede) [08:29:49] PROBLEM - Check systemd state on ganeti4007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:19] (03CR) 10David Caro: metricsinfra: add optional basic auth to project_proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [08:32:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [08:38:11] RECOVERY - Check systemd state on ganeti4007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [08:40:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [08:40:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [08:42:33] (03CR) 10Volans: "question inline" [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (owner: 10Andrew Bogott) [08:45:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [08:45:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [08:52:38] (03PS11) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 (https://phabricator.wikimedia.org/T325617) [08:52:51] (03PS12) 10David Caro: metricsinfra: 
don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 (https://phabricator.wikimedia.org/T325617) [08:53:51] (03CR) 10Volans: [C: 04-1] "Wouldn't work as is, see comments inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/869334 (owner: 10Ryan Kemper) [08:58:13] (03CR) 10Muehlenhoff: [C: 03+2] elasticsearch: Enable profile::auto_restarts::service for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:00:51] (03PS1) 10Giuseppe Lavagetto: mediawiki: also add the v6 IPs for wiki-mail [deployment-charts] - 10https://gerrit.wikimedia.org/r/869704 [09:12:04] (03PS1) 10Muehlenhoff: Remove LDAP access for scfc [puppet] - 10https://gerrit.wikimedia.org/r/869705 [09:12:30] 10SRE, 10Acme-chief, 10Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (10Aklapper) @Vgutierrez: Could you please answer the last comment? Thanks in advance! [09:13:22] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for scfc [puppet] - 10https://gerrit.wikimedia.org/r/869705 (owner: 10Muehlenhoff) [09:15:27] (03CR) 10Muehlenhoff: [C: 03+2] vrts: Enable vrts profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868663 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:16:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [09:18:09] 10SRE, 10observability: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10akosiaris) Actually, per 28f86674054b7 #observability has taken over arclamp from #serviceops. It aligns more closely with their area of expertise and focus than #serviceops and we ar... 
[09:23:37] (03CR) 10Jelto: P:spicerack: add python-gitlab package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [09:24:53] 10SRE, 10observability: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10MoritzMuehlenhoff) >>! In T319434#8480708, @akosiaris wrote: > Actually, per 28f86674054b7 #observability has taken over arclamp from #serviceops. It aligns more closely with their ar... [09:25:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [09:30:27] (03CR) 10Elukey: kafka_config: set a real string for default api_version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [09:31:25] (03CR) 10Elukey: [C: 03+2] ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868702 (https://phabricator.wikimedia.org/T325199) (owner: 10AikoChou) [09:35:35] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:11] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [09:43:19] (03CR) 10Elukey: [C: 03+1] cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:45:06] (03CR) 10Elukey: [C: 03+1] "The fixme may be something that we forget when upgrading, let's write it down somewhere so it doesn't happen." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/868680 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [09:47:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [09:48:01] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [09:49:50] (03CR) 10Marostegui: [C: 03+2] dbstore_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869579 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [09:51:33] (03CR) 10Jaime Nuche: [C: 04-1] admin: create new group deployment-jenkins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [09:51:49] 10SRE, 10Acme-chief, 10Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (10Vgutierrez) 05Open→03Resolved Oops.. yeah, let me close this. Cheers! 
[09:52:36] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: also add the v6 IPs for wiki-mail [deployment-charts] - https://gerrit.wikimedia.org/r/869704 (owner: Giuseppe Lavagetto)
[09:57:17] (CR) JMeybohm: cert-manager: Disable seccomProfile for k8s 1.16 compatibility (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/868680 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[09:57:59] (CR) JMeybohm: [V: +2 C: +2] Update cert-manager to 1.10.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) (owner: JMeybohm)
[09:59:27] (Merged) jenkins-bot: mediawiki: also add the v6 IPs for wiki-mail [deployment-charts] - https://gerrit.wikimedia.org/r/869704 (owner: Giuseppe Lavagetto)
[09:59:37] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/869278 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:03:27] (PS1) JMeybohm: cert-manager: Reference parent image for build directly [docker-images/production-images] - https://gerrit.wikimedia.org/r/869727 (https://phabricator.wikimedia.org/T325292)
[10:03:44] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (MoritzMuehlenhoff) ganeti4007 has been added to the ulsfo Ganeti cluster.
[10:03:51] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/868680 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [10:03:58] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [10:04:44] (03CR) 10CI reject: [V: 04-1] cert-manager: Update to 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868622 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [10:04:46] (03CR) 10CI reject: [V: 04-1] cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/868680 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [10:04:49] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cert-manager: Reference parent image for build directly [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/869727 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [10:05:43] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:05:51] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:06:50] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:06:56] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:16:40] !log rebalance ganeti cluster in ulsfo after adding new node and decom of the old hardware T317247 [10:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:45] T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 [10:19:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet [10:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet [10:25:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet [10:28:51] (03PS3) 10Jbond: kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739 [10:29:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet [10:29:19] (03CR) 10Jbond: kafka_config: set a real string for default api_version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [10:29:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet [10:32:32] (03CR) 10Jbond: [C: 03+2] puppet_compiler: manage symlink to output dir [puppet] - 10https://gerrit.wikimedia.org/r/869280 (owner: 10Jbond) [10:34:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet [10:35:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet [10:42:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet [10:42:33] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [10:51:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host matomo1002.eqiad.wmnet [10:53:44] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:22] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1002.eqiad.wmnet [10:58:19] (03CR) 10Jbond: Add vendored module bodgit/puppet-postfix (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway) [10:59:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org [10:59:50] (03CR) 10Jbond: [C: 03+2] monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:03:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org [11:05:30] PROBLEM - Check systemd state on pc2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:50] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:28] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.08133 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:06:32] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:36] me looking [11:06:46] PROBLEM - Check systemd state on db1184 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:48] PROBLEM - Check systemd state on db2163 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:02] RECOVERY - Check systemd state on pc2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:06] this will be related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/868471 [11:07:10] rolling back [11:07:24] PROBLEM - Check systemd state on db1127 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:27] (03PS1) 10Jbond: Revert "monitoring: update monitoring files to dynamically discover config" [puppet] - 10https://gerrit.wikimedia.org/r/869715 [11:07:28] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:30] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:38] PROBLEM - Check systemd state on db1123 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:50] PROBLEM - Check systemd state on es2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:08] PROBLEM - Check systemd state on db2135 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:14] PROBLEM - Check systemd state on db2125 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:18] PROBLEM - Check systemd state on es1023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:24] PROBLEM - Check systemd state on db1180 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:58] PROBLEM - Check systemd state on es2020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:12] PROBLEM - Check systemd state on db1181 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:32] PROBLEM - Check systemd state on pc1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:40] PROBLEM - Check systemd state on db2136 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:44] PROBLEM - Check systemd state on db2122 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:56] PROBLEM - Check systemd state on db1175 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:56] PROBLEM - Check systemd state on es2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "monitoring: update monitoring files to dynamically discover config" [puppet] - 10https://gerrit.wikimedia.org/r/869715 (owner: 10Jbond) [11:10:12] PROBLEM - Check systemd state on db1206 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:14] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:22] PROBLEM - Check systemd state on es2022 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:22] PROBLEM - Check systemd state on db1143 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:28] PROBLEM - Check systemd state on es1029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:30] PROBLEM - Check systemd state on db1157 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:30] PROBLEM - Check systemd state on db1201 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:32] PROBLEM - Check systemd state on es1027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:54] PROBLEM - Check systemd state on db2105 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:54] PROBLEM - Check systemd state on db1197 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:58] PROBLEM - Check systemd state on es2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:18] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:20] PROBLEM - Check systemd state on db2116 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:20] PROBLEM - Check systemd state on db2147 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:24] PROBLEM - Check systemd state on db1169 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:30] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:42] PROBLEM - Check systemd state on pc2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:46] PROBLEM - Check systemd state on db2152 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:52] PROBLEM - Check systemd state on db2131 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:52] PROBLEM - Check systemd state on db2130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:04] RECOVERY - Check systemd state on es1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:06] RECOVERY - Check systemd state on db1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:06] RECOVERY - Check systemd state on es1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:24] PROBLEM - Check systemd state on db1109 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:30] PROBLEM - Check systemd state on es1025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:36] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:44] PROBLEM - Check systemd state on db1161 is CRITICAL: CRITICAL - degraded: The following units 
failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:48] PROBLEM - Check systemd state on db2124 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:04] RECOVERY - Check systemd state on db1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:04] RECOVERY - Check systemd state on db1175 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:28] PROBLEM - Check systemd state on db1190 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:29] (03PS1) 10Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) [11:14:00] PROBLEM - Check systemd state on db1159 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:02] RECOVERY - Check systemd state on db2105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:18] PROBLEM - Check systemd state on db1104 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:24] PROBLEM - Check systemd state on db2134 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:46] PROBLEM - Check systemd state on db1122 is CRITICAL: CRITICAL - degraded: The 
following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:48] PROBLEM - Check systemd state on db1126 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:48] PROBLEM - Check systemd state on db1151 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:50] PROBLEM - Check systemd state on db2182 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:04] RECOVERY - Check systemd state on db1190 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:28] PROBLEM - Check systemd state on db2132 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:45] (03CR) 10CI reject: [V: 04-1] monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:16:04] RECOVERY - Check systemd state on db2116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:04] RECOVERY - Check systemd state on db2122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:15] !log installing apache2 security updates on Buster [11:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:48] PROBLEM - Check systemd state on es1029 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:50] PROBLEM - Check systemd state on db1188 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:50] PROBLEM - Check systemd state on db1127 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:52] PROBLEM - Check systemd state on es1027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:04] RECOVERY - Check systemd state on db1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:48] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:50] PROBLEM - Check systemd state on db1175 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:02] RECOVERY - Check systemd state on pc2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:10] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:37] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/869255 [11:18:48] PROBLEM - Check systemd state on db2105 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:02] RECOVERY - Check systemd state on pc1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:04] RECOVERY - Check systemd state on db1161 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:04] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:10] RECOVERY - Check systemd state on db2136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:50] PROBLEM - Check systemd state on db1190 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:01] (03PS1) 10Majavah: cr-cloud: permit LDAPS traffic too [homer/public] - 10https://gerrit.wikimedia.org/r/869736 [11:20:02] RECOVERY - Check systemd state on es1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:02] RECOVERY - Check systemd state on es2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:50] PROBLEM - Check systemd state on db2116 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:50] PROBLEM - Check systemd state on db2122 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:50] PROBLEM - Check systemd state on db1123 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:04] RECOVERY - Check systemd state on es2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:48] PROBLEM - Check systemd state on pc2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:56] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:04] RECOVERY - Check systemd state on db1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:23] (03CR) 10CI reject: [V: 04-1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/869255 (owner: 10PipelineBot) [11:23:48] PROBLEM - Check systemd state on pc1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:50] PROBLEM - Check systemd state on db1161 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:50] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:58] PROBLEM - Check systemd state on db2136 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:00] RECOVERY - Check systemd state on db2122 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:01] RECOVERY - Check systemd state on db2147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:10] RECOVERY - Check systemd state on db1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:48] PROBLEM - Check systemd state on es1027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:50] PROBLEM - Check systemd state on es2020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:06] RECOVERY - Check systemd state on db1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists1001.wikimedia.org [11:26:52] PROBLEM - Check systemd state on es2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:00] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:02] RECOVERY - Check systemd state on db1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:08] RECOVERY - Check systemd state on db2136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:11] (03PS2) 10Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - 
10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) [11:27:50] PROBLEM - Check systemd state on db1143 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:48] PROBLEM - Check systemd state on db2122 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:48] PROBLEM - Check systemd state on db2147 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:56] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:04] RECOVERY - Check systemd state on db1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:29] (03CR) 10CI reject: [V: 04-1] monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:29:50] PROBLEM - Check systemd state on db1109 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:02] RECOVERY - Check systemd state on es2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:06] RECOVERY - Check systemd state on db1157 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:06] RECOVERY - Check systemd state on db1127 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1001.wikimedia.org [11:31:44] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:44] PROBLEM - Check systemd state on db1104 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:54] PROBLEM - Check systemd state on db2136 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:02] RECOVERY - Check systemd state on es1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:06] RECOVERY - Check systemd state on db1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:20] RECOVERY - Check systemd state on pc2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:26] RECOVERY - Check systemd state on db2152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:02] RECOVERY - Check systemd state on db1159 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:20] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): 
Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) I have written up a note about these H750 based controllers here: https://wikitech.wikimedia.org/wiki/Rai... [11:33:36] RECOVERY - Check systemd state on db1184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:40] RECOVERY - Check systemd state on db2163 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:48] PROBLEM - Check systemd state on db1122 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:00] RECOVERY - Check systemd state on db1206 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:04] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:22] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:30] RECOVERY - Check systemd state on db1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:48] PROBLEM - Check systemd state on es2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:04] RECOVERY - Check systemd state on db2124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:04] RECOVERY - Check systemd state on db2134 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:04] RECOVERY - Check systemd state on db2136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:04] RECOVERY - Check systemd state on db2135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:10] RECOVERY - Check systemd state on db2125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:18] RECOVERY - Check systemd state on db1180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:50] PROBLEM - Check systemd state on db1157 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:50] RECOVERY - Check systemd state on db1201 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:08] RECOVERY - Check systemd state on db1181 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:10] RECOVERY - Check systemd state on db1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:22] RECOVERY - Check systemd state on es2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:24] RECOVERY - Check systemd state on es2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:30] RECOVERY - Check systemd state on db1161 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:42] RECOVERY - Check systemd state on db2116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:44] RECOVERY - Check systemd state on db2122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:52] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:52] RECOVERY - Check systemd state on db1175 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:02] RECOVERY - Check systemd state on db1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:16] RECOVERY - Check systemd state on db2131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:20] RECOVERY - Check systemd state on db1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:20] RECOVERY - Check systemd state on es2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:26] RECOVERY - Check systemd state on es1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:26] RECOVERY - Check systemd state on db1157 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:28] RECOVERY - Check systemd state on es1027 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:32] RECOVERY - Check systemd state on es2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:48] PROBLEM - Check systemd state on db1159 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:50] RECOVERY - Check systemd state on db1197 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:06] RECOVERY - Check systemd state on pc1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:18] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:18] RECOVERY - Check systemd state on db2147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:20] RECOVERY - Check systemd state on db1169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:30] RECOVERY - Check systemd state on es2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:36] RECOVERY - Check systemd state on db1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:50] RECOVERY - Check systemd state on db2130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:02] RECOVERY - Check systemd state on db1188 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:28] RECOVERY - Check systemd state on db2105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:28] RECOVERY - Check systemd state on es1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:34] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:50] PROBLEM - Check systemd state on db2124 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:50] PROBLEM - Check systemd state on db2134 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:04] RECOVERY - Check systemd state on db1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:10] RECOVERY - Check systemd state on db1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:28] RECOVERY - Check systemd state on db1190 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:36] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:26] RECOVERY - Check systemd state on db2134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:26] RECOVERY - Check 
systemd state on db2124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:26] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:41:50] RECOVERY - Check systemd state on db2182 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:28] RECOVERY - Check systemd state on db2132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:30] RECOVERY - Check systemd state on db1159 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:48] RECOVERY - Check systemd state on db1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:58] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:43:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-cloud: permit LDAPS traffic too [homer/public] - 10https://gerrit.wikimedia.org/r/869736 (owner: 10Majavah) [11:46:11] 10SRE, 10serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10akosiaris) 05Open→03Resolved a:03akosiaris I am gonna tentatively resolve this task. Last updated was >2.5 years ago, it quite possibly doesn't even apply anymore. [11:46:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris) [11:47:20] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.000502 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:50:48] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10akosiaris) Switching from #serviceops to #SRE for greater visibility within the SRE team, this could... [11:51:06] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for WMGMC - https://phabricator.wikimedia.org/T325437 (10LClightcat) >>! In T325437#8475759, @Ladsgroup wrote: > Please send me the email addresses. To Ladsgroup AT Gmail. I will run this with T&S so it might take a bit Sorry for the late reply,... 
[11:52:52] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:46] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38902/console" [puppet] - 10https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) (owner: 10Ahmon Dancy) [12:00:31] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm, we can continue here after the freeze" [puppet] - 10https://gerrit.wikimedia.org/r/868737 (https://phabricator.wikimedia.org/T325385) (owner: 10Ahmon Dancy) [12:01:36] (03CR) 10Physikerwelt: [C: 04-1] "Let's fix this in beta first." [deployment-charts] - 10https://gerrit.wikimedia.org/r/869255 (owner: 10PipelineBot) [12:02:33] (03PS1) 10Ladsgroup: Labs: Use the correct address for mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869742 (https://phabricator.wikimedia.org/T311620) [12:02:48] (03CR) 10Jaime Nuche: [C: 03+1] train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - 10https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: 10Ahmon Dancy) [12:03:24] (03PS2) 10Ladsgroup: Labs: Use the correct address for mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869742 (https://phabricator.wikimedia.org/T311620) [12:04:23] (03CR) 10Alexandros Kosiaris: mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/869255 (owner: 10PipelineBot) [12:05:05] (03CR) 10Physikerwelt: [C: 03+1] Labs: Use the correct address for mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869742 (https://phabricator.wikimedia.org/T311620) (owner: 10Ladsgroup) [12:05:33] (03CR) 10Ladsgroup: [C: 03+2] Labs: Use the correct address for mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869742 (https://phabricator.wikimedia.org/T311620) (owner: 10Ladsgroup) [12:06:16] 
(03Merged) 10jenkins-bot: Labs: Use the correct address for mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869742 (https://phabricator.wikimedia.org/T311620) (owner: 10Ladsgroup) [12:12:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) The firmware upgrade was successful. I inadvertently upgraded the iDRAC as well to version 6 but downgrad... [12:13:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1011.eqiad.wmnet with OS bullseye [12:13:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1011... [12:19:09] (03PS4) 10Jbond: apereo_cas::services: drop mfa-u2f support [puppet] - 10https://gerrit.wikimedia.org/r/863292 (https://phabricator.wikimedia.org/T311999) [12:19:11] (03PS4) 10Jbond: apereo_cas: add OidcRegisteredService service support [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) [12:20:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38903/console" [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [12:20:58] (03CR) 10Jbond: [C: 03+2] apereo_cas::services: drop mfa-u2f support [puppet] - 10https://gerrit.wikimedia.org/r/863292 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [12:21:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] apereo_cas: add OidcRegisteredService service support [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [12:38:05] (03PS1) 10Jbond: 
apereo_cas: Add OIDC service to cloud instance [puppet] - 10https://gerrit.wikimedia.org/r/869750 (https://phabricator.wikimedia.org/T311999) [12:42:33] (03CR) 10Btullis: "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/866329 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:43:22] (03CR) 10Btullis: [C: 03+1] "Thank you." [puppet] - 10https://gerrit.wikimedia.org/r/866337 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:43:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38904/console" [puppet] - 10https://gerrit.wikimedia.org/r/869750 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [12:43:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] apereo_cas: Add OIDC service to cloud instance [puppet] - 10https://gerrit.wikimedia.org/r/869750 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [12:44:10] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:01] (03PS1) 10Jbond: hieradata: add oidc id for django test app [puppet] - 10https://gerrit.wikimedia.org/r/869751 [12:55:16] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:22] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:04:10] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:06:09] (03PS2) 10Muehlenhoff: oozie: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863296 (https://phabricator.wikimedia.org/T308013) [13:06:49] 10SRE, 10SRE Program Management, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10Ladsgroup) >>! In T312067#8480491, @Legoktm wrote: > > (payments accepted in FTT) Too late, I already minted it as NFT and you just committed a rightclickcide crime. [13:08:19] (03PS2) 10Muehlenhoff: redis: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868706 (https://phabricator.wikimedia.org/T308013) [13:08:24] (03CR) 10Muehlenhoff: [C: 03+2] oozie: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863296 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:15:25] (03CR) 10Jbond: [C: 03+2] hieradata: add oidc id for django test app [puppet] - 10https://gerrit.wikimedia.org/r/869751 (owner: 10Jbond) [13:16:10] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/869753 [13:18:37] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/869753 (owner: 10Muehlenhoff) [13:26:40] (03CR) 10Muehlenhoff: [C: 03+2] an-web: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866337 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:30:14] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:17] (03CR) 10Muehlenhoff: [C: 03+2] piwik: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866329 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:32:15] (03PS2) 10Muehlenhoff: lists: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868707 
(https://phabricator.wikimedia.org/T308013) [13:33:02] (03PS1) 10Giuseppe Lavagetto: mediawiki: add connection timeout to sendmail [deployment-charts] - 10https://gerrit.wikimedia.org/r/869755 (https://phabricator.wikimedia.org/T325131) [13:37:50] (03CR) 10Muehlenhoff: [C: 03+2] lists: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:41:36] (03PS1) 10AOkoth: Revert "Revert "vrts: add vrts2001 values and add database port in config"" [puppet] - 10https://gerrit.wikimedia.org/r/869717 [13:41:50] (03PS2) 10AOkoth: Revert "Revert "vrts: add vrts2001 values and add database port in config"" [puppet] - 10https://gerrit.wikimedia.org/r/869717 [13:43:19] (03PS3) 10Muehlenhoff: durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) [13:43:34] (03PS2) 10Muehlenhoff: quarry: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863302 (https://phabricator.wikimedia.org/T308013) [13:44:26] (03PS2) 10Muehlenhoff: hive: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863297 (https://phabricator.wikimedia.org/T308013) [13:45:11] (03CR) 10Volans: [C: 04-1] "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [13:45:48] (03CR) 10Volans: [C: 03+2] cloud: authorize cumin from the bastion [puppet] - 10https://gerrit.wikimedia.org/r/869278 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [13:52:07] (03CR) 10Muehlenhoff: [C: 03+2] hive: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863297 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:54:32] (03PS1) 10Jbond: idp01: disable mod_cas on django site [puppet] - 10https://gerrit.wikimedia.org/r/869762 [14:10:34] !log installing ruby-rails-html-sanitizer security updates [14:10:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:38] (03CR) 10Jbond: [C: 03+2] idp01: disable mod_cas on django site [puppet] - 10https://gerrit.wikimedia.org/r/869762 (owner: 10Jbond) [14:10:47] (03PS2) 10Jbond: idp01: disable mod_cas on django site [puppet] - 10https://gerrit.wikimedia.org/r/869762 [14:10:52] (03CR) 10Jbond: [V: 03+2] idp01: disable mod_cas on django site [puppet] - 10https://gerrit.wikimedia.org/r/869762 (owner: 10Jbond) [14:13:10] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:15] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1011.eqiad.wmnet with reason: host reimage [14:16:14] !log installing jackson-databind security updates [14:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1011.eqiad.wmnet with reason: host reimage [14:18:40] (03PS4) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [14:18:44] (03PS1) 10Jbond: P:idp::standalone: fix vhost name [puppet] - 10https://gerrit.wikimedia.org/r/869766 [14:19:28] !log installing libpgjava security updates [14:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] (03CR) 10Jbond: [C: 03+2] P:idp::standalone: fix vhost name [puppet] - 10https://gerrit.wikimedia.org/r/869766 (owner: 10Jbond) [14:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:28] (03PS1) 10Jbond: idp: fix hiera 
data [puppet] - 10https://gerrit.wikimedia.org/r/869769 [14:28:16] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/869769 (owner: 10Jbond) [14:29:43] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [14:30:53] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [14:31:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1011.eqiad.wmnet with OS bullseye [14:31:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1011.eqi... [14:33:51] (03CR) 10Muehlenhoff: "comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 (owner: 10Slyngshede) [14:38:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1012.eqiad.wmnet with OS bullseye [14:38:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1012... 
[14:44:36] (03PS1) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [14:47:03] (03PS1) 10Muehlenhoff: Add buster data to https://os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/869772 [14:47:22] (03CR) 10CI reject: [V: 04-1] Add buster data to https://os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/869772 (owner: 10Muehlenhoff) [14:47:49] (03PS1) 10Jbond: puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 [14:48:57] (03PS7) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [14:49:04] (03CR) 10Jbond: "thanks updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [14:52:28] (03CR) 10CI reject: [V: 04-1] puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [14:52:35] (03CR) 10Volans: "It's hard to review due to the move to class and gerrit not being smart enough. In general LGTM, left just a couple of comments inline. 
Th" [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [14:55:20] (03PS1) 10Slyngshede: C:ldap::client::utils absent ldapsupportlib [puppet] - 10https://gerrit.wikimedia.org/r/869776 [14:56:04] (03PS2) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [14:57:45] (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [14:58:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:21] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_chartmuseum:prod.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:25] (03CR) 10Ottomata: kafka_config: set a real string for default api_version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [15:08:06] (03PS1) 10Muehlenhoff: Add insetup variant for undefined ownership [puppet] - 10https://gerrit.wikimedia.org/r/869777 [15:08:25] (03CR) 10CI reject: [V: 04-1] Add insetup variant for undefined ownership [puppet] - 10https://gerrit.wikimedia.org/r/869777 (owner: 10Muehlenhoff) [15:17:21] (03PS1) 10Jbond: idp01: Add delegateed provider [puppet] - 10https://gerrit.wikimedia.org/r/869778 [15:19:38] (03CR) 10Jbond: [C: 03+2] idp01: Add delegateed provider [puppet] - 10https://gerrit.wikimedia.org/r/869778 (owner: 10Jbond) [15:20:56] (03CR) 10David Caro: [C: 03+1] "LGTM :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868789 
(https://phabricator.wikimedia.org/T314581) (owner: 10Majavah) [15:21:09] (03CR) 10Majavah: [C: 03+2] tox: revert flake8 4.x pinning [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868789 (https://phabricator.wikimedia.org/T314581) (owner: 10Majavah) [15:21:23] (03PS1) 10Volans: cloudcumin: add FQDN of the eqiad1 bastion [puppet] - 10https://gerrit.wikimedia.org/r/869779 (https://phabricator.wikimedia.org/T319401) [15:22:31] (03Merged) 10jenkins-bot: tox: revert flake8 4.x pinning [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/868789 (https://phabricator.wikimedia.org/T314581) (owner: 10Majavah) [15:22:52] (03PS3) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [15:23:15] (03CR) 10David Caro: [C: 03+1] cloudcumin: add FQDN of the eqiad1 bastion [puppet] - 10https://gerrit.wikimedia.org/r/869779 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:24:01] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:22] (03PS2) 10Muehlenhoff: Add buster data to https://os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/869772 [15:25:39] !log installing jupyter-core security updates [15:25:41] (03CR) 10CI reject: [V: 04-1] Add buster data to https://os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/869772 (owner: 10Muehlenhoff) [15:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:27:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 
1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:27:37] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10phaultfinder) [15:28:23] (03CR) 10Volans: [C: 03+2] cloudcumin: add FQDN of the eqiad1 bastion [puppet] - 10https://gerrit.wikimedia.org/r/869779 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:28:35] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:51] (03PS3) 10Muehlenhoff: Add buster data to https://os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/869772 [15:30:23] (03PS1) 10Jbond: idp01: fix allowed_delage logic [puppet] - 10https://gerrit.wikimedia.org/r/869780 [15:30:49] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01104 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:31:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:31:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38906/console" [puppet] - 10https://gerrit.wikimedia.org/r/869780 (owner: 10Jbond) [15:32:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869772 (owner: 10Muehlenhoff) [15:32:31] RECOVERY - High average POST latency for mw requests on 
api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:33:05] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01054 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:33:25] mmmh puppet seems broken widely [15:33:47] (03PS1) 10Alexandros Kosiaris: rsync: Fix a typo [puppet] - 10https://gerrit.wikimedia.org/r/869781 [15:34:46] (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:36:04] * jbond looking at puppet [15:36:49] some are File[/etc/wikimedia/logout.d/50-systemdlogoutd] [15:37:10] others have different failures, did puppet crash? 
[15:37:33] looking at puppet board I'd say either webserver restart or puppetdb restart [15:38:02] icinga reported some network issues at :26, wonder if those are related [15:39:58] I gradually upgraded Apache on puppet masters, that possibly was still too fast to not trip over the threshold [15:40:01] (03PS1) 10Volans: cloudcumin: fix hieradata for codfw1dev bastion [puppet] - 10https://gerrit.wikimedia.org/r/869782 (https://phabricator.wikimedia.org/T319401) [15:40:03] should recover soonish [15:40:27] ahh ok that will be it thanks moritz and yes should be harmless [15:40:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869782 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:42:16] (03CR) 10Volans: [C: 03+2] cloudcumin: fix hieradata for codfw1dev bastion [puppet] - 10https://gerrit.wikimedia.org/r/869782 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:42:45] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1012.eqiad.wmnet with reason: host reimage [15:45:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1012.eqiad.wmnet with reason: host reimage [15:47:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:47:36] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:47:58] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003011 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:49:29] (03CR) 10Muehlenhoff: [C: 03+2] Add buster 
data to https://os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/869772 (owner: 10Muehlenhoff) [15:51:16] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005018 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:52:28] 10SRE, 10observability: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10lmata) >>! In T319434#8480722, @MoritzMuehlenhoff wrote: >>>! In T319434#8480708, @akosiaris wrote: >> Actually, per 28f86674054b7 #observability has taken over arclamp from #serviceo... [15:52:29] (03PS1) 10Jbond: wmflib: add new function to get first usable ip from network [puppet] - 10https://gerrit.wikimedia.org/r/869785 [15:54:40] (03PS2) 10FNegri: Reinstate innodb_large_prefix on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) [15:54:46] (03CR) 10CI reject: [V: 04-1] wmflib: add new function to get first usable ip from network [puppet] - 10https://gerrit.wikimedia.org/r/869785 (owner: 10Jbond) [15:55:53] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38907/console" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri) [15:59:30] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [16:00:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:00:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
btullis@cumin1001" [16:00:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1012.eqiad.wmnet with OS bullseye [16:00:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1012.eqi... [16:03:16] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38908/console" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri) [16:05:56] (03CR) 10FNegri: [V: 03+1 C: 03+2] Reinstate innodb_large_prefix on ToolsDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri) [16:17:10] !log dancy@deploy1002 backport aborted: (duration: 00m 04s) [16:18:40] (03PS2) 10Jelto: P:spicerack: add python-gitlab package [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) [16:19:09] !log dancy@deploy1002 backport aborted: (duration: 00m 01s) [16:19:18] ^ Those are me testing. 
Not actually deploying anything [16:21:50] (03CR) 10Jelto: P:spicerack: add python-gitlab package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:22:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:26:15] (03PS1) 10Volans: cloud cumin: fix ssh config for codf1dev bastion [puppet] - 10https://gerrit.wikimedia.org/r/869816 (https://phabricator.wikimedia.org/T319401) [16:27:12] (03PS3) 10Eevans: Decommissioning restbase-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) [16:45:18] !log dancy@deploy1002 scap failed: NameError name 'logging' is not defined (duration: 00m 00s) [16:47:02] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:33] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10Volans) [16:49:54] !log dancy@deploy1002 scap failed: NameError name 'SyslogFormatter' is not defined (duration: 00m 00s) [16:49:54] (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:50:26] !log dancy@deploy1002 scap failed: NameError name 'SyslogFormatter' is not defined (duration: 00m 00s) [16:51:48] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:00] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:54] (03CR) 10JHathaway: Add a Puppetfile to track vendored modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [17:26:27] (03CR) 10Jbond: [C: 04-1] Add a Puppetfile to track vendored modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [17:31:12] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [17:39:14] (03CR) 10Volans: "couple of comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [17:43:55] (03PS1) 10Slyngshede: C:ldap::management use bitu-ldap from add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/869824 [17:45:57] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment add RQ and database settings. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868621 (owner: 10Slyngshede) [17:48:00] is this thing on [17:48:13] ^ (That was me) [18:02:02] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:19] 10SRE, 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10RobH) I've removed and plugged back in power and while the LEDs on the appliances mainboard light up, it doesn't output to serial or display anything. So it seems this is now defunct. It is well out of warranty, having b... 
[18:05:21] (03PS1) 10Bking: query_service: add wdqs/wcqs hosts as rsync clients to clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/869828 (https://phabricator.wikimedia.org/T323096) [18:06:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869828 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [18:06:50] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:06] !log otrs1001 - upgraded clamav daemon package, manually removed run dir and pid file, stopped, started clamav daemon [18:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:15] (03CR) 10Jbond: Add a Puppetfile to track vendored modules (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [18:15:12] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:38] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:18:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/869824 (owner: 10Slyngshede) [18:19:52] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp01: fix allowed_delage logic [puppet] - 10https://gerrit.wikimedia.org/r/869780 (owner: 10Jbond) [18:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[18:21:26] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:21:48] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:32:13] (03CR) 10DCausse: [C: 03+1] query_service: add wdqs/wcqs hosts as rsync clients to clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/869828 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [18:39:48] (03PS1) 10BryanDavis: striker: Bump container to 2022-12-20-182951-production [puppet] - 10https://gerrit.wikimedia.org/r/869832 (https://phabricator.wikimedia.org/T325622) [18:43:27] (03CR) 10BryanDavis: "PCC results: https://puppet-compiler.wmflabs.org/output/869832/4/" [puppet] - 10https://gerrit.wikimedia.org/r/869832 (https://phabricator.wikimedia.org/T325622) (owner: 10BryanDavis) [18:48:24] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container to 2022-12-20-182951-production [puppet] - 10https://gerrit.wikimedia.org/r/869832 (https://phabricator.wikimedia.org/T325622) (owner: 10BryanDavis) [18:54:50] (03PS1) 10Jbond: aereo_cas: add addtional OIDC parameteres [puppet] - 10https://gerrit.wikimedia.org/r/869837 (https://phabricator.wikimedia.org/T311999) [18:55:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38909/console" [puppet] - 10https://gerrit.wikimedia.org/r/869837 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [18:59:57] (03CR) 10Dzahn: "What broke exactly? I am surprised to see this one reverted. I thought the problem was just a missing ; and that was fixed. 
Could you ad" [puppet] - 10https://gerrit.wikimedia.org/r/868534 (owner: 10AOkoth) [19:06:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] aereo_cas: add addtional OIDC parameteres [puppet] - 10https://gerrit.wikimedia.org/r/869837 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [19:07:44] (03CR) 10FNegri: cloud cumin: fix ssh config for codf1dev bastion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869816 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [19:12:38] (03CR) 10Dzahn: admin: create new group deployment-jenkins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [19:12:51] (03PS3) 10Dzahn: admin: create new group deployment-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) [19:14:32] (03CR) 10Dzahn: admin: create new group deployment-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [19:19:01] (03PS1) 10Jbond: apreo_cas: Bypass the approval prompt [puppet] - 10https://gerrit.wikimedia.org/r/869840 (https://phabricator.wikimedia.org/T311999) [19:20:05] (03CR) 10Bking: [C: 03+2] query_service: add wdqs/wcqs hosts as rsync clients to clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/869828 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [19:21:15] (03CR) 10Volans: "Good questions, I've replied inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/869816 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [19:21:17] (03PS1) 10Jbond: ldjango test app: add debug page [puppet] - 10https://gerrit.wikimedia.org/r/869842 [19:21:39] (03CR) 10Jbond: [C: 03+2] apreo_cas: Bypass the approval prompt [puppet] - 10https://gerrit.wikimedia.org/r/869840 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [19:21:51] (03PS2) 10Jbond: django test app: add debug page [puppet] - 10https://gerrit.wikimedia.org/r/869842 [19:22:07] (03CR) 10Jbond: [C: 03+2] django test app: add debug page [puppet] - 10https://gerrit.wikimedia.org/r/869842 (owner: 10Jbond) [19:23:12] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:36] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:43] (03CR) 10CI reject: [V: 04-1] django test app: add debug page [puppet] - 10https://gerrit.wikimedia.org/r/869842 (owner: 10Jbond) [19:26:00] (03PS2) 10Jbond: Migrate service definitions to CasRegisteredService [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [19:26:27] (03CR) 10Jbond: "i think we can go ahead and merge this now" [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [19:35:30] (03PS3) 10Jbond: django test app: add debug page [puppet] - 10https://gerrit.wikimedia.org/r/869842 [19:37:07] (03CR) 10Jbond: "probably best you don't see this :P" [puppet] - 10https://gerrit.wikimedia.org/r/869842 (owner: 10Jbond) [19:37:21] (03PS1) 10JHathaway: Upgrade concat to v7.1.1 to support stdlib 8.X [puppet] - 
10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) [19:37:33] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Dzahn) [19:37:39] (03CR) 10Jbond: [C: 03+2] django test app: add debug page [puppet] - 10https://gerrit.wikimedia.org/r/869842 (owner: 10Jbond) [19:38:26] (03CR) 10JHathaway: Add a Puppetfile to track vendored modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [19:40:13] (03CR) 10JHathaway: Add vendored module bodgit/puppet-postfix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway) [19:42:16] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Dzahn) >>! In T325607#8480269, @Albertoleoncio wrote: > Including SRE as it involves Google Search Console I don't think SRE is actually the right... 
[19:49:27] (03CR) 10Jbond: [C: 03+1] "LGTM however i think it would also be safe to go to 7.3.0, nothing significant (only CI, metadata etc) changes between 7.1.1 and 7.3.0" [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [19:53:21] (03CR) 10JHathaway: Upgrade concat to v7.1.1 to support stdlib 8.X (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [19:53:36] (03PS1) 10JHathaway: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869848 (https://phabricator.wikimedia.org/T325597) [19:53:56] (03CR) 10CI reject: [V: 04-1] Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869848 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [19:54:57] (03PS2) 10JHathaway: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) [19:55:39] (03Abandoned) 10JHathaway: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869848 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [19:58:20] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [20:00:17] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [20:01:01] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [20:01:11] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [20:04:16] PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:05:22] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:05:28] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [20:09:53] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:10:42] !log 
bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:10:54] PROBLEM - Host wdqs2012 is DOWN: PING CRITICAL - Packet loss = 100% [20:11:14] RECOVERY - Host wdqs2012 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [20:11:20] (03CR) 10Eevans: [C: 03+2] Decommissioning restbase-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) (owner: 10Eevans) [20:16:41] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:21:50] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [20:22:50] PROBLEM - Host wdqs1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:30:51] 10ops-eqiad, 10decommission-hardware: decommission restbase-dev100{4,5,6} - https://phabricator.wikimedia.org/T325387 (10Eevans) a:05Eevans→03Cmjohnson [20:31:06] 10ops-eqiad, 10decommission-hardware: decommission restbase-dev100{4,5,6} - https://phabricator.wikimedia.org/T325387 (10Eevans) [20:39:10] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [20:52:12] RECOVERY - Host wdqs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:52:54] RECOVERY - Host wdqs1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:13:02] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Noting that the cert for links.email is now in place and all requirements stated in this task are implemented: https://www.ss... [21:19:48] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) From my perspective, this task can be closed unless there are any further questions or comments about the SSL cert for this s... 
[21:26:25] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg) a:03Vgutierrez (Assigning to @Vgutierrez per the work/patch, letting ya'll review/close however you'd like over in Traffic.) [21:55:04] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Darwinius) @Dzahn a request for someone to look into this using the Search Console counts as that? [22:02:58] (03CR) 10Krinkle: "In reviewing this, I'm trying to answer these three questions:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525977 (owner: 10Aaron Schulz) [22:11:52] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [22:15:06] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [22:15:54] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Dzahn) If you are asking for access to the Search Console, please clarify who needs access to what and add the access request tag. That would make... 
[22:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:46:08] (03PS1) 10Dzahn: phabricator: rewrite https://phabricator.wikimedia.org/r/ to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311) [22:47:22] (03PS2) 10Dzahn: phabricator: rewrite https://phabricator.wikimedia.org/r/ to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311) [22:51:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:51:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:52:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:53:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:00:00] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:01:36] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:07:02] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:44] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:17] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10Dzahn) a:03Wangombe [23:11:50] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:58] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:52] !log [WDQS] Powercycling `wdqs1005` [23:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:44] RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [23:24:52] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:52] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:26:28] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:30] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.136 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook