[00:39:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/923584
[00:39:19] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/923584 (owner: TrainBranchBot)
[00:56:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/923584 (owner: TrainBranchBot)
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:13] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:57:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:27:02] (PS2) KartikMistry: Undeploy Special:Contribute from unsupported skins [mediawiki-config] - https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366)
[05:10:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool sanitarium masters for s1, s2, s3, s5 T337446', diff saved to https://phabricator.wikimedia.org/P48598 and previous config saved to /var/cache/conftool/dbconfig/20230529-051043-root.json
[05:10:47] PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:10:47] PROBLEM - MariaDB Replica IO: s8 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:10:50] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[05:12:31] PROBLEM - MariaDB Replica IO: s4 on clouddb1015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1155.eqiad.wmnet:3314 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:12:41] PROBLEM - MariaDB Replica IO: s6 on clouddb1019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1155.eqiad.wmnet:3316 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:12:47] PROBLEM - MariaDB Replica IO: s4 on clouddb1019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1155.eqiad.wmnet:3314 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:12:47] PROBLEM - MariaDB Replica IO: s6 on clouddb1015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1155.eqiad.wmnet:3316 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:18:25] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:20:20] (PS1) Marostegui: db1156,db1161,db1196,db1212: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923773 (https://phabricator.wikimedia.org/T337446)
[05:21:59] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 890.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:22:13] PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 903.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:15] RECOVERY - MariaDB Replica IO: s4 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:25] RECOVERY - MariaDB Replica IO: s6 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:31] RECOVERY - MariaDB Replica IO: s4 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:23:31] RECOVERY - MariaDB Replica IO: s6 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:24:37] RECOVERY - MariaDB Replica IO: s8 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:24:37] RECOVERY - MariaDB Replica IO: s8 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:25:05] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 2.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:25:17] RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:27:42] (CR) Marostegui: [C: +2] db1156,db1161,db1196,db1212: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/923773 (https://phabricator.wikimedia.org/T337446) (owner: Marostegui)
[05:30:39] (PS1) Marostegui: Revert "control-mariadb-10.4-bullseye: Bump version" [software] - https://gerrit.wikimedia.org/r/923641
[05:31:18] (CR) Santhosh: [C: +1] Undeploy Special:Contribute from unsupported skins [mediawiki-config] - https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366) (owner: KartikMistry)
[05:32:10] (CR) Marostegui: [C: +2] Revert "control-mariadb-10.4-bullseye: Bump version" [software] - https://gerrit.wikimedia.org/r/923641 (owner: Marostegui)
[05:32:30] SRE, ops-codfw, DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (Marostegui) Started mariadb
[05:37:05] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:40:07] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:49:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:49:41] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:54:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:20:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:22:31] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:24:14] (PS1) KartikMistry: Remove OpusMT service [deployment-charts] - https://gerrit.wikimedia.org/r/923920 (https://phabricator.wikimedia.org/T337657)
[06:27:53] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:31:31] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:35:45] PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3313 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:36:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:41:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:06] Deploy window No deploys all day (Per Deployments/Yearly_calendar)! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230529T0700)
[07:15:50] (PS1) Marostegui: db1161: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/923922
[07:16:27] (CR) Marostegui: [C: +2] db1161: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/923922 (owner: Marostegui)
[07:16:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48601 and previous config saved to /var/cache/conftool/dbconfig/20230529-071643-root.json
[07:20:51] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:23:58] (CR) Filippo Giunchedi: [C: +2] prometheus: de-provision 'global' instance [puppet] - https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[07:30:03] (CR) Filippo Giunchedi: [V: +1 C: +2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41388/console" [puppet] - https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[07:31:11] (PS3) Filippo Giunchedi: prometheus: de-provision 'global' instance [puppet] - https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196)
[07:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48602 and previous config saved to /var/cache/conftool/dbconfig/20230529-073148-root.json
[07:32:08] (CR) Filippo Giunchedi: [V: +2] prometheus: de-provision 'global' instance [puppet] - https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[07:40:43] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:40:57] (CR) Filippo Giunchedi: [C: +2] prometheus: don't add empty targets [puppet] - https://gerrit.wikimedia.org/r/923576 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[07:46:01] (CR) Filippo Giunchedi: "LGTM overall!" [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[07:46:08] (CR) Giuseppe Lavagetto: [C: +2] Add the possibility to override CI settings using a .fixturesctl.yaml files [deployment-charts] - https://gerrit.wikimedia.org/r/922793 (https://phabricator.wikimedia.org/T337359) (owner: Giuseppe Lavagetto)
[07:46:52] (CR) Filippo Giunchedi: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41389/console" [puppet] - https://gerrit.wikimedia.org/r/923576 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[07:46:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48603 and previous config saved to /var/cache/conftool/dbconfig/20230529-074653-root.json
[07:49:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:50:00] (PS1) Elukey: admin_ng: fix limit ranges experimental ns config for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/923924
[07:54:22] (CR) Elukey: [C: +2] admin_ng: fix limit ranges experimental ns config for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/923924 (owner: Elukey)
[07:56:35] (CR) Ilias Sarantopoulos: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/923924 (owner: Elukey)
[07:56:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:57:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:57:03] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48604 and previous config saved to /var/cache/conftool/dbconfig/20230529-080157-root.json
[08:03:16] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:07:14] (CR) Filippo Giunchedi: [C: +2] sre: add systemd unit crashloop alert [alerts] - https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: Filippo Giunchedi)
[08:08:12] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:09:19] (PS1) Elukey: admin_ng: refactor duplicate limitranges in ml-serve's config [deployment-charts] - https://gerrit.wikimedia.org/r/924047
[08:10:24] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 3: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy
[08:11:16] (PS2) Giuseppe Lavagetto: Add the possibility to override CI settings using a .fixturesctl.yaml files [deployment-charts] - https://gerrit.wikimedia.org/r/922793 (https://phabricator.wikimedia.org/T337359)
[08:15:29] (PS4) D3r1ck01: Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: Daniel Kinzler)
[08:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48606 and previous config saved to /var/cache/conftool/dbconfig/20230529-081702-root.json
[08:18:52] (CR) D3r1ck01: Switch VisualEditor to not use RESTbase on small and medium wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: Daniel Kinzler)
[08:22:20] (PS4) Giuseppe Lavagetto: rakemodules: improve condition in should_patch? [deployment-charts] - https://gerrit.wikimedia.org/r/923538 (owner: Elukey)
[08:23:52] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 3: https://wikitech.wikimedia.org/wiki/HAProxy
[08:24:34] (CR) Giuseppe Lavagetto: [C: -2] "This patch would've had unintended consequences, we already merged a patch that fixed the problem" [deployment-charts] - https://gerrit.wikimedia.org/r/923538 (owner: Elukey)
[08:24:58] <_joe_> jouncebot: nowandnext
[08:24:59] For the next 22 hour(s) and 35 minute(s): No deploys all day (Per Deployments/Yearly_calendar)! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230529T0700)
[08:24:59] In 22 hour(s) and 35 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230530T0700)
[08:25:04] (PS2) Elukey: admin_ng: set new limitranges for experimental in ml-serve's config [deployment-charts] - https://gerrit.wikimedia.org/r/924047
[08:25:22] (PS2) Giuseppe Lavagetto: mediawiki: fix deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/923247
[08:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48607 and previous config saved to /var/cache/conftool/dbconfig/20230529-083206-root.json
[08:33:18] (CR) Ilias Sarantopoulos: [C: +1] admin_ng: set new limitranges for experimental in ml-serve's config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/924047 (owner: Elukey)
[08:33:39] ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 3: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy
[08:35:44] SRE, SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (Volans) I'm wondering if this was the right long term approach. In general we're trying to reduce the need for global root, not expand it. I see that we already have a `wmcs-roots` group in `...
[08:39:35] (PS3) Elukey: admin_ng: set new limitranges for experimental in ml-serve's config [deployment-charts] - https://gerrit.wikimedia.org/r/924047
[08:39:37] (PS1) Elukey: ml-services: updated memory limit ranges to host the bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/924049
[08:42:20] (CR) Elukey: [C: +2] admin_ng: set new limitranges for experimental in ml-serve's config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/924047 (owner: Elukey)
[08:43:00] (CR) Elukey: [C: +2] ml-services: updated memory limit ranges to host the bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/924049 (owner: Elukey)
[08:43:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:59] btullis: FYI rsync-published has been flapping on stat1009 ^ since last week
[08:45:05] (PS1) KartikMistry: testwiki: Enable Section Translation for 9 Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/924050 (https://phabricator.wikimedia.org/T337290)
[08:45:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:36] !log delete old raw blocks from thanos - T337236
[08:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:41] T337236: Audit stored Thanos data - https://phabricator.wikimedia.org/T337236
[08:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48608 and previous config saved to /var/cache/conftool/dbconfig/20230529-084711-root.json
[08:48:10] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:32] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:50:50] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[08:51:20] (PS1) Marostegui: db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/924051
[08:52:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:40] (CR) Marostegui: [C: -2] "not yet" [puppet] - https://gerrit.wikimedia.org/r/924051 (owner: Marostegui)
[08:53:45] (CR) Volans: "Couple of general comments on the idea." [cookbooks] - https://gerrit.wikimedia.org/r/923670 (owner: Jbond)
[08:54:12] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:47] (PS3) Giuseppe Lavagetto: mediawiki: fix deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/923247
[08:55:50] (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[08:59:22] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/923247 (owner: Giuseppe Lavagetto)
[08:59:32] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:10] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:13] (Merged) jenkins-bot: mediawiki: fix deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/923247 (owner: Giuseppe Lavagetto)
[09:02:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48609 and previous config saved to /var/cache/conftool/dbconfig/20230529-090216-root.json
[09:05:18] (PS1) Gergő Tisza: Improve logging of invalid image recommendation kinds [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923643
[09:06:52] (PS1) Gergő Tisza: Section images: Do not treat unexpected kinds as production errors [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - https://gerrit.wikimedia.org/r/923644
[09:07:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:10] (PS1) Filippo Giunchedi: team-o11y: warn on thanos-compact not performing enough downsamples [alerts] - https://gerrit.wikimedia.org/r/924052 (https://phabricator.wikimedia.org/T337251)
[09:08:17] (PS1) Urbanecm: [Growth] Enable user impact refresh on 10 more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/924053 (https://phabricator.wikimedia.org/T336203)
[09:11:09] (CR) Filippo Giunchedi: [C: +2] team-o11y: warn on thanos-compact not performing enough downsamples [alerts] - https://gerrit.wikimedia.org/r/924052 (https://phabricator.wikimedia.org/T337251) (owner: Filippo Giunchedi)
[09:12:12] (CR) Filippo Giunchedi: [V: +1 C: +2] profile: start cadvisor rollout in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/923531 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[09:13:26] !log start partial rollout of cadvisor to eqiad/codfw (~10%) T108027
[09:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:30] T108027: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027
[09:15:57] (CR) Volans: "comments inline" [cookbooks] - https://gerrit.wikimedia.org/r/923662 (owner: Jbond)
[09:16:28] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:28] PROBLEM - Check systemd state on ms-fe1013 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:50] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:01] (CR) Effie Mouzeli: Enable parser cache warming jobs for parsoid on some top wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: Daniel Kinzler)
[09:25:37] SRE, Machine-Learning-Team, ORES, Scap: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501 (elukey) Open→Declined We are moving to Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing I am closing old tasks related to ORES sin...
[09:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:29:48] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:30:25] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[09:31:12] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:31:36] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[09:31:57] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[09:33:16] (CR) Volans: Ganeti: Add small script to display free resources in gnt groups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/923608 (owner: Jbond)
[09:35:17] (PS1) Filippo Giunchedi: prometheus: don't enable server when absenting [puppet] - https://gerrit.wikimedia.org/r/924055 (https://phabricator.wikimedia.org/T288196)
[09:36:46] (CR) Marostegui: [C: +2] db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/924051 (owner: Marostegui)
[09:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48610 and previous config saved to /var/cache/conftool/dbconfig/20230529-093709-root.json
[09:38:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:34] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:23] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41390/console" [puppet] - https://gerrit.wikimedia.org/r/924055 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[09:43:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:43:16] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: don't enable server when absenting [puppet] - https://gerrit.wikimedia.org/r/924055 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[09:45:02] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[09:45:31] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[09:45:58] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:40] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[09:50:05] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[09:51:10] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[09:52:04] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[09:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48611 and previous config saved to /var/cache/conftool/dbconfig/20230529-095214-root.json
[09:53:20] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:36] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[09:56:04] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[09:58:38] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[09:59:20] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[10:00:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:43] !log restarting pybal on lvs1020
[10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:58] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[10:03:22] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[10:03:45] !log oblivian@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[10:03:46] !log oblivian@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[10:04:00] !log oblivian@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[10:04:09] !log oblivian@deploy1002 helmfile [codfw] [main] DONE 
helmfile.d/services/mw-jobrunner : sync [10:05:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:09] !log oblivian@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [10:05:09] !log oblivian@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [10:05:21] !log oblivian@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [10:05:22] !log oblivian@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [10:07:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48612 and previous config saved to /var/cache/conftool/dbconfig/20230529-100719-root.json [10:07:30] !log restarting pybal on lvs1018 [10:07:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:55] (03CR) 10Volans: "Some questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [10:10:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:12:05] (03PS1) 10Volans: spicerack: add test-cookbook script [puppet] - 10https://gerrit.wikimedia.org/r/924059 [10:12:41] 10SRE, 
10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) >>! In T108027#8886158, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://... [10:13:05] (03PS11) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [10:13:07] (03PS9) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 [10:13:09] (03PS2) 10Giuseppe Lavagetto: Do not use firejail on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920213 [10:14:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:16:00] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:15] (03CR) 10Abijeet Patro: [C: 03+1] ttm: use new config option to separate readable and writable services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922481 (https://phabricator.wikimedia.org/T322284) (owner: 10DCausse) [10:22:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48614 and previous config saved to /var/cache/conftool/dbconfig/20230529-102223-root.json [10:22:48] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:15] (03PS1) 10Urbanecm: [Growth] Enable new Impact for 10 additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) [10:26:45] (03CR) 10Urbanecm: [C: 04-2] "not yet" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:14] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:29] btullis: I've silenced 'check systemd state' on stat1009 in icinga for a week FYI [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48615 and previous config saved to /var/cache/conftool/dbconfig/20230529-103728-root.json [10:46:04] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:47:08] RECOVERY - Check systemd state on ms-fe1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48616 and previous config saved to /var/cache/conftool/dbconfig/20230529-105233-root.json [11:00:50] (03PS1) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti 
resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [11:03:27] (03CR) 10CI reject: [V: 04-1] sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [11:07:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48617 and previous config saved to /var/cache/conftool/dbconfig/20230529-110737-root.json [11:13:11] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:13:48] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48618 and previous config saved to /var/cache/conftool/dbconfig/20230529-112242-root.json [11:24:34] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:25:22] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:25:43] (03PS2) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [11:26:22] (03CR) 10Jbond: Ganeti: Add small script to display free resources in gnt groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923608 (owner: 10Jbond) [11:27:09] (03PS3) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [11:29:26] (03CR) 10CI reject: [V: 04-1] sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [11:32:10] (03PS2) 10KartikMistry: Update cxserver to 2023-05-29-112644-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/923920 (https://phabricator.wikimedia.org/T337657) [11:43:44] (03PS4) 10Jbond: 
sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:11:03] (03Abandoned) 10Elukey: rakemodules: improve condition in should_patch? [deployment-charts] - 10https://gerrit.wikimedia.org/r/923538 (owner: 10Elukey) [12:53:14] (03CR) 10Volans: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:53:54] PROBLEM - SSH on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:59:39] (03PS1) 10Filippo Giunchedi: sre: alert on units crashlooping more than once a minute [alerts] - 10https://gerrit.wikimedia.org/r/924078 (https://phabricator.wikimedia.org/T293970) [12:59:44] (03CR) 10Jbond: [C: 03+1] "LGTM, cc brett as they asked about something like this recently" [puppet] - 10https://gerrit.wikimedia.org/r/924059 (owner: 10Volans) [13:01:44] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: alert on units crashlooping more than once a minute [alerts] - 10https://gerrit.wikimedia.org/r/924078 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [13:07:26] (03PS5) 10Fabfur: SRE: Add a new cookbook that allows to run puppet configuration while restarting Varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) [13:10:38] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:29] (03CR) 10Vgutierrez: [C: 03+1] SRE: Add a new cookbook that allows to run puppet configuration while restarting Varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:12:38] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: 
Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [13:12:42] (03PS5) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [13:14:28] RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:17:33] (03PS6) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [13:17:55] (03CR) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [13:23:24] RECOVERY - SSH on wdqs2004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:24:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:29] (03PS1) 10Gergő Tisza: GrowthExperiments: Re-add $wgGERestbaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924079 [13:26:48] (03CR) 10Gergő Tisza: "This caused the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912310 (owner: 10Urbanecm) [13:34:37] Amir1 / marostegui is s7 depooled right now? it seems nothing can connect to it https://phabricator.wikimedia.org/T337682 [13:34:54] I assume this is perhaps part of https://phabricator.wikimedia.org/T337446 and we can't do anything but wait? 
just verifying :) [13:35:30] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:49] the other slices are heavily lagged, but we can and have been able to still connect to them. s7 went down sometime today [13:36:21] musikanimal: cloud or production? [13:36:27] cloud [13:36:42] Only wait [13:36:44] Sorry [13:37:25] Manuel is rebuilding all of the hosts. It'll take a while [13:38:22] TLDR: A bug in new version of mariadb (very likely) corrupted data [13:38:53] okay I see, so we're now recloning and that was started on s7 [13:39:12] I didn't realize full downtime was happening, as opposed to just replag [13:39:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:15] should an email to wikitech-l and/or cloud-announce be sent? [13:41:49] (03CR) 10Volans: SRE: Add a new cookbook that allows to run puppet configuration while restarting Varnish (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:47:24] (03PS1) 10Giuseppe Lavagetto: trafficserver: also match mobile domains in mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/924080 [13:47:29] musikanimal: I am trying to find a balance, but they all have tradeoffs unfortunately. s7 is almost done though [13:48:02] musikanimal: anyways, the data was like 3-4 days old anyways, so not connecting and having old data....not sure what's best :) [13:48:33] okay, no worries :) I understand if there's not much that can be done to avoid downtime, but it would be great if we could send an email or some sort of update to tool maintainers etc. 
[13:48:52] For now I've just put a notice at the top of XTools pointing to T337446 [13:48:52] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [13:48:54] musikanimal: but the data was already old [13:49:06] Like 3-4 days old or almost a week [13:49:17] yes, that's not great, but no data at all is worse I think [13:49:25] back in the day, replag was common so users are used to that [13:49:35] complete downtime and people start freaking out [13:49:43] Well I am trying to fix this as fast as I can [13:49:56] all good! didn't mean to put pressure on you :) it is what it is [13:50:13] people complained about replag too, if I don't do it this way, it might take me like 2 weeks to get it fixed [13:50:19] (03CR) 10Fabfur: SRE: Add a new cookbook that allows to run puppet configuration while restarting Varnish (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [13:50:37] I see, in that case you're making the right decision, I think [13:50:55] (03PS6) 10Fabfur: SRE: Add a new cookbook that allows to run puppet configuration while restarting Varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) [13:51:53] sorry I didn't understand. I know that you know what you're doing and am not questioning that :) I just wanted to know what to tell the users. Is "several days of downtime" a fair estimate? better to overestimate, I guess [13:52:52] musikanimal: I think it will be a few days of on-off availability for one section at a time. Like I am almost done with s7 and once that is done, I will shut down s5 entirely [13:54:23] got it. I'll stick with the current wording, then. Thanks for all you do, marostegui! 
[13:54:36] thanks musikanimal :* [13:54:59] :hugs: [13:57:29] !log vgutierrez@puppetmaster1001 conftool action : set/weight=10; selector: name=dbproxy.*,dc=eqiad [13:58:08] (03CR) 10Volans: [C: 03+1] "LGTM, ready for testing!" [cookbooks] - 10https://gerrit.wikimedia.org/r/922844 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [14:03:15] (03PS1) 10Jbond: ganeti. add GanetiRAPI.nodes and GanetiRAPI.groups [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 [14:03:40] (03PS1) 10Func: Revert "Rename wgPageContentLanguage to wgPageViewLanguage" partially [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924086 (https://phabricator.wikimedia.org/T337634) [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:10] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on stat1009.eqiad.wmnet with reason: Bringing stat1009 into service [14:18:34] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on stat1009.eqiad.wmnet with reason: Bringing stat1009 into service [14:24:54] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:40] PROBLEM - WDQS SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.208 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:29:12] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:26] RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:01:24] PROBLEM - Check systemd state on vrts2001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,vrts-cache-cleanup.service,wmf_auto_restart_apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:53] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Extend router ACLs to block 4194/tcp on LVSes - https://phabricator.wikimedia.org/T337689 (10fgiunchedi) [15:04:06] (ProbeDown) firing: Service vrts2001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts2001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:05] (03CR) 10Hnowlan: [C: 03+2] thumbor: move xcf support to imagemagick [deployment-charts] - 10https://gerrit.wikimedia.org/r/921053 (https://phabricator.wikimedia.org/T260285) (owner: 10Hnowlan) [15:10:54] (03Merged) 10jenkins-bot: thumbor: move xcf support to imagemagick [deployment-charts] - 10https://gerrit.wikimedia.org/r/921053 (https://phabricator.wikimedia.org/T260285) (owner: 10Hnowlan) [15:12:41] (03CR) 10Framawiki: [C: 03+1] varnish: Remove bbcrewind exemption for 
Wikimedia Maps [puppet] - 10https://gerrit.wikimedia.org/r/893846 (https://phabricator.wikimedia.org/T331087) (owner: 10Legoktm) [15:14:50] (03CR) 10Vgutierrez: [C: 03+2] varnish: Remove bbcrewind exemption for Wikimedia Maps [puppet] - 10https://gerrit.wikimedia.org/r/893846 (https://phabricator.wikimedia.org/T331087) (owner: 10Legoktm) [15:16:22] 10SRE, 10Maps, 10Patch-For-Review: Remove bbcrewind.co.uk exemption for Wikimedia Maps - https://phabricator.wikimedia.org/T331087 (10Vgutierrez) 05Open→03Resolved [15:19:19] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: This is being worked on [15:19:33] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: This is being worked on [15:37:23] (03PS1) 10EoghanGaffney: Ensure rsync jobs get removed on the non-active machine [puppet] - 10https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) [15:44:42] RECOVERY - MariaDB Replica SQL: s3 on clouddb1021 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:45:16] RECOVERY - MariaDB Replica Lag: s3 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:51:40] PROBLEM - SSH on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:53:08] RECOVERY - SSH on wdqs2004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:05:08] (03PS1) 10Andrew Bogott: cloud-vps: disable profile::prometheus::cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/924106 (https://phabricator.wikimedia.org/T108027) [16:06:57] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: disable profile::prometheus::cadvisor [puppet] - 
10https://gerrit.wikimedia.org/r/924106 (https://phabricator.wikimedia.org/T108027) (owner: 10Andrew Bogott) [16:46:11] (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: Fix variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924110 (https://phabricator.wikimedia.org/T337348) [16:46:53] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: Fix variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924110 (https://phabricator.wikimedia.org/T337348) (owner: 10Gergő Tisza) [16:47:43] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: Fix variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924110 (https://phabricator.wikimedia.org/T337348) (owner: 10Gergő Tisza) [17:06:44] (03PS1) 10Gergő Tisza: GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 (https://phabricator.wikimedia.org/T337348) [17:07:17] (03PS2) 10Gergő Tisza: [beta] GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 (https://phabricator.wikimedia.org/T337348) [17:07:29] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 (https://phabricator.wikimedia.org/T337348) (owner: 10Gergő Tisza) [17:07:57] (03CR) 10CI reject: [V: 04-1] [beta] GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 (https://phabricator.wikimedia.org/T337348) (owner: 10Gergő Tisza) [17:09:00] (03PS3) 10Gergő Tisza: [beta] GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 (https://phabricator.wikimedia.org/T337348) [17:09:10] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 
(https://phabricator.wikimedia.org/T337348) (owner: 10Gergő Tisza) [17:09:55] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: Fix beta $wgGEImageRecommendationServiceUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924111 (https://phabricator.wikimedia.org/T337348) (owner: 10Gergő Tisza) [17:11:12] (03PS1) 10Effie Mouzeli: tegola: Switch swift container to tegola-swift-codfw-v003 [deployment-charts] - 10https://gerrit.wikimedia.org/r/924112 (https://phabricator.wikimedia.org/T333318) [17:43:57] (03CR) 10Stef Dunlap: [C: 03+1] "Thanks! <3" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) (owner: 10Hashar) [18:29:12] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:01:45] (03PS1) 10Bartosz Dziewoński: editpage: Change the order of hooks slightly for FlaggedRevs [core] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924158 (https://phabricator.wikimedia.org/T337637) [19:01:54] (03PS1) 10Bartosz Dziewoński: Hide 'editnotice-notext' message in VE (and mobile apps) [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924159 (https://phabricator.wikimedia.org/T337633) [19:02:16] (03PS1) 10Bartosz Dziewoński: ve.ui.MWGalleryDialog: Fix showing the search panel [extensions/VisualEditor] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924160 (https://phabricator.wikimedia.org/T337638) [19:21:26] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [19:22:48] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.186 second response time 
https://wikitech.wikimedia.org/wiki/Docker [20:07:15] (03CR) 10Volans: "Thanks for the review, addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/924059 (owner: 10Volans) [20:07:25] (03PS2) 10Volans: spicerack: add test-cookbook script [puppet] - 10https://gerrit.wikimedia.org/r/924059 [20:39:39] (03PS1) 10Zabe: tables_to_check: drop revision_comment_temp [software] - 10https://gerrit.wikimedia.org/r/924122 (https://phabricator.wikimedia.org/T215466) [20:39:47] (03CR) 10CI reject: [V: 04-1] tables_to_check: drop revision_comment_temp [software] - 10https://gerrit.wikimedia.org/r/924122 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [20:40:10] (03CR) 10Zabe: "recheck" [software] - 10https://gerrit.wikimedia.org/r/924122 (https://phabricator.wikimedia.org/T215466) (owner: 10Zabe) [20:43:06] (03PS1) 10Volans: Revert "setup.py: limit prospector upper version" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924161 [20:48:55] (03PS2) 10Volans: Revert "setup.py: limit prospector upper version" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924161 [20:55:57] (03CR) 10Volans: [C: 03+2] Revert "setup.py: limit prospector upper version" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924161 (owner: 10Volans) [20:59:50] (03Merged) 10jenkins-bot: Revert "setup.py: limit prospector upper version" [software/spicerack] - 10https://gerrit.wikimedia.org/r/924161 (owner: 10Volans) [21:16:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:07] (ProbeDown) firing: Service 
videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:40] (03CR) 10Volans: [C: 03+1] "Thanks for the patch, LGTM, minor nits inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/924081 (owner: 10Jbond) [21:26:02] (03PS1) 10Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924124 (https://phabricator.wikimedia.org/T337641) [21:26:04] (03PS1) 10Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924125 (https://phabricator.wikimedia.org/T337641) [21:26:53] (03Abandoned) 10Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924125 (https://phabricator.wikimedia.org/T337641) (owner: 10Superpes15) [21:28:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:31:18] (03PS1) 10Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924166 (https://phabricator.wikimedia.org/T337641) [21:32:30] (03Abandoned) 10Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/924124 (https://phabricator.wikimedia.org/T337641) (owner: 10Superpes15) [21:32:54] (03PS2) 10Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924166 (https://phabricator.wikimedia.org/T337641) [21:33:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:38:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:42:07] (03PS1) 10Volans: setup.py: remove temporary upper limits of deps [cookbooks] - 10https://gerrit.wikimedia.org/r/924167 [21:43:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:53:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:58:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:22] 
(ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:09] (PS3) Superpes15: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/924166 (https://phabricator.wikimedia.org/T337641) [22:08:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:09:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:18:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:23:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:24:37] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:12] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-ipmi-exporter.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:20] PROBLEM - PHP7 rendering on mw1468 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:33:20] RECOVERY - PHP7 rendering on mw1468 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.695 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:35:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:38:30] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (Zabe) [22:41:07] (ProbeDown) firing: Service videoscaler:443 has failed probes 
(http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:44:37] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:22] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:55] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (TheresNoTime) I've kept a distant half-eye on some of the CU work going on, and at the //very least// can attest to Zabe's hard work in shepherding this related data migration... [22:50:30] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (TheresNoTime) I've kept a distant half-eye on some of the CU work going on, and at the //very least// can attest to Zabe's hard work in shepherding this related data migration... 
[22:52:03] *sigh* [22:54:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:37] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:02:32] PROBLEM - PHP7 jobrunner on mw1468 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [23:03:58] RECOVERY - PHP7 jobrunner on mw1468 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 2.052 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:04:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:05:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:18] PROBLEM - PHP7 jobrunner on mw1461 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [23:09:37] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:09:42] RECOVERY - PHP7 jobrunner on mw1461 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [23:10:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:11:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:16:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:19:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:21:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:22:50] PROBLEM - PHP7 jobrunner on mw1468 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [23:24:18] RECOVERY - PHP7 jobrunner on mw1468 is 
OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 5.056 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:26:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:36:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:46:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:51:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown