[00:00:11] (PS3) DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092)
[00:04:16] (PS3) Andrew Bogott: Openstack trove hacks: update a patch [puppet] - https://gerrit.wikimedia.org/r/923436
[00:10:48] (CR) Andrew Bogott: [C: +2] Openstack trove hacks: update a patch [puppet] - https://gerrit.wikimedia.org/r/923436 (owner: Andrew Bogott)
[00:19:08] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337276 (Jhancock.wm) Open→Resolved this is the same server as before. resolving
[00:23:26] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (Jhancock.wm) @jcrespo can I get your help depooling this server when you are free. I've tried to reboot the idrac and it's not taking. I believe we need to reboot the whole server to fix this issue. Blinking Amber—Indicates that iDRAC...
[00:32:08] (PS1) Andrew Bogott: Openstack trove hacks: update a patch, take 2 [puppet] - https://gerrit.wikimedia.org/r/923439
[00:32:35] (CR) CI reject: [V: -1] Openstack trove hacks: update a patch, take 2 [puppet] - https://gerrit.wikimedia.org/r/923439 (owner: Andrew Bogott)
[00:33:16] (PS1) Andrew Bogott: Openstack trove hacks: update a patch, take 2 [puppet] - https://gerrit.wikimedia.org/r/923440
[00:33:30] (Abandoned) Andrew Bogott: Openstack trove hacks: update a patch, take 2 [puppet] - https://gerrit.wikimedia.org/r/923439 (owner: Andrew Bogott)
[00:33:45] (CR) Andrew Bogott: [C: +2] Openstack trove hacks: update a patch, take 2 [puppet] - https://gerrit.wikimedia.org/r/923440 (owner: Andrew Bogott)
[00:39:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/922542
[00:39:41] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/922542 (owner: TrainBranchBot)
[00:56:09] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/922542 (owner: TrainBranchBot)
[01:16:51] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-data-backup-gitlab1003.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:33] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:45] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:31] (PS1) Andrew Bogott: Revert "Openstack trove hacks: update a patch, take 2" [puppet] - https://gerrit.wikimedia.org/r/923443
[02:28:33] (PS1) Andrew Bogott: Revert "Openstack trove hacks: update a patch" [puppet] - https://gerrit.wikimedia.org/r/923444
[02:29:46] (CR) Andrew Bogott: [C: +2] Revert "Openstack trove hacks: update a patch, take 2" [puppet] - https://gerrit.wikimedia.org/r/923443 (owner: Andrew Bogott)
[02:29:50] (CR) Andrew Bogott: [C: +2] Revert "Openstack trove hacks: update a patch" [puppet] - https://gerrit.wikimedia.org/r/923444 (owner: Andrew Bogott)
[02:46:36] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:56:36] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:20:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:51:34] !log fab@deploy1002 Started deploy [airflow-dags/research@77cf676]: (no justification provided)
[03:51:52] !log fab@deploy1002 Finished deploy [airflow-dags/research@77cf676]: (no justification provided) (duration: 00m 17s)
[03:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:53:35] (CR) Giuseppe Lavagetto: Add the possibility to override CI settings using a .fixturesctl.yaml files (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/922793 (https://phabricator.wikimedia.org/T337359) (owner: Giuseppe Lavagetto)
[04:54:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:55:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:55:45] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:56:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:57:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:57:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49994 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:00:47] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[05:02:13] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[05:06:39] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1762 MB (3% inode=97%): /tmp 1762 MB (3% inode=97%): /var/tmp 1762 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[05:26:38] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: Stevemunene)
[05:32:12] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/923245 (owner: Slyngshede)
[05:35:34] (CR) Muehlenhoff: [C: +1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[05:41:38] (PS1) Muehlenhoff: Update MOU date [puppet] - https://gerrit.wikimedia.org/r/923446
[05:44:09] (CR) Muehlenhoff: [C: +2] Update MOU date [puppet] - https://gerrit.wikimedia.org/r/923446 (owner: Muehlenhoff)
[05:58:35] PROBLEM - Check systemd state on db1156 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230526T0600)
[06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48576 and previous config saved to /var/cache/conftool/dbconfig/20230526-061236-root.json
[06:13:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48577 and previous config saved to /var/cache/conftool/dbconfig/20230526-061330-root.json
[06:16:46] (PS1) Jameel Kaisar: Allow query parameters in network probe url [puppet] - https://gerrit.wikimedia.org/r/923448 (https://phabricator.wikimedia.org/T337317)
[06:17:52] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/923448 (https://phabricator.wikimedia.org/T337317) (owner: Jameel Kaisar)
[06:26:47] (PS2) Elukey: helmfile.d: attempt to fix changeprop's staging config for Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/923376
[06:27:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48578 and previous config saved to /var/cache/conftool/dbconfig/20230526-062741-root.json
[06:28:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48579 and previous config saved to /var/cache/conftool/dbconfig/20230526-062835-root.json
[06:28:45] (PS3) Elukey: helmfile.d: attempt to fix changeprop's staging config for Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/923376
[06:31:54] (CR) Ayounsi: [C: +1] Adjust Eqiad row E/F switch parents in hierdata after cable moves [puppet] - https://gerrit.wikimedia.org/r/923395 (https://phabricator.wikimedia.org/T322937) (owner: Cathal Mooney)
[06:33:19] (CR) Elukey: "Hugh: we didn't realize that the new Lift Wing config is causing changeprop in staging to crashloop. The errors are:" [deployment-charts] - https://gerrit.wikimedia.org/r/923376 (owner: Elukey)
[06:36:30] !log `truncate /var/log/kerberos/krb5kdc.log -s 10g` on krb1001 to avoid the root partition to fill up
[06:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:52] moritzm: o/ --^ There are a ton of logs in krb5kdc.log, mostly from analytics nodes
[06:38:58] having a look
[06:42:22] !log `apt-get clean` on stat1008 to clean up some space in the root partition
[06:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48580 and previous config saved to /var/cache/conftool/dbconfig/20230526-064245-root.json
[06:43:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48581 and previous config saved to /var/cache/conftool/dbconfig/20230526-064340-root.json
[06:44:21] PROBLEM - MariaDB Replica IO: s1 on clouddb1017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 171966471-171966471-62, which is not in the masters binlog https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:44:37] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 88615.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:05] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 88644.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:13] PROBLEM - MariaDB Replica IO: s1 on clouddb1013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 171966471-171966471-62, which is not in the masters binlog https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:47:53] elukey: the increased log rates all seem to be coming from the coordinator hosts, there might have been some software updates there for the bullseye updates, will check with Ben when he's around
[06:48:17] super
[06:49:06] (CR) Slyngshede: [C: +2] C:IDM Ensure service restart on git update [puppet] - https://gerrit.wikimedia.org/r/923245 (owner: Slyngshede)
[06:49:23] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[06:54:21] RECOVERY - Check systemd state on db1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:57:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48582 and previous config saved to /var/cache/conftool/dbconfig/20230526-065750-root.json
[06:58:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48583 and previous config saved to /var/cache/conftool/dbconfig/20230526-065844-root.json
[07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230526T0700)
[07:04:55] (PS1) Muehlenhoff: Remove outdated Hiera setting [puppet] - https://gerrit.wikimedia.org/r/923484
[07:06:58] (CR) Btullis: [C: +1] "I'm happy to add this as an experiment, but I'm not convinced that it's going to fix the issue that we're seeing." [puppet] - https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: Stevemunene)
[07:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48584 and previous config saved to /var/cache/conftool/dbconfig/20230526-071255-root.json
[07:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48585 and previous config saved to /var/cache/conftool/dbconfig/20230526-071349-root.json
[07:18:19] PROBLEM - MariaDB Replica IO: s1 on clouddb1021 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 171966471-171966471-62, which is not in the masters binlog https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:18:51] PROBLEM - MariaDB Replica Lag: s3 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 90670.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:28:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48586 and previous config saved to /var/cache/conftool/dbconfig/20230526-072759-root.json
[07:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48587 and previous config saved to /var/cache/conftool/dbconfig/20230526-072854-root.json
[07:31:26] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41363/console" [puppet] - https://gerrit.wikimedia.org/r/923484 (owner: Muehlenhoff)
[07:33:47] (CR) Elukey: [V: +1 C: +1] Remove outdated Hiera setting [puppet] - https://gerrit.wikimedia.org/r/923484 (owner: Muehlenhoff)
[07:34:33] (CR) Elukey: [V: +1 C: +1] Remove outdated Hiera setting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/923484 (owner: Muehlenhoff)
[07:34:58] (CR) Muehlenhoff: [C: +1] Add the refinery-cache/revs directory to git safe list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: Stevemunene)
[07:42:10] (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/923348 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[07:43:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48588 and previous config saved to /var/cache/conftool/dbconfig/20230526-074304-root.json
[07:43:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48589 and previous config saved to /var/cache/conftool/dbconfig/20230526-074358-root.json
[07:44:35] (CR) Muehlenhoff: Remove outdated Hiera setting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/923484 (owner: Muehlenhoff)
[07:44:37] (CR) Muehlenhoff: [C: +2] Remove outdated Hiera setting [puppet] - https://gerrit.wikimedia.org/r/923484 (owner: Muehlenhoff)
[07:47:51] (CR) Giuseppe Lavagetto: [C: +1] conftool: Add more servers to the jobrunner problem [puppet] - https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366) (owner: Effie Mouzeli)
[07:51:17] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:09] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:49] (CR) Effie Mouzeli: "LGTM, do you think it would make sense to keep those wikis in another variable and reference it here? Obviously just for visuals, no stron" [puppet] - https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[07:53:56] (CR) Effie Mouzeli: [C: +1] mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[07:54:41] (CR) Effie Mouzeli: [C: +1] testwikidatawiki: Fix missing mobile redir to k8s [puppet] - https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[07:56:15] (CR) Muehlenhoff: debmonitor::server: Add bookworm support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922145 (owner: Muehlenhoff)
[07:56:22] (PS2) Muehlenhoff: debmonitor::server: Add bookworm support [puppet] - https://gerrit.wikimedia.org/r/922145
[07:56:24] (CR) Effie Mouzeli: [C: +1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[07:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48590 and previous config saved to /var/cache/conftool/dbconfig/20230526-075809-root.json
[07:58:47] (PS2) Ilias Sarantopoulos: ml-services: deploy bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861)
[07:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48591 and previous config saved to /var/cache/conftool/dbconfig/20230526-075903-root.json
[08:02:43] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/922145 (owner: Muehlenhoff)
[08:02:47] (CR) Filippo Giunchedi: "Is the idea to get probes only from the catalog entries? AFAICS there are prometheus::blackbox::check::http probes for both, which should " [puppet] - https://gerrit.wikimedia.org/r/923263 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[08:06:51] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:00] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/922543
[08:10:12] !log jiji@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=parse101[3-6].eqiad.wmnet
[08:13:59] (CR) Effie Mouzeli: [C: +2] conftool: Add more servers to the jobrunner problem [puppet] - https://gerrit.wikimedia.org/r/923426 (https://phabricator.wikimedia.org/T329366) (owner: Effie Mouzeli)
[08:15:28] (PS3) Elukey: ml-services: deploy bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:17:05] (CR) Filippo Giunchedi: "LGTM overall, left some notes" [debs/pyrra] - https://gerrit.wikimedia.org/r/922608 (owner: Herron)
[08:17:13] (CR) CI reject: [V: -1] ml-services: deploy bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:19:49] PROBLEM - mediawiki-installation DSH group on parse1015 is CRITICAL: Host parse1015 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:20:15] (PS1) KartikMistry: Undeploy Special:Contribute from unsupported skins [mediawiki-config] - https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366)
[08:20:47] PROBLEM - mediawiki-installation DSH group on parse1016 is CRITICAL: Host parse1016 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:21:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:22:31] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:24:30] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (jcrespo) Hi, let me know how I can help, but if I understand it rightly, cp2035 is a #traffic host, so better contacting either @Vgutierrez on Europe time or @BBlack in US time (it has nothing to do with my team, data persistence). I c...
[08:24:45] SRE, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi) With the latest changes in place we have the following metrics, `cp1075` has block/network accounting enabled and thus...
[08:29:13] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:34:18] (CR) Klausman: [C: +1] helmfile.d: attempt to fix changeprop's staging config for Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/923376 (owner: Elukey)
[08:34:56] (CR) Jdlrobson: [C: +1] Undeploy Special:Contribute from unsupported skins [mediawiki-config] - https://gerrit.wikimedia.org/r/923527 (https://phabricator.wikimedia.org/T337366) (owner: KartikMistry)
[08:35:35] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:39:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host parse1013.eqiad.wmnet with OS buster
[08:39:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS buster
[08:39:24] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host parse1015.eqiad.wmnet with OS buster
[08:39:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host parse1016.eqiad.wmnet with OS buster
[08:41:07] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:44:16] (PS1) Filippo Giunchedi: profile: add ensure for prometheus::cadvisor [puppet] - https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027)
[08:44:18] (PS1) Filippo Giunchedi: profile: start cadvisor rollout in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/923531 (https://phabricator.wikimedia.org/T108027)
[08:47:21] (CR) Arturo Borrero Gonzalez: [C: +1] grid_configurator: use mwopenstackclients library [puppet] - https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[08:47:27] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41364/console" [puppet] - https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:47:55] (CR) CI reject: [V: -1] profile: add ensure for prometheus::cadvisor [puppet] - https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:50:13] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:51:45] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage
[08:51:55] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage
[08:52:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage
[08:52:22] (PS3) Jbond: install_console: restrict options used [puppet] - https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348)
[08:52:33] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage
[08:54:17] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage
[08:54:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage
[08:55:33] (CR) Jbond: "updated" [puppet] - https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: Jbond)
[08:55:59] (PS2) Filippo Giunchedi: profile: add ensure for prometheus::cadvisor [puppet] - https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027)
[08:56:01] (PS2) Filippo Giunchedi: profile: start cadvisor rollout in eqiad/codfw [puppet] - https://gerrit.wikimedia.org/r/923531 (https://phabricator.wikimedia.org/T108027)
[08:56:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage
[08:58:39] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41365/console" [puppet] - https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:59:16] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41366/console" [puppet] - https://gerrit.wikimedia.org/r/923531 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:59:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage
[09:05:49] (CR) Jbond: puppet-merge: implement Lock out, tag out (3 comments) [puppet] - https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: Jbond)
[09:08:05] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1015.eqiad.wmnet with OS buster
[09:09:54] (CR) Elukey: [C: +2] helmfile.d: attempt to fix changeprop's staging config for Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/923376 (owner: Elukey)
[09:10:06] (PS3) Jbond: puppet-merge: implement Lock out, tag out [puppet] - https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872)
[09:10:41] (CR) CI reject: [V: -1] puppet-merge: implement Lock out, tag out [puppet] - https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: Jbond)
[09:13:33] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[09:13:44] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[09:17:00] (PS3) Ilias Sarantopoulos: ORES: add model versions configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170)
[09:18:47] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/922145 (owner: Muehlenhoff)
[09:19:01] (CR) Jbond: [C: +1] Add debmonitor[12]003 as additional scap targets [puppet] - https://gerrit.wikimedia.org/r/922126 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[09:23:02] !log jnuche@deploy1002 Installing scap version "4.52.3" for 596 hosts
[09:23:13] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1013.eqiad.wmnet with OS buster
[09:24:00] !log jnuche@deploy1002 Installation of scap version "4.52.3" completed for 596 hosts
[09:25:19] (CR) WMDE-leszek: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/923427 (https://phabricator.wikimedia.org/T336659) (owner: WMDE-leszek)
[09:26:17] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1014.eqiad.wmnet with OS buster
[09:26:18] !log parse1013-parse1016 have been depooled and removed from the parsoid-php service - T329366
[09:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:24] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366
[09:27:02] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:07] (CR) Jbond: [C: +2] puppetmaster: fix puppetdb_submit_only_hosts [puppet] - https://gerrit.wikimedia.org/r/923356 (owner: Jbond)
[09:28:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1016.eqiad.wmnet with OS buster
[09:28:50] (CR) Jbond: [C: +2] puppetmnaster::frontend: configure puppetmaster2004 as a canary [puppet] - https://gerrit.wikimedia.org/r/923333 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[09:28:55] (CR) Jbond: [V: +1 C: +2] puppetmaster2004: enable subimt_only [puppet] - https://gerrit.wikimedia.org/r/923353 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[09:29:41] !log disable puppet fleet wide to deploy minor puppet change https://gerrit.wikimedia.org/r/c/operations/puppet/+/923353
[09:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:33] (PS4) Elukey: ml-services: deploy bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[09:38:35] (PS1) Elukey: rakemodules: improve condition in should_patch? [deployment-charts] - https://gerrit.wikimedia.org/r/923538
[09:38:48] (CR) Alexandros Kosiaris: [C: +1] Make kubernetes::clusters the central place for k8s config [puppet] - https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[09:40:45] (PS1) Jbond: puppetmaster2004: use correct puppetdb [puppet] - https://gerrit.wikimedia.org/r/923544
[09:41:17] (CR) Jbond: [V: +2 C: +2] puppetmaster2004: use correct puppetdb [puppet] - https://gerrit.wikimedia.org/r/923544 (owner: Jbond)
[09:42:14] SRE, Data-Engineering, Infrastructure-Foundations: Investigate crypto deprecations after Bullseye update - https://phabricator.wikimedia.org/T337544 (MoritzMuehlenhoff)
[09:42:26] SRE, Data-Engineering, Infrastructure-Foundations: Investigate crypto KDC deprecations after Bullseye update - https://phabricator.wikimedia.org/T337544 (MoritzMuehlenhoff) p:Triage→Medium
[09:43:47] (PS1) Zabe: maintain-views: Drop views on revision_comment_temp [puppet] - https://gerrit.wikimedia.org/r/923545 (https://phabricator.wikimedia.org/T275246)
[09:44:54] (CR) CI reject: [V: -1] ml-services: deploy bloom-3b model [deployment-charts] - https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[09:48:42] (PS1) Jbond: puppetmaster2004: also set command_broadcast [puppet] - https://gerrit.wikimedia.org/r/923546
[09:49:03] (CR) Jbond: [V: +2 C: +2] puppetmaster2004: also set command_broadcast [puppet] - https://gerrit.wikimedia.org/r/923546 (owner: Jbond)
[09:50:53] (PS1) David Caro: gitlab.runners: allow tools/toolsbeta harbor instances [puppet] - https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130)
[09:52:22] (PS2) David Caro: gitlab.runners: allow tools/toolsbeta harbor instances [puppet] - https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130)
[09:52:32] (03CR) 10David Caro: gitlab.runners: allow tools/toolsbeta harbor instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) (owner: 10David Caro) [09:54:02] !log pool parse1013-parse1016 to the jobrunner cluster - T329366 [09:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:07] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [09:56:46] (03PS2) 10Elukey: rakemodules: improve condition in should_patch? [deployment-charts] - 10https://gerrit.wikimedia.org/r/923538 [09:56:50] (03PS5) 10Elukey: ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [09:57:42] (03CR) 10CI reject: [V: 04-1] rakemodules: improve condition in should_patch? [deployment-charts] - 10https://gerrit.wikimedia.org/r/923538 (owner: 10Elukey) [09:57:47] (03CR) 10CI reject: [V: 04-1] ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [09:58:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:00:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49994 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:00:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.372 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:01:59] (03CR) 10Majavah: "Note that this would effectively unblock access to any services 
behind the WMCS shared web proxy." [puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) (owner: 10David Caro) [10:06:44] (03PS4) 10Ilias Sarantopoulos: ORES: add model versions configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) [10:06:59] (03PS3) 10Elukey: rakemodules: improve condition in should_patch? [deployment-charts] - 10https://gerrit.wikimedia.org/r/923538 [10:07:01] (03PS6) 10Elukey: ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [10:12:19] (03CR) 10David Caro: gitlab.runners: allow tools/toolsbeta harbor instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) (owner: 10David Caro) [10:13:41] (03CR) 10David Caro: gitlab.runners: allow tools/toolsbeta harbor instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) (owner: 10David Caro) [10:13:46] (03CR) 10CI reject: [V: 04-1] ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [10:16:21] (03PS1) 10Marostegui: Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/923522 [10:16:47] (03PS3) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [10:16:49] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: split BGP code into separate profile [puppet] - 10https://gerrit.wikimedia.org/r/923551 (https://phabricator.wikimedia.org/T324992) [10:16:51] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs [puppet] - 
10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) [10:17:48] (03CR) 10Marostegui: [C: 03+2] Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/923522 (owner: 10Marostegui) [10:20:17] (03CR) 10CI reject: [V: 04-1] cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [10:20:46] RECOVERY - mediawiki-installation DSH group on parse1015 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:20:55] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:23:40] RECOVERY - mediawiki-installation DSH group on parse1016 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:24:18] (03CR) 10Cathal Mooney: [C: 03+2] Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney) [10:24:26] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney) [10:24:53] (03Merged) 10jenkins-bot: Fix cable validator to allow editing of existing cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/923406 (https://phabricator.wikimedia.org/T310590) (owner: 10Cathal Mooney) [10:27:26] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:33:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for Manuel - https://phabricator.wikimedia.org/T336841 (10Manuel) Thank you for your quick solution! 
Kerberos is working now \o/ [10:33:42] (03PS7) 10Elukey: ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [10:34:12] (03CR) 10WMDE-leszek: "No idea if this is the right "fix". It seemed to be the only place I could identify that would possibly disallow PATCH requests on Beta si" [puppet] - 10https://gerrit.wikimedia.org/r/923427 (https://phabricator.wikimedia.org/T336659) (owner: 10WMDE-leszek) [10:38:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:47:02] (03PS8) 10Elukey: ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [10:51:34] (03CR) 10Hoo man: [C: 04-1] install_console: restrict options used (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [10:54:08] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:54:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:56:54] (03CR) 10Elukey: "No idea what is the best way forward, lemme know :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923538 (owner: 10Elukey) [10:58:49] (03PS1) 10Marostegui: wiki-replicas.sql: Create role [puppet] - 10https://gerrit.wikimedia.org/r/923558 [10:59:07] (03PS2) 10Marostegui: wiki-replicas.sql: Create role [puppet] - 10https://gerrit.wikimedia.org/r/923558 (https://phabricator.wikimedia.org/T337446) [10:59:21] (03CR) 10Cathal Mooney: [C: 03+2] Adjust Eqiad row E/F switch parents in hierdata after cable moves [puppet] - 10https://gerrit.wikimedia.org/r/923395 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [10:59:30] (03CR) 
10CI reject: [V: 04-1] wiki-replicas.sql: Create role [puppet] - 10https://gerrit.wikimedia.org/r/923558 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [11:02:21] (03PS1) 10Slyngshede: C:IDM Remove absent systemd services. [puppet] - 10https://gerrit.wikimedia.org/r/923559 [11:03:15] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: split BGP code into separate profile [puppet] - 10https://gerrit.wikimedia.org/r/923551 (https://phabricator.wikimedia.org/T324992) [11:03:17] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) [11:03:19] (03PS4) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [11:07:11] (03CR) 10CI reject: [V: 04-1] cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [11:07:13] 10SRE, 10Icinga, 10observability, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10MoritzMuehlenhoff) [11:08:41] (03PS2) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) [11:16:24] (03CR) 10Ilias Sarantopoulos: "Hey Amir, is this the way you proposed to me to add the information from config instead of loading from files?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [11:21:09] (03PS4) 10Jbond: install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) [11:21:23] (03CR) 10Muehlenhoff: proffile::firewall: create new firewall profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:34:22] (03CR) 10Hashar: [C: 03+2] wm-patch-demo: link to other patches [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922882 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [11:34:26] (03CR) 10Jbond: "Thanks for the input see inline" [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [11:34:30] (03CR) 10Hashar: [C: 03+2] wm-patch-demo: use WARNING to prevent chipset collapsing [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923418 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [11:34:54] (03Merged) 10jenkins-bot: wm-patch-demo: link to other patches [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/922882 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [11:34:58] (03Merged) 10jenkins-bot: wm-patch-demo: use WARNING to prevent chipset collapsing [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923418 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [11:35:46] !log hashar@deploy1002 Started deploy [gerrit/gerrit@c490ae6]: wm-patch-demo: link to other patches, use WARNING to prevent chipset collapsing | T332474 [11:35:51] T332474: [wm-checks-api] Create a new gerrit bot for Patch Demo - https://phabricator.wikimedia.org/T332474 [11:35:54] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@c490ae6]: wm-patch-demo: link to other patches, use 
WARNING to prevent chipset collapsing | T332474 (duration: 00m 08s) [11:39:14] (03PS3) 10Marostegui: wiki-replicas.sql: Create role [puppet] - 10https://gerrit.wikimedia.org/r/923558 (https://phabricator.wikimedia.org/T337446) [11:40:33] (03CR) 10Marostegui: [C: 03+2] wiki-replicas.sql: Create role [puppet] - 10https://gerrit.wikimedia.org/r/923558 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [11:46:39] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC run https://puppet-compiler.wmflabs.org/output/923531/41368/ shows this working as expected (i.e. not all hosts will run cadvisor). Nu" [puppet] - 10https://gerrit.wikimedia.org/r/923531 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [11:50:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [11:51:03] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [11:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:57:20] (03PS5) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [11:58:44] (03CR) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:00:33] (03PS1) 10Jelto: gitlab: make sure rsync jobs run after backup [puppet] - 
10https://gerrit.wikimedia.org/r/923586 [12:02:06] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [12:02:39] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41369/console" [puppet] - 10https://gerrit.wikimedia.org/r/923586 (owner: 10Jelto) [12:06:06] (03PS5) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [12:06:53] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10MoritzMuehlenhoff) I'd say let's just remove legacy_compat, nothing should rely on it anymore. [12:07:01] 10SRE, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10MoritzMuehlenhoff) [12:10:54] 10SRE, 10Infrastructure-Foundations: Redefine privileges and access for perf-roots group - https://phabricator.wikimedia.org/T207666 (10MoritzMuehlenhoff) 05Open→03Invalid Indeed, there's nothing really to fix here (or something changed between 2018 and now): perf-roots grants a few people root access on a... 
[12:11:47] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 76 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:12:38] (03PS1) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on some top wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) [12:13:27] (03PS2) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on some top wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) [12:13:32] 10SRE, 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, and 2 others: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:15:51] (03PS1) 10Hashar: wm-patch-demo: do not return runs when there are no wikis [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923589 (https://phabricator.wikimedia.org/T332474) [12:16:13] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/923324/41372/" [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:17:17] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:20:05] (03CR) 10Hashar: [C: 03+2] wm-patch-demo: do not return runs when there are no wikis [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923589 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [12:20:33] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: 
make sure rsync jobs run after backup [puppet] - 10https://gerrit.wikimedia.org/r/923586 (owner: 10Jelto) [12:20:45] (03Merged) 10jenkins-bot: wm-patch-demo: do not return runs when there are no wikis [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923589 (https://phabricator.wikimedia.org/T332474) (owner: 10Hashar) [12:21:09] !log hashar@deploy1002 Started deploy [gerrit/gerrit@0932557]: wm-patch-demo: do not return runs when there are no wikis | T332474 [12:21:14] T332474: [wm-checks-api] Create a new gerrit bot for Patch Demo - https://phabricator.wikimedia.org/T332474 [12:21:17] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@0932557]: wm-patch-demo: do not return runs when there are no wikis | T332474 (duration: 00m 08s) [12:21:53] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: make sure rsync jobs run after backup [puppet] - 10https://gerrit.wikimedia.org/r/923586 (owner: 10Jelto) [12:22:30] (03PS6) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [12:26:05] (03PS4) 10Clément Goubert: testwikidatawiki: Fix missing mobile redir to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923384 (https://phabricator.wikimedia.org/T337490) [12:26:07] (03PS4) 10Clément Goubert: mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923385 (https://phabricator.wikimedia.org/T337490) [12:26:09] (03PS7) 10Clément Goubert: mw-on-k8s: Redirect closed wikis to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) [12:31:13] (03PS3) 10David Caro: gitlab.runners: allow cloudvps public proxied serivces [puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) [12:36:39] [12:39:55] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [12:40:38] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) @Hghani i had forgot to add you to the ldap group, it should be working now. [please reopen if not [12:41:59] !log bblack@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add rest of eqiad+codfw pybal IPs - bblack@cumin1001" [12:42:45] (03CR) 10Ladsgroup: ORES: add model versions configuration (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [12:43:01] !log bblack@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add rest of eqiad+codfw pybal IPs - bblack@cumin1001" [12:43:01] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:47:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [12:47:20] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [12:47:39] (03CR) 10Cathal Mooney: "LGTM overall, one small thing needs to be changed then we can merge. Also made another comment but I think it's not relevant." [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:48:49] (03CR) 10Jelto: [C: 03+1] "looks reasonable to allow Shared GitLab Runners on all wmcloud proxy services." 
[puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) (owner: 10David Caro) [12:49:14] (03PS1) 10BBlack: Add pybal-low-traffic.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/923598 (https://phabricator.wikimedia.org/T334703) [12:49:18] (03PS12) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:50:25] (03CR) 10BBlack: [C: 03+2] Add pybal-low-traffic.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/923598 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [12:51:42] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [12:51:47] (03PS6) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [12:53:15] (03CR) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [12:53:46] (03PS13) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [12:56:21] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [12:59:01] (03PS5) 10Ilias Sarantopoulos: ORES: add model versions configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) [13:00:45] (03PS4) 10David Caro: gitlab.runners: allow cloudvps public proxied serivces [puppet] - 10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) [13:01:26] (03CR) 10David Caro: [C: 03+2] "just renamed the hiera entry to reflect the meaning too, minor change" [puppet] - 
10https://gerrit.wikimedia.org/r/923547 (https://phabricator.wikimedia.org/T336130) (owner: 10David Caro) [13:06:27] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [13:12:01] !log bblack@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add the new pybal IPs at edge-only sites - bblack@cumin1001" [13:13:03] !log bblack@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add the new pybal IPs at edge-only sites - bblack@cumin1001" [13:13:03] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:13:07] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10hashar) [13:17:12] (03CR) 10Mvolz: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [13:19:28] (03CR) 10Ilias Sarantopoulos: "Done! So is this deployed along with the MediaWiki deployment schedule?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:20:55] (03CR) 10Herron: [C: 03+1] profile: add ensure for prometheus::cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [13:23:57] (03CR) 10JMeybohm: [C: 04-1] "Alex and I have come to an agreement that the operator (not the apps) should be/stay deployed in staging-codfw as well. 
Please remove the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [13:24:30] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] profile: add ensure for prometheus::cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/923530 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [13:24:53] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:25:26] (03CR) 10Herron: [C: 03+1] profile: start cadvisor rollout in eqiad/codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923531 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [13:25:40] (03Abandoned) 10Ottomata: Undeploy flink-operator and uncreate service namespace in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922138 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [13:26:03] RECOVERY - MariaDB Replica IO: s1 on clouddb1021 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:27:17] (03PS14) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 [13:29:48] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede) [13:34:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:42] (03PS1) 10Jbond: Ganeti: Add small script to display free resources in gnt groups [puppet] - 10https://gerrit.wikimedia.org/r/923608 [13:35:01] (03CR) 10Ladsgroup: ORES: add model 
versions configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:35:13] (03CR) 10JMeybohm: [C: 03+2] Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 (owner: 10JMeybohm) [13:35:18] (03PS5) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [13:35:21] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [13:36:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41373/console" [puppet] - 10https://gerrit.wikimedia.org/r/923608 (owner: 10Jbond) [13:36:11] (03CR) 10CI reject: [V: 04-1] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [13:36:13] (03CR) 10CI reject: [V: 04-1] Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 (owner: 10JMeybohm) [13:37:40] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Stop validating against k8s 1.16, add validation against 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922829 (owner: 10JMeybohm) [13:39:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:00] (03PS1) 10Jelto: Revert "Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check"" [puppet] - 10https://gerrit.wikimedia.org/r/923524 
[13:41:32] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM - https://phabricator.wikimedia.org/T337555 (10jbond) [13:41:48] (03CR) 10Ladsgroup: [C: 04-1] "I'm planning to deploy a change that'd reduce 10-12% of pressure from jobrunners (I24a5431149be8), let's not step on each other toes to be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:41:56] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM - https://phabricator.wikimedia.org/T337556 (10jbond) [13:42:21] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM - https://phabricator.wikimedia.org/T337556 (10jbond) 05Open→03In progress p:05Triage→03Medium [13:42:31] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM - https://phabricator.wikimedia.org/T337555 (10jbond) 05Open→03In progress [13:43:33] (03PS2) 10Jelto: Revert "Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check"" [puppet] - 10https://gerrit.wikimedia.org/r/923524 (https://phabricator.wikimedia.org/T300171) [13:45:48] !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host puppetdb1003.eqiad.wmnet [13:45:50] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [13:46:12] (03CR) 10Jelto: [C: 03+2] Revert "Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check"" [puppet] - 10https://gerrit.wikimedia.org/r/923524 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [13:46:45] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host puppetdb2003.codfw.wmnet [13:46:46] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [13:48:55] (03PS1) 10Elukey: changeprop: add hack to allow quotes in lift wing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 [13:49:54] (03PS2) 10Elukey: changeprop: add hack to allow quotes in lift wing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 [13:50:55] (03CR) 10Muehlenhoff: Ganeti: 
Add small script to display free resources in gnt groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923608 (owner: 10Jbond) [13:51:25] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [13:52:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:31] (03PS3) 10Elukey: changeprop: add hack to allow quotes in lift wing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 [13:55:28] (03CR) 10Herron: [C: 03+2] mwlog: remove redis instance [puppet] - 10https://gerrit.wikimedia.org/r/923348 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [13:55:31] !log jbond@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:55:37] (03PS4) 10Elukey: changeprop: add hack to allow quotes in lift wing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 [13:55:42] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host puppetdb2003.codfw.wmnet [13:55:44] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10Jhancock.wm) @jcrespo thank you for the insight! @BBlack could you assist me with this? when would be a good time for this? I know we're about to go into a holiday weekend. but the server itself is not impacted, just the idrac. [13:56:07] (03CR) 10Elukey: "Not really proud of this change but I can't come up with anything easier, lemme know your thoughts!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [13:56:15] (03PS6) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [13:56:23] !log jbond@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:56:30] !log jbond@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host puppetdb1003.eqiad.wmnet [13:56:38] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host puppetdb2003.codfw.wmnet [13:56:40] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [13:56:56] (03PS7) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) [13:57:12] (03CR) 10Ottomata: flink-operator - deploy in wikikube eqiad and codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/922874 (https://phabricator.wikimedia.org/T333464) (owner: 10Ottomata) [13:58:07] !log jbond@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:58:10] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host puppetdb2003.codfw.wmnet [13:58:19] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host puppetboard2003.codfw.wmnet [13:58:20] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [14:01:24] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM puppetboard2003.codfw.wmnet - jbond@cumin2002" [14:02:30] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM puppetboard2003.codfw.wmnet - jbond@cumin2002" [14:02:30] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:30] !log jbond@cumin2002 
START - Cookbook sre.dns.wipe-cache puppetboard2003.codfw.wmnet on all recursors [14:02:33] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard2003.codfw.wmnet on all recursors [14:02:54] !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host puppetboard1003.eqiad.wmnet [14:02:56] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [14:02:58] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM puppetboard2003.codfw.wmnet - jbond@cumin2002" [14:03:35] (03PS3) 10Ottomata: mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) [14:03:48] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM puppetboard2003.codfw.wmnet - jbond@cumin2002" [14:03:48] !log jbond@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetboard2003.codfw.wmnet [14:05:23] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM puppetboard1003.eqiad.wmnet - jbond@cumin1001" [14:06:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM puppetboard1003.eqiad.wmnet - jbond@cumin1001" [14:06:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:28] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard1003.eqiad.wmnet on all recursors [14:06:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard1003.eqiad.wmnet on all recursors [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:58] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM puppetboard1003.eqiad.wmnet - jbond@cumin1001" [14:08:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM puppetboard1003.eqiad.wmnet - jbond@cumin1001" [14:08:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetboard1003.eqiad.wmnet [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:06] (ProbeDown) firing: (4) Service miscweb1003:443 has failed probes (http_15_wikipedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:22] (03PS1) 10Jelto: Revert "Revert "Revert "miscweb: set ipv4 and ipv6 for 15 and annual blackbox check""" [puppet] - 10https://gerrit.wikimedia.org/r/923627 [14:20:51] 10SRE, 10Security-Team, 10Security: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10sbassett) [14:21:00] (03PS2) 10Jbond: Ganeti: Add small script to display free resources in gnt groups [puppet] - 10https://gerrit.wikimedia.org/r/923608 [14:21:28] (03CR) 10Jbond: Ganeti: Add small script to display free resources in gnt groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923608 (owner: 10Jbond) [14:22:01] 10SRE, 
10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM - https://phabricator.wikimedia.org/T337555 (10jbond) 05In progress→03Resolved a:03jbond created [14:22:10] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM - https://phabricator.wikimedia.org/T337556 (10jbond) 05In progress→03Resolved a:03jbond created [14:22:54] (03CR) 10Jelto: [C: 03+2] "blackbox checks probably also needs another port (30443). I'll revert for now and open a clean change Monday to try out the new settings." [puppet] - 10https://gerrit.wikimedia.org/r/923627 (owner: 10Jelto) [14:24:36] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Jhancock.wm) @Marostegui I am looking for a suitable cpu replacement in our decommissioned servers. In the meantime Log Event 265 recommends a BIOS update. The bios is very out of date on this one and I am running that t... [14:24:54] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10serviceops-collab, 10Security: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10jbond) @sbassett I'm not sure who manages the github account. for now I'll tag #serviceops-collab who manage gitlab and #... [14:25:00] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name="parse.*" [14:25:08] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) Thanks! 
[14:25:09] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name="parse.*" [14:25:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=eqiad,name=parse.* [14:26:36] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10serviceops-collab, 10Security: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10sbassett) Hey @jbond - I had been talking to @Clement_Goubert on Slack about this, so that's why I tagged them/SRE. In... [14:26:44] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: cluster=videoscaler,dc=eqiad,name=parse.* [14:30:24] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10jbond) [14:33:36] (03PS1) 10Jbond: httpd: set legacy_compat to absent [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) [14:33:38] (03PS1) 10Jbond: httpd: remove legacy_compat option [puppet] - 10https://gerrit.wikimedia.org/r/923616 (https://phabricator.wikimedia.org/T258686) [14:34:34] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10jbond) > The httpd class still has the legacy_compat option but nobody uses it anymore: This is not exactly true. The default valu... 
[14:35:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:16] (03CR) 10AikoChou: "I saw an usage of message_key_fields here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [14:36:28] (03CR) 10CI reject: [V: 04-1] httpd: remove legacy_compat option [puppet] - 10https://gerrit.wikimedia.org/r/923616 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [14:36:57] (03CR) 10CI reject: [V: 04-1] httpd: set legacy_compat to absent [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [14:38:19] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:25] (03CR) 10Elukey: changeprop: add hack to allow quotes in lift wing's config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [14:39:54] (03PS2) 10Jbond: httpd: set legacy_compat to absent [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) [14:39:56] (03PS2) 10Jbond: httpd: remove legacy_compat option [puppet] - 10https://gerrit.wikimedia.org/r/923616 (https://phabricator.wikimedia.org/T258686) [14:40:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:53] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/923551 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [14:43:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [14:44:50] (03CR) 10Jbond: [C: 03+1] cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) (owner: 10Arturo Borrero Gonzalez) [14:45:57] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:00] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseTmpWbsubscribersSensibleOutput feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923619 (https://phabricator.wikimedia.org/T335783) [14:51:28] (03PS3) 10Jbond: proffile::firewall: create new firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) [14:51:30] (03PS13) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [14:51:32] (03PS3) 10Jbond: base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) [14:51:34] (03PS13) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [14:51:37] (03PS15) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [14:51:39] (03PS1) 10Jbond: firewall: drop block_abuse_nets parameter [puppet] - 10https://gerrit.wikimedia.org/r/923620 (https://phabricator.wikimedia.org/T279683) 
[14:51:45] (03CR) 10AikoChou: [C: 03+1] changeprop: add hack to allow quotes in lift wing's config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [14:53:11] (03CR) 10Jbond: proffile::firewall: create new firewall profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:53:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Hghani) yes it is working, thanks [14:53:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:06] (03CR) 10Klausman: [C: 03+1] changeprop: add hack to allow quotes in lift wing's config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [14:55:53] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseTmpEnableLabelsInApiSummaries feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923623 (https://phabricator.wikimedia.org/T335107) [14:56:57] (03CR) 10Klausman: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [14:57:18] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:57:24] (03PS1) 10AikoChou: Declare mediawiki.page_outlink_topic_prediction_change.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) [14:59:13] (03PS9) 10Giuseppe Lavagetto: ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias 
Sarantopoulos) [14:59:15] (03PS1) 10Giuseppe Lavagetto: HelmFileAsset: correctly check for existence of data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/923625 [15:00:36] (03CR) 10Dzahn: [C: 03+1] "I don't see any apache 2.2 on https://debmonitor.wikimedia.org/packages/apache2 and didn't see anything in the repo, per my previous comme" [puppet] - 10https://gerrit.wikimedia.org/r/923615 (https://phabricator.wikimedia.org/T258686) (owner: 10Jbond) [15:01:23] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:38] (03CR) 10Elukey: [C: 03+1] HelmFileAsset: correctly check for existence of data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/923625 (owner: 10Giuseppe Lavagetto) [15:04:06] (ProbeDown) resolved: (4) Service miscweb1003:443 has failed probes (http_15_wikipedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:52] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10serviceops-collab, 10Security: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10Dzahn) We don't really have a relation to the github Wikimedia organization. 
[15:04:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] HelmFileAsset: correctly check for existence of data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/923625 (owner: 10Giuseppe Lavagetto) [15:07:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:22] !log nskaggs@cumin1001 START - Cookbook sre.wikireplicas.update-views [15:09:15] (03CR) 10Ilias Sarantopoulos: "LGTM, seems like a good workaround!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [15:09:25] (03CR) 10Ilias Sarantopoulos: [C: 03+1] changeprop: add hack to allow quotes in lift wing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [15:09:30] (03CR) 10Ottomata: [C: 03+1] "Thanks aiko! Are you ready for this to be deployed? You'll have to change the stream name you use in your code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923571 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [15:10:14] (03CR) 10Hnowlan: [C: 03+1] "I don't like the why but I'll allow the how." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [15:14:24] (03Merged) 10jenkins-bot: HelmFileAsset: correctly check for existence of data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/923625 (owner: 10Giuseppe Lavagetto) [15:15:01] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [15:15:23] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:06] (03CR) 10Dzahn: [C: 04-2] "needs to happen _after_ decom cookbook which would destroy the file system. but we still keep that around for a little while longer. so no" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [15:16:27] (03PS1) 10Daniel Kinzler: Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) [15:17:13] (03CR) 10CI reject: [V: 04-1] Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [15:17:47] (03PS2) 10Daniel Kinzler: Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) [15:20:35] (03CR) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on some top wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel 
Kinzler) [15:21:38] (03PS3) 10Daniel Kinzler: Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) [15:23:15] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:31] (03CR) 10Bartosz Dziewoński: [C: 03+1] Switch VisualEditor to not use RESTbase on small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923650 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [15:27:37] (03CR) 10Elukey: [C: 03+2] ml-services: deploy bloom-3b model [deployment-charts] - 10https://gerrit.wikimedia.org/r/922583 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [15:28:42] (03PS1) 10Dzahn: microsites: remove annualreport, migrated to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923652 (https://phabricator.wikimedia.org/T300171) [15:30:35] (03CR) 10Elukey: [C: 03+2] changeprop: add hack to allow quotes in lift wing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/923611 (owner: 10Elukey) [15:30:49] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:31:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:46] !log nskaggs@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [15:32:03] 10SRE-OnFire, 10Traffic, 10conftool, 10serviceops, 10Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10JMeybohm) a:03BBlack As you seem to be working on this I'm bluntly assign... 
[15:32:11] (03CR) 10Dzahn: "hieradata/common/profile/trafficserver/backend.yaml shows that it points to miscweb on 30443 - so kubernetes." [puppet] - 10https://gerrit.wikimedia.org/r/923652 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [15:32:33] (03PS1) 10Jbond: puppetboard: add add new role and config for new puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/923653 [15:32:56] (03CR) 10CI reject: [V: 04-1] puppetboard: add add new role and config for new puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/923653 (owner: 10Jbond) [15:34:48] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:34:57] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:35:33] (03PS2) 10Jbond: puppetboard: add add new role and config for new puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/923653 [15:36:43] (03PS1) 10Dzahn: microsites: remove bienvenida.wikimedia.org, migrated to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923655 (https://phabricator.wikimedia.org/T300171) [15:36:43] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:37:08] (03CR) 10Dzahn: [C: 04-2] "I know, not ready yet, just preparing some changes for later on a Friday." [puppet] - 10https://gerrit.wikimedia.org/r/923655 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [15:37:39] (03CR) 10Jbond: [C: 03+2] puppetboard: add add new role and config for new puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/923653 (owner: 10Jbond) [15:38:51] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:40:07] (03PS1) 10Dzahn: httpbb: move tests for bienvenida.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/923656 (https://phabricator.wikimedia.org/T300171) [15:40:37] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetboard1003.eqiad.wmnet with OS bookworm [15:40:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:41:59] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host puppetboard2003.codfw.wmnet with OS bookworm [15:43:00] (03PS18) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [15:43:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: split BGP code into separate profile [puppet] - 10https://gerrit.wikimedia.org/r/923551 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:43:24] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [15:43:38] (03CR) 10Dzahn: [C: 04-1] "eh yeah, we didn't merge this which could be called a fail but we didn't need it because gerrit service was masked manually" [puppet] - 10https://gerrit.wikimedia.org/r/920773 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [15:45:39] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [15:46:35] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:03] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Jhancock.wm) @Marostegui the BIOS update is complete. I found a suitable CPU replacement. 
Do we want to give that a try now or see if the BIOS update did the trick. LMK if you wanna swap and if it's safe to do so at t... [15:47:36] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) Let's go for the CPU swap too. You can do it anytime. The host isn't in use [15:49:11] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Jhancock.wm) I forgot to ask. was it CPU1 or CPU2 that was having the issue? [15:50:12] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [15:51:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:51:28] (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) [15:52:17] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2005-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002" [15:52:29] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) It doesn't say on the error: ` 2023-05-25 05:16:13 SYS1003 System CPU Resetting. 
` [15:52:51] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:46] (03PS4) 10Arturo Borrero Gonzalez: cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) [15:53:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:53:57] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:54:33] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2005-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002" [15:54:33] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2005-dev: move to the new network setup [puppet] - 10https://gerrit.wikimedia.org/r/923301 (https://phabricator.wikimedia.org/T336564) (owner: 10Arturo Borrero Gonzalez) [15:55:27] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:55:27] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:56:41] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 95 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:57:00] (NodeTextfileStale) firing: Stale textfile for 
cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:57:35] (03Abandoned) 10Dzahn: gerrit2002: mask gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/920773 (https://phabricator.wikimedia.org/T334521) (owner: 10Dzahn) [16:00:33] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923552 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [16:02:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:03:19] (03PS7) 10Arturo Borrero Gonzalez: cloud_private: route the whole cloud public IPv4 space to cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/923324 (https://phabricator.wikimedia.org/T336963) [16:04:48] (03CR) 10Jbond: Create cookbook to upgrade Apache Traffic Server (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [16:05:44] (03PS3) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) [16:07:36] (03CR) 10Hnowlan: [C: 03+2] engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 
10AntiCompositeNumber) [16:08:11] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:19] 10SRE, 10PyBal, 10Release-Engineering-Team, 10Scap, and 4 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (10jcrespo) An initial draft of a postmortem for this issue has been posted at: https://wikitec... [16:11:15] (03PS7) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [16:12:01] (03CR) 10Jbond: Create cookbook to upgrade Apache Traffic Server (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [16:12:15] 10SRE, 10SRE-Access-Requests: Requesting access to global root for nskaggs - https://phabricator.wikimedia.org/T337571 (10nskaggs) [16:15:16] 10SRE, 10SRE-Access-Requests: Requesting access to global root for nskaggs - https://phabricator.wikimedia.org/T337571 (10kchapman) I approve this request as @nskaggs' manager [16:15:57] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:37] (03Merged) 10jenkins-bot: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber) [16:23:10] (03PS1) 10Jbond: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 [16:23:39] (03PS19) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [16:23:47] PROBLEM - Check systemd state on stat1009 is CRITICAL: 
CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:08] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [16:24:44] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Jhancock.wm) I replaced both since we're not sure. server has booted without issues. all components are green in the idrac dashboard. it's all yours now! I do see some slight discoloration on the old CPU2. not sure if it... [16:25:01] (03CR) 10Jbond: sre.cdn: move common functions to base class (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [16:25:51] (03CR) 10CI reject: [V: 04-1] sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [16:28:39] 10SRE, 10SRE-Access-Requests: Requesting access to global root for nskaggs - https://phabricator.wikimedia.org/T337571 (10Dzahn) Technically this is a request for membership in group "ops". So that is treated like any other "add user to existing group" process per https://wikitech.wikimedia.org/wiki/SRE/Produc... 
[16:30:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:35] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:23] (03PS1) 10Nskaggs: Add nskaggs to global root [puppet] - 10https://gerrit.wikimedia.org/r/923665 (https://phabricator.wikimedia.org/T337571) [16:34:54] (03PS2) 10Hnowlan: thumbor: move xcf support to imagemagick [deployment-charts] - 10https://gerrit.wikimedia.org/r/921053 (https://phabricator.wikimedia.org/T260285) [16:36:58] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetboard1003.eqiad.wmnet with OS bookworm [16:37:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10nskaggs) [16:37:41] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:50] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetboard2003.codfw.wmnet with OS bookworm [16:40:19] (03PS1) 10Jbond: admin: add nskaggs to ops [puppet] - 10https://gerrit.wikimedia.org/r/923667 (https://phabricator.wikimedia.org/T337571) [16:40:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:40:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10nskaggs) [16:42:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: 
Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10jbond) [16:44:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10jbond) p:05Triage→03Medium [16:45:19] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:45] (03Abandoned) 10Jbond: admin: add nskaggs to ops [puppet] - 10https://gerrit.wikimedia.org/r/923667 (https://phabricator.wikimedia.org/T337571) (owner: 10Jbond) [16:52:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:56:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10lmata) >>! In T337571#8883627, @Dzahn wrote: > Technically this is a request for membership in group "ops". So that is treated like any other "add user to existing group... 
[16:56:28] (03CR) 10Jbond: [C: 03+2] Add nskaggs to global root [puppet] - 10https://gerrit.wikimedia.org/r/923665 (https://phabricator.wikimedia.org/T337571) (owner: 10Nskaggs) [17:00:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10jbond) [17:00:37] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:14] (03PS1) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [17:06:46] (03PS2) 10Jbond: sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 [17:08:17] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:25] (03CR) 10CI reject: [V: 04-1] sre: update base class with an upgrade action [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [17:10:03] (03CR) 10Jbond: sre: update base class with an upgrade action (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [17:16:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:06] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (10wiki_willy) a:03Jclark-ctr [17:23:06] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923677 (https://phabricator.wikimedia.org/T330216) [17:23:08] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923677 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [17:23:51] PROBLEM - Check systemd 
state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:26] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923677 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [17:24:33] (03PS1) 10Jbond: wmcs: add wmcs-roots to all wmcs roles [puppet] - 10https://gerrit.wikimedia.org/r/923681 [17:28:05] (03CR) 10BCornwall: sre: update base class with an upgrade action (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [17:29:41] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:50] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.10 refs T330216 [17:31:55] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [17:33:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:36:48] (03PS2) 10Jbond: wmcs: add wmcs-roots to all wmcs roles [puppet] - 10https://gerrit.wikimedia.org/r/923681 [17:37:49] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:00] !log demon@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.10 refs T330216 (duration: 06m 10s) [17:38:05] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [17:45:29] 
RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:16] (03PS3) 10Jbond: wmcs: add wmcs-roots to wmcs data Persistence roles [puppet] - 10https://gerrit.wikimedia.org/r/923681 [17:46:18] (03PS1) 10Jbond: admin: add wmcs-roots to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/923684 [17:47:22] (03CR) 10Jbond: [C: 04-1] admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [17:48:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:49:51] (03CR) 10Jbond: [C: 04-1] admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [17:53:07] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:37] (03PS1) 10Herron: import upstream 0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/923685 [18:00:51] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:09] herron: oh nice! 
[18:02:16] :) [18:02:55] (03PS1) 10Ottomata: mw-page-content-change-enrich - bump to image version 1.18.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923687 (https://phabricator.wikimedia.org/T328925) [18:06:11] (03CR) 10Jbond: "thanks for the comments see inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/923670 (owner: 10Jbond) [18:07:35] (03PS1) 10Hashar: wm-checks-api: add support for DUCT [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) [18:08:29] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:16:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:11] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923689 (https://phabricator.wikimedia.org/T330216) [18:18:13] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923689 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) 
[18:19:00] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923689 (https://phabricator.wikimedia.org/T330216) (owner: 10TrainBranchBot) [18:23:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:04] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.10 refs T330216 [18:26:09] T330216: 1.41.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T330216 [18:26:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:13] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:33:33] (03PS2) 10Hashar: wm-checks-api: add support for DUCT [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) [18:37:33] (03CR) 10Hashar: "See T331651#8883870 and below for the rendering ;)" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) (owner: 10Hashar) [18:38:01] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:06] (03CR) 10Jforrester: "Neat!" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/923688 (https://phabricator.wikimedia.org/T331651) (owner: 10Hashar) [18:45:41] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:31] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:23] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:35] (03CR) 10JHathaway: [C: 03+1] install_console: restrict options used [puppet] - 10https://gerrit.wikimedia.org/r/922559 (https://phabricator.wikimedia.org/T117348) (owner: 10Jbond) [19:05:18] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10Jhancock.wm) @cmooney I moved the patch to switch cloudsw1-b1-codfw, port ge-1/0/13, but I can't get the netbox script to work. the server name is not sho... 
[19:08:18] (03CR) 10JHathaway: puppet-merge: implement Lock out, tag out (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/922915 (https://phabricator.wikimedia.org/T248872) (owner: 10Jbond) [19:09:09] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:43] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - bump to image version 1.18.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923687 (https://phabricator.wikimedia.org/T328925) (owner: 10Ottomata) [19:15:19] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:22] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:15:25] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:19:13] (03PS1) 10Ottomata: mw-page-content-change-enrich - set proper value of error sink stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/923693 (https://phabricator.wikimedia.org/T330507) [19:20:32] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - set proper value of error sink stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/923693 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [19:21:24] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:21:27] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:21:43] (03CR) 10Stang: Change project logo for Wikimania to Wikimania 2023 version T337044 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (owner: 10Robertsky) [19:22:55] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The 
following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:51] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:24:54] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:30:33] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:26] (03PS1) 10Cwhite: prometheus: don't add empty targets [puppet] - 10https://gerrit.wikimedia.org/r/923576 (https://phabricator.wikimedia.org/T320620) [19:38:13] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:45:51] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:31] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:20] 
(03CR) 10Bking: query_service: Permit python2 on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [19:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:57:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:58:24] (03CR) 10BryanDavis: [C: 03+1] Update apiVersion to be compatible with k8s 1.27.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/922828 (owner: 10JMeybohm) [20:01:11] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:07:52] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:08:51] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:15:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:22] 10SRE, 10ops-codfw, 10DBA: db2110 crashed - https://phabricator.wikimedia.org/T337445 (10Marostegui) Thank you!. I'll bring Mariadb up on Monday and leave it running for a few days before repooling it, to make sure everything is stable [20:15:52] (03CR) 10TheDJ: "adding some reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/876293 (owner: 10Nintendofan885) [20:22:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:22] (03PS12) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [20:25:24] (03PS20) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [20:25:26] (03PS1) 10Andrew Bogott: clouds.yaml: remove keystoneadmin section [puppet] - 
10https://gerrit.wikimedia.org/r/923696 (https://phabricator.wikimedia.org/T330759) [20:25:28] (03PS1) 10Andrew Bogott: Set OS_CLOUD in wmcs-openstack.sh [puppet] - 10https://gerrit.wikimedia.org/r/923697 (https://phabricator.wikimedia.org/T337577) [20:26:17] (03PS2) 10Cwhite: prometheus: don't add empty targets [puppet] - 10https://gerrit.wikimedia.org/r/923576 (https://phabricator.wikimedia.org/T320620) [20:29:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:41] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:21] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:10] (03PS1) 10Ottomata: mw-page-content-change-enrich - bump to image version 1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923699 (https://phabricator.wikimedia.org/T328925) [20:39:59] (03PS3) 10Cwhite: prometheus: don't add empty targets [puppet] - 10https://gerrit.wikimedia.org/r/923576 (https://phabricator.wikimedia.org/T320620) [20:46:01] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:48] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - bump to image version 1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/923699 (https://phabricator.wikimedia.org/T328925) (owner: 10Ottomata) [20:47:19] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: 
apply [20:47:22] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:50:09] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:50:12] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:53:43] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:16] 10SRE, 10SRE-Access-Requests: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10nshahquinn-wmf) [21:01:29] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:39] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10RhinosF1) [21:15:27] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (10kzimmerman) Approved as Neil's manager [21:23:13] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:03] (03PS5) 10Robertsky: Change project logo for 
Wikimania to Wikimania 2023 version T337044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 [21:30:51] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:09] (03CR) 10Robertsky: Change project logo for Wikimania to Wikimania 2023 version T337044 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (owner: 10Robertsky) [21:38:29] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:09] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:53:49] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:45] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:23] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:03] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:49] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[22:36:04] SRE, SRE-Access-Requests, Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (nshahquinn-wmf) > https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Renaming_shell_users Just to be clear, my prod account... [22:38:39] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:18] (CR) Stang: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/921610 (owner: Robertsky) [22:46:27] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:43] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:29] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:09] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T337451 (Jclark-ctr) Open→Resolved Replaced failed cable [23:08:13] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:22] (CR) Cwhite: "PCC: https://puppet-compiler.wmflabs.org/output/916914/41386/" [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite) [23:16:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:12] (CR) Cwhite: "PCC NOOP: 
https://puppet-compiler.wmflabs.org/output/923576/41387/" [puppet] - https://gerrit.wikimedia.org/r/923576 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite) [23:23:51] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:24] SRE, SRE-Access-Requests, Patch-For-Review: Transfer Neil Shah-Quinn's production access to new developer account - https://phabricator.wikimedia.org/T337591 (nshahquinn-wmf) [23:45:41] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:29] !log removing 2 files for legal compliance [23:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:21] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale