[00:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:04:54] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:59] !log rsyncing /root and /mnt/gitlab-backup of gitlab1001 to /srv/gitlab-backup on gitlab1004 (/srv/gitlab-backup was automounted after creating it and has > 200G free) T274463 [00:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:08] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [00:09:13] (03PS5) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [00:09:40] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:50] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [00:09:54] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:10:37] (03PS6) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [00:11:16] (03CR) 10CI reject: [V: 04-1] opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 
(https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [00:12:09] (03CR) 10Cwhite: opensearch_dashboards: add backup script enable job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [00:15:19] (03PS7) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [00:19:26] (03PS1) 10Dzahn: backup: switch fileset for gitlab from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/800357 (https://phabricator.wikimedia.org/T274463) [00:22:34] (03PS1) 10Dzahn: gitlab::dump: backup files on gitlab1004 in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/800358 (https://phabricator.wikimedia.org/T274463) [00:24:05] (03CR) 10Dzahn: "not all paths in the file set exist on this host, only /srv/gitlab-backups but I would hope Bacula doesn't care and just skips what isn't " [puppet] - 10https://gerrit.wikimedia.org/r/800358 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [00:26:06] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:26:26] RECOVERY - Disk space on gitlab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops [00:26:32] !log gitlab1001 deleted backups from last 3 days after rsync to gitlab1004 - freeing disk space, starting the full-backup service once again, should finish now without running out of disk - T2744463 [00:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:54] !log gitlab1001 deleted backups from last 3 days after rsync to gitlab1004 - freeing disk space, starting the full-backup service once again, should finish now without running out of disk - T274463 [00:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:00] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [00:28:08] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:03] (03PS1) 10Dzahn: site/gitlab: make gitlab2002 another backup dump location [puppet] - 10https://gerrit.wikimedia.org/r/800366 (https://phabricator.wikimedia.org/T274463) [00:31:39] (03CR) 10Dzahn: [C: 04-1] "not yet (gitlab1001 still tries to dump to /mnt) but soon and we need to not forget this" [puppet] - 10https://gerrit.wikimedia.org/r/800357 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [00:35:22] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35582/" [puppet] - 10https://gerrit.wikimedia.org/r/800366 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [00:40:55] (03PS1) 10Dzahn: gitlab::dump: add gitlab1004 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/800384 
(https://phabricator.wikimedia.org/T274463) [00:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:43:37] (03CR) 10Dzahn: [C: 03+2] gitlab::dump: add gitlab1004 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/800384 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [00:45:55] !log rsyncing /srv/gitlab-backup from gitlab1004 to gitlab2002 | systemctl status full-backup ..in progress on gitlab1001 - T274463 [00:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:02] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [00:53:01] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:54:54] (03CR) 10Ori: "I cherry-picked this on the Beta Cluster puppet master and confirmed that logs from the function-* services made it to logstash." 
[puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) (owner: 10Ori) [01:02:13] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:07:09] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [01:07:18] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:07:19] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:36] 👋 looking [01:08:37] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3753 bytes in 6.016 second response time https://wikitech.wikimedia.org/wiki/Docker [01:08:37] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [01:09:09] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:12:18] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:19] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:30] ACKNOWLEDGEMENT - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn backup in progress https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [01:12:30] ACKNOWLEDGEMENT - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 2956 MB (3% inode=99%): daniel_zahn backup in progress https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops [01:13:34] having trouble loading grafana dashboards [01:13:49] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:13:55] kind of busy handling gitlab alerts [01:14:10] ack [01:14:11] but on that one. only got the resolved page now [01:14:49] grafana1002 is super sluggish over ssh talso [01:14:51] *also [01:15:39] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 122712 bytes in 1.748 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [01:15:41] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 566 bytes in 6.714 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:15:50] ^ ok.. 
that for now [01:16:12] is it possible the gitlab backup caused some network saturation? [01:16:15] grafana1002 - i can get on it and load seems to recover [01:16:21] yes, it is [01:16:31] I am still copying actually [01:16:48] but I can stop it [01:17:00] I just wanted to feel better by having a second copy in the other DC [01:17:13] because.. long story, but otherwise we only had one copy [01:17:21] and backups need to be fixed [01:19:29] rzl: I have 37G of 45G I need .. hrmm [01:19:52] 41 now [01:20:21] mutante: understood -- grafana is recovering but still sluggish, [01:20:22] grafana dashboard loading [01:20:27] well there was supposed to be a link there [01:20:34] https://grafana.wikimedia.org/goto/d6GuDM9nk?orgId=1 [01:21:09] looking into docker-registry and the other stuff that hiccuped too, not sure if they shared a common row or something [01:21:19] I mean, all VMs, but maybe on the same host [01:21:35] nothing happened when I copied within the same DC. then I got the idea to also copy cross DC [01:21:40] it's rack A1 [01:21:49] if we do that again, maybe we throttle the transfer :) [01:21:54] setting that to active in netbox as well [01:22:42] the source is the same ganeti host I guess [01:23:13] and ack, re A1 -- I can see the traffic increases in librenms but I'm not network-smart enough to know if that was actually the cause of those healthcheck failures -- seems plausible though [01:23:18] absolutely, will use --bandwidth something if I do that again (I hope that's not the case that we have to) [01:23:33] I'm going to wander back off if you're all set, then :) thanks [01:24:07] sorry about the alert. thanks [01:24:20] it's a mess with the gitlab backups :( [01:24:39] and I did that to avoid Murphy's law, not trigger it, heh [01:25:25] the good part: at least we _have_ a complete full backup now. that wasn't the case. 
laters [01:27:07] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={LIST,PATCH,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:28:08] rsync finished [01:30:33] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:14] woah, and that means restore of the backup on the passive host worked as well.. that's good [01:33:27] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:33:29] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:19] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:49:13] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:50:29] PROBLEM - 
Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The following units failed: session-341469.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:25] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:01:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298560)', diff saved to https://phabricator.wikimedia.org/P28618 and previous config saved to /var/cache/conftool/dbconfig/20220527-020111-ladsgroup.json [02:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:20] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [02:03:07] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:16:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28619 and previous config saved to /var/cache/conftool/dbconfig/20220527-021616-ladsgroup.json [02:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:03] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28620 and previous config saved to /var/cache/conftool/dbconfig/20220527-023122-ladsgroup.json [02:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P28621 and previous config saved to /var/cache/conftool/dbconfig/20220527-024627-ladsgroup.json [02:46:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [02:46:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [02:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [02:46:34] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [02:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [02:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:13] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:50:43] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:52:45] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:13:55] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 111 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [03:22:03] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:25] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:26:05] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:50:51] jenkins seems to think that every patch is a merge conflict [03:50:51] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/799430/14 [03:50:51] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SyntaxHighlight_GeSHi/+/793622 [03:51:54] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/examples/+/799464 literally just creating an empty patch [03:58:32] ^ reported at T309371 [03:58:33] T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 [04:03:14] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:18:29] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:14] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:54:03] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:54:41] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:05:09] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:43] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:00] 10SRE, 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Marostegui) Sounds good @wiki_willy - let us know when we'd need to schedule some downtime for the host. Thanks! [05:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: After ugprading mysql', diff saved to https://phabricator.wikimedia.org/P28622 and previous config saved to /var/cache/conftool/dbconfig/20220527-053510-root.json [05:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:30] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/800596 (https://phabricator.wikimedia.org/T308915) [05:37:38] (03CR) 10CI reject: [V: 04-1] control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/800596 (https://phabricator.wikimedia.org/T308915) (owner: 10Marostegui) [05:39:24] uh? 
[05:39:56] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/800596 (https://phabricator.wikimedia.org/T308915) (owner: 10Marostegui) [05:40:47] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/800596 (https://phabricator.wikimedia.org/T308915) (owner: 10Marostegui) [05:41:21] (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/800596 (https://phabricator.wikimedia.org/T308915) (owner: 10Marostegui) [05:48:40] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) @ArielGlenn Can you please reassign the ticket? I have no clue who - concretly - is WMCS? [05:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: After ugprading mysql', diff saved to https://phabricator.wikimedia.org/P28623 and previous config saved to /var/cache/conftool/dbconfig/20220527-055014-root.json [05:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: After ugprading mysql', diff saved to https://phabricator.wikimedia.org/P28624 and previous config saved to /var/cache/conftool/dbconfig/20220527-060518-root.json [06:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:49] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:11:21] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:12:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data 
above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:17:09] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:20:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: After ugprading mysql', diff saved to https://phabricator.wikimedia.org/P28625 and previous config saved to /var/cache/conftool/dbconfig/20220527-062022-root.json [06:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:46] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Majavah) [06:20:47] (03Abandoned) 10Elukey: Add Aiko and Kevin to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/791036 (https://phabricator.wikimedia.org/T307927) (owner: 10Elukey) [06:29:39] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:32:31] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10TheresNoTime) fwiw, +1 - be very useful to have an additional user who could resolve issues like {T309371} [06:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: After ugprading mysql', diff saved to https://phabricator.wikimedia.org/P28626 and previous config saved to 
/var/cache/conftool/dbconfig/20220527-063525-root.json [06:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:44:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:46:26] (03PS4) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 [06:46:35] (03CR) 10CI reject: [V: 04-1] Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [06:47:27] (03PS5) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 [06:47:37] (03CR) 10CI reject: [V: 04-1] Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [06:50:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: After ugprading mysql', diff saved to https://phabricator.wikimedia.org/P28627 and previous config saved to /var/cache/conftool/dbconfig/20220527-065029-root.json [06:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:30] (03PS1) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. 
[puppet] - 10https://gerrit.wikimedia.org/r/800612 [06:52:39] (03CR) 10CI reject: [V: 04-1] Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/800612 (owner: 10Slyngshede) [06:53:22] (03Abandoned) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/800612 (owner: 10Slyngshede) [06:55:52] Anyone around who can restart zuul? https://phabricator.wikimedia.org/T308943#7947453 suggests it's the resolution to T309371 [06:55:53] T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220527T0700) [07:01:53] (03PS1) 10Slyngshede: P:hadoop::master - Remove Hadoop FairScheduler log cleanup. [puppet] - 10https://gerrit.wikimedia.org/r/800614 [07:02:02] (03CR) 10CI reject: [V: 04-1] P:hadoop::master - Remove Hadoop FairScheduler log cleanup. [puppet] - 10https://gerrit.wikimedia.org/r/800614 (owner: 10Slyngshede) [07:02:41] (03Abandoned) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:12:16] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:17:29] TheresNoTime: I'll give it a whack, and see it behaves better [07:21:58] (03Restored) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. 
[puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:22:55] (03PS2) 10Slyngshede: P:hadoop::master - Remove Hadoop FairScheduler log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/800614 [07:23:32] (03CR) 10CI reject: [V: 04-1] P:hadoop::master - Remove Hadoop FairScheduler log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/800614 (owner: 10Slyngshede) [07:24:56] (03PS3) 10Slyngshede: P:hadoop::master - Remove Hadoop FairScheduler log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/800614 [07:25:02] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Zabe) [07:28:30] (03PS6) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 [07:29:06] (03CR) 10jenkins-bot: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:30:28] (03PS7) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 [07:30:56] TheresNoTime: It seems to be running again. [07:31:11] slyngs: thank you! :D [07:32:04] <_joe_> TheresNoTime: sigh sorry I was under the shower [07:32:07] (03CR) 10Filippo Giunchedi: [C: 03+1] aptrepo: add opensearch2 thirdparty component [puppet] - 10https://gerrit.wikimedia.org/r/800294 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [07:32:38] _joe_: I view it as a learning experience... that and it explained why my own stuff broke :-) [07:32:47] <_joe_> eheh [07:32:47] :-P [07:33:23] _joe_: Zuul on contint1001 is masked though, and I'm not sure it that's be design [07:33:37] <_joe_> uhm [07:34:24] <_joe_> I assume that's to get it not to restart automatically by package upgrades? [07:34:36] <_joe_> but hasharAway will know better when he's back [07:35:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [07:37:38] <_joe_> slyngs: ah I see, contint2001 is the currently active server [07:37:48] <_joe_> so that's why zuul is masked on 1001 [07:38:12] that'll do it! :D [07:38:36] (03PS8) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 [07:39:10] (03CR) 10CI reject: [V: 04-1] Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:39:20] _joe_: How do you check which server is active? [07:39:46] <_joe_> slyngs: see operations/puppet/hieradata/hosts/contint1001.yaml and the corresponding for 2001 [07:40:21] (03PS9) 10Slyngshede: Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - 10https://gerrit.wikimedia.org/r/799257 [07:40:40] <_joe_> slyngs: for multi-dc stuff that is used to serve user traffcic, we have discovery.wmnet records pointing to the nearest active cluster for you [07:45:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35584/console" [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:47:15] (03CR) 10Slyngshede: "Updated patch to remove timer and cleanup script." 
[puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [07:47:56] (03Abandoned) 10Slyngshede: P:hadoop::master - Remove Hadoop FairScheduler log cleanup [puppet] - 10https://gerrit.wikimedia.org/r/800614 (owner: 10Slyngshede) [07:59:39] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:07:50] <_joe_> !log restarted rsyslog on kubernetes1014 [08:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:59] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:27:59] (KubernetesRsyslogDown) resolved: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:31:58] (KubernetesRsyslogDown) resolved: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:44:47] (03PS4) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 (https://phabricator.wikimedia.org/T296550) [08:44:49] (03PS7) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 [08:45:54] (03CR) 10CI reject: [V: 04-1] cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [08:46:08] (03CR) 10Filippo 
Giunchedi: puppetdb: create dbs before grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800031 (https://phabricator.wikimedia.org/T296550) (owner: 10Filippo Giunchedi) [08:48:07] (03CR) 10Elukey: [V: 03+1] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [08:48:09] (03PS8) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 [08:48:29] (03PS11) 10Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) [08:49:26] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35585/console" [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [08:54:26] (03PS1) 10Jelto: idp: add gitlab-new to idp [puppet] - 10https://gerrit.wikimedia.org/r/800666 (https://phabricator.wikimedia.org/T307142) [08:58:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [08:58:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [08:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298560)', diff saved to https://phabricator.wikimedia.org/P28629 and previous config saved to /var/cache/conftool/dbconfig/20220527-085819-ladsgroup.json [08:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:26] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298560 [09:01:55] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:05:09] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:08:09] (03PS1) 10Kevin Bazira: ml-services: add euwiki & fawiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/800670 (https://phabricator.wikimedia.org/T307418) [09:16:31] (03CR) 10Elukey: [C: 03+1] "Left a note that can be bypassed, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/799257 (owner: 10Slyngshede) [09:17:01] (03CR) 10Elukey: [C: 03+2] ml-services: add euwiki & fawiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/800670 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [09:17:14] (03CR) 10Jelto: [C: 03+2] wikimedia.org: add gitlab-new records + PTR [dns] - 10https://gerrit.wikimedia.org/r/799334 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:17:16] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/800252 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:17:25] (03PS3) 10Jelto: wikimedia.org: add gitlab-new records + PTR [dns] - 10https://gerrit.wikimedia.org/r/799334 (https://phabricator.wikimedia.org/T307142) [09:17:48] (03CR) 10AikoChou: [C: 03+1] ml-services: add euwiki & fawiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/800670 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [09:17:59] 10SRE, 10SRE-Access-Requests: Requesting access to PII in Superset for TheresNoTime - https://phabricator.wikimedia.org/T309383 (10TheresNoTime) [09:18:24] (03CR) 10Jbond: [C: 03+2] profile::auto_restarts: make comment match class name, minor grammar [puppet] - 
10https://gerrit.wikimedia.org/r/800235 (owner: 10Dzahn) [09:19:31] (03CR) 10Jbond: [C: 03+1] parsoid::testing: add an auto_restart service for nginx [puppet] - 10https://gerrit.wikimedia.org/r/800241 (owner: 10Dzahn) [09:22:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:22:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298560)', diff saved to https://phabricator.wikimedia.org/P28630 and previous config saved to /var/cache/conftool/dbconfig/20220527-092233-ladsgroup.json [09:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:41] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [09:23:11] 10SRE, 10SRE-Access-Requests: Requesting access to PII in Superset for TheresNoTime - https://phabricator.wikimedia.org/T309383 (10KSiebert) I am Sammy's manager and give my permission. [09:24:41] !log killed hewiki's refresh link suggestions job (T299021) [09:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:47] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [09:26:01] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:24] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[09:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:01] (03PS2) 10Jbond: resolvconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800255 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:36:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800031 (https://phabricator.wikimedia.org/T296550) (owner: 10Filippo Giunchedi) [09:37:28] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb postgress server: fix dependcey loop - https://phabricator.wikimedia.org/T296550 (10jbond) With filippos latest patch the only outstanding error is ` May 27 08:10:10 filippo-pdb-01 puppet-agent[17242]: (/Stage[main]/Postgr... [09:39:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [09:39:48] (03CR) 10Jbond: [C: 03+2] "I fixed the spec test otherwise lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/800255 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:48:01] !log run authdns-update for gitlab-new https://gerrit.wikimedia.org/r/c/operations/dns/+/799334 - T307142 [09:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:09] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [09:52:48] (03PS1) 10Giuseppe Lavagetto: mediawiki-httpd: correctly link expires.conf [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/800675 (https://phabricator.wikimedia.org/T309358) [09:53:02] (03PS1) 10Jbond: redfish: add `files` to the list of data parameters to request [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 [09:54:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [09:54:37] (03CR) 10Alexandros 
Kosiaris: [C: 03+2] developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [09:57:09] (03CR) 10Alexandros Kosiaris: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308) (owner: 10Alexandros Kosiaris) [09:59:19] (03CR) 10Filippo Giunchedi: cfssl: write pretty json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [10:01:15] (03PS3) 10Alexandros Kosiaris: admin: Add sgimeno to restricted [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) [10:01:37] (03CR) 10CI reject: [V: 04-1] admin: Add sgimeno to restricted [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [10:02:47] (03CR) 10Elukey: [C: 03+1] mediawiki-httpd: correctly link expires.conf [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/800675 (https://phabricator.wikimedia.org/T309358) (owner: 10Giuseppe Lavagetto) [10:03:05] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:21] (03PS4) 10Alexandros Kosiaris: admin: Add sgimeno to restricted [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) [10:04:45] (03CR) 10Alexandros Kosiaris: admin: Add sgimeno to restricted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [10:05:17] (03CR) 10Alexandros Kosiaris: "This is ready and awaits manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [10:05:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1068.eqiad.wmnet 
with OS bullseye [10:05:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1068.eqiad.wmnet with OS bullseye [10:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:52] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki-httpd: correctly link expires.conf [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/800675 (https://phabricator.wikimedia.org/T309358) (owner: 10Giuseppe Lavagetto) [10:12:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS bullseye [10:12:32] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1071.eqiad.wmnet with OS bullseye [10:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage [10:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:18] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:24:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage [10:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] (03PS1) 10Ladsgroup: docroot: 
Improve design of noc.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800680 [10:33:42] (03PS1) 10Btullis: Configure .gitignore to exclude the vendor subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/800681 [10:38:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1068.eqiad.wmnet with OS bullseye [10:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:52] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1068.eqiad.wmnet with OS bullseye completed: - ms-be1068 (**PASS**) - Downtim... [10:42:53] (03CR) 10Jbond: [C: 03+1] cfssl: write pretty json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [10:43:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [10:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:41] (03CR) 10Jbond: [C: 03+2] Configure .gitignore to exclude the vendor subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/800681 (owner: 10Btullis) [10:45:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1069.eqiad.wmnet with OS bullseye [10:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:47] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1069.eqiad.wmnet with OS bullseye [10:46:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [10:46:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:17] PROBLEM - puppet last run on gitlab2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:52:44] RECOVERY - puppet last run on gitlab2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:56:02] PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100% [10:57:28] RECOVERY - Host ms-be1070 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [10:58:50] (03PS1) 10Gergő Tisza: Log output of scheduled MediaWiki maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/800683 (https://phabricator.wikimedia.org/T285896) [11:00:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [11:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1071.eqiad.wmnet with OS bullseye [11:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:33] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1071.eqiad.wmnet with OS bullseye completed: - ms-be1071 (**PASS**) - Downtim... 
[11:03:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [11:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:22] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:08:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [11:12:42] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:12:54] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [11:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [11:15:14] (03PS1) 10Jcrespo: mariabackup: Make the vendor detection account for known variations [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/800708 (https://phabricator.wikimedia.org/T309303) [11:18:03] (03PS1) 10Jelto: wikimedia.org: move gitlab-replica from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) [11:18:40] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [11:18:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:58] (03CR) 10CI reject: [V: 04-1] wikimedia.org: move gitlab-replica from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:20:46] (03PS2) 10Jcrespo: mariabackup: Make the vendor detection account for known variations [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/800708 (https://phabricator.wikimedia.org/T309303) [11:21:53] (03CR) 10Jelto: "fail expected, see https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change" [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:32:43] (03PS1) 10Ladsgroup: Add change_user_editcount_to_unsigned_T309311.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800710 (https://phabricator.wikimedia.org/T309311) [11:33:34] (03PS2) 10Jelto: wikimedia.org: move gitlab-replica from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) [11:33:59] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:34:28] (03CR) 10CI reject: [V: 04-1] wikimedia.org: move gitlab-replica from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:38:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1070.eqiad.wmnet with OS bullseye [11:38:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1070.eqiad.wmnet with OS bullseye [11:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:41] !log 
jnuche@deploy1002 install-world aborted: (duration: 00m 02s) [11:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage [11:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage [11:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1069.eqiad.wmnet with OS bullseye [11:55:41] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1069.eqiad.wmnet with OS bullseye completed: - ms-be1069 (**PASS**) - Downtim... [11:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:33] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35586/console" [puppet] - 10https://gerrit.wikimedia.org/r/800031 (https://phabricator.wikimedia.org/T296550) (owner: 10Filippo Giunchedi) [12:09:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1070.eqiad.wmnet with OS bullseye [12:09:07] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1070.eqiad.wmnet with OS bullseye completed: - ms-be1070 (**PASS**) - Downtim... 
[12:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:23] (03CR) 10Marostegui: [C: 03+1] Add change_user_editcount_to_unsigned_T309311.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800710 (https://phabricator.wikimedia.org/T309311) (owner: 10Ladsgroup) [12:12:39] (03CR) 10Marostegui: [C: 03+1] Add drop_page_restrictions_T60674.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800183 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [12:13:01] (03CR) 10Marostegui: [C: 03+1] Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [12:14:17] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [12:14:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:14:51] (03CR) 10Ladsgroup: [C: 03+2] Add drop_page_restrictions_T60674.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800183 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [12:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:55] (03CR) 10Ladsgroup: [C: 03+2] Add change_user_editcount_to_unsigned_T309311.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800710 (https://phabricator.wikimedia.org/T309311) (owner: 10Ladsgroup) [12:15:17] (03Merged) 10jenkins-bot: Add drop_page_restrictions_T60674.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800183 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [12:15:21] (03Merged) 10jenkins-bot: Add change_user_editcount_to_unsigned_T309311.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800710 (https://phabricator.wikimedia.org/T309311) (owner: 10Ladsgroup) [12:17:15] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:19] (03CR) 10Marostegui: [C: 03+1] mariabackup: Make the vendor detection account for known variations [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/800708 (https://phabricator.wikimedia.org/T309303) (owner: 10Jcrespo) [12:22:05] (03CR) 10Jcrespo: [C: 03+2] mariabackup: Make the vendor detection account for known variations [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/800708 (https://phabricator.wikimedia.org/T309303) (owner: 10Jcrespo) [12:23:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:20] (03PS1) 10Urbanecm: throttle: Add new throttle rule + remove expired ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800711 (https://phabricator.wikimedia.org/T309395) [12:25:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:37] 10SRE, 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Marostegui) [12:32:39] (03Restored) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [12:32:49] (03PS2) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) [12:33:02] (03Abandoned) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [12:35:39] !log jnuche@deploy1002 
install-world aborted: (duration: 01m 32s) [12:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:35] (03CR) 10Cathal Mooney: [C: 03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [12:40:41] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10DBA, 10GlobalBlocking, 10Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (10Marostegui) [12:40:51] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) 05Open→03Stalled Going to stall this until we are on 10.6 [12:41:21] (03CR) 10Jelto: [C: 03+2] wikimedia.org: move gitlab-replica from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800709 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [12:43:26] !log run authdns-update for gitlab-replica https://gerrit.wikimedia.org/r/c/operations/dns/+/800709 - T307142 [12:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:31] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [12:52:24] !log jnuche@deploy1002 install-world aborted: (duration: 03m 22s) [12:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:20] (03PS1) 10Jelto: wikimedia.org: move gitlab from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) [12:58:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - 
https://phabricator.wikimedia.org/T304989 (10cmooney) [12:58:38] (03CR) 10CI reject: [V: 04-1] wikimedia.org: move gitlab from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [12:59:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney) 05Open→03Resolved Work for this is now completed, will update design task once confirmed there are no niggles with reimaging. [12:59:11] (03PS1) 10Marostegui: site.pp: db1128 current situation [puppet] - 10https://gerrit.wikimedia.org/r/800720 (https://phabricator.wikimedia.org/T309303) [13:00:03] (03CR) 10Marostegui: [C: 03+2] site.pp: db1128 current situation [puppet] - 10https://gerrit.wikimedia.org/r/800720 (https://phabricator.wikimedia.org/T309303) (owner: 10Marostegui) [13:00:46] (03CR) 10Jelto: "fail expected, see https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change" [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:01:40] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:03:42] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:03:50] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:04:04] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:04:05] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [13:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:28] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:06] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:08:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:51] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:10:18] (03PS1) 10Elukey: ml-services: bump docker image for articlequality pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/800723 [13:12:15] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:13:27] (03CR) 10Jelto: [C: 03+2] wikimedia.org: move gitlab from netbox to dns repo [dns] - 10https://gerrit.wikimedia.org/r/800719 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [13:13:44] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/800723 (owner: 10Elukey) [13:14:50] !log run authdns-update for gitlab.wikimedia.org https://gerrit.wikimedia.org/r/c/operations/dns/+/800719 - T307142 [13:14:53] (03CR) 10Elukey: [C: 03+2] ml-services: bump docker image for articlequality pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/800723 (owner: 10Elukey) [13:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [13:15:18] (03CR) 10AikoChou: [C: 03+1] ml-services: bump docker image for articlequality pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/800723 (owner: 10Elukey) [13:16:15] !log jnuche@deploy1002 install-world aborted: (duration: 00m 25s) [13:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:14] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:06] (03CR) 10Jelto: [C: 04-1] "I have some concerns if that helps. 
The idea is bacula can catch backups on gitlab1004 when it's not possible on gitlab1001 due to disk is" [puppet] - 10https://gerrit.wikimedia.org/r/800358 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [13:27:55] (03PS1) 10Cathal Mooney: Add includes in reverse DNS zone files for new cloudsw subnets [dns] - 10https://gerrit.wikimedia.org/r/800727 (https://phabricator.wikimedia.org/T304989) [13:28:43] (03CR) 10CI reject: [V: 04-1] Add includes in reverse DNS zone files for new cloudsw subnets [dns] - 10https://gerrit.wikimedia.org/r/800727 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [13:29:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:32] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:35:42] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:35:56] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:10] (03PS1) 10Jelto: gitlab: use gitlab1004 as replia/passive host [puppet] - 10https://gerrit.wikimedia.org/r/800728 (https://phabricator.wikimedia.org/T307142) [13:39:36] (03PS1) 10Cathal Mooney: Add new per-rack cloudsw subnets for e4 and f4 to networks data [puppet] - 10https://gerrit.wikimedia.org/r/800730 (https://phabricator.wikimedia.org/T304989) [13:39:47] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/800727 (https://phabricator.wikimedia.org/T304989) 
(owner: 10Cathal Mooney) [13:46:19] (03PS1) 10Cathal Mooney: Install server changes to support new subnets cloud racks c8 and d5 [puppet] - 10https://gerrit.wikimedia.org/r/800731 (https://phabricator.wikimedia.org/T304989) [13:47:15] (03CR) 10Cathal Mooney: [C: 03+2] Add includes in reverse DNS zone files for new cloudsw subnets [dns] - 10https://gerrit.wikimedia.org/r/800727 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [13:48:02] (03CR) 10Cathal Mooney: "Self-merging as there is time sensitivity to the order of changes between zone file and Netbox. Third similar change this week so I am co" [dns] - 10https://gerrit.wikimedia.org/r/800727 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [13:48:40] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [13:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:50] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:43] (03PS1) 10Elukey: ml-services: bump docker image for draftquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/800732 (https://phabricator.wikimedia.org/T309102) [13:49:50] (03CR) 10Cathal Mooney: [C: 03+2] Add includes in reverse DNS zone files for new cloudsw subnets [dns] - 10https://gerrit.wikimedia.org/r/800727 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [13:51:47] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:24] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:47] (03CR) 10Elukey: [C: 03+2] ml-services: bump docker image for draftquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/800732 
(https://phabricator.wikimedia.org/T309102) (owner: 10Elukey) [13:56:35] (03PS1) 10Andrew Bogott: Horizon: disable creation of new proxies under .wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/800735 (https://phabricator.wikimedia.org/T305391) [13:56:38] (03CR) 10Kevin Bazira: ml-services: bump docker image for draftquality (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/800732 (https://phabricator.wikimedia.org/T309102) (owner: 10Elukey) [13:56:52] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:06] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:00:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] (03CR) 10Herron: [C: 03+1] opensearch_dashboards: add backup script enable job (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [14:10:07] (03CR) 10Herron: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/800294 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [14:22:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:22:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [14:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Depooling db1130 (T60674)', diff saved to https://phabricator.wikimedia.org/P28632 and previous config saved to /var/cache/conftool/dbconfig/20220527-142219-ladsgroup.json [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:29] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:26:01] (03PS1) 10Ladsgroup: Remove page_restrictions field from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/800739 (https://phabricator.wikimedia.org/T60674) [14:26:32] (03CR) 10Marostegui: [C: 03+1] Remove page_restrictions field from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/800739 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [14:29:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T60674)', diff saved to https://phabricator.wikimedia.org/P28633 and previous config saved to /var/cache/conftool/dbconfig/20220527-142921-ladsgroup.json [14:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:28] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:29:38] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:29:53] (03PS2) 10Ladsgroup: Remove page_restrictions field from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/800739 (https://phabricator.wikimedia.org/T60674) [14:29:58] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Remove page_restrictions field from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/800739 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [14:31:40] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, 
currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:42:15] (03PS1) 10Ladsgroup: Depool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/800692 (https://phabricator.wikimedia.org/T60674) [14:42:23] (03PS2) 10Ladsgroup: Depool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/800692 (https://phabricator.wikimedia.org/T60674) [14:42:44] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Depool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/800692 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [14:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P28634 and previous config saved to /var/cache/conftool/dbconfig/20220527-144426-ladsgroup.json [14:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:51:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298560)', diff saved to https://phabricator.wikimedia.org/P28635 and previous config saved to /var/cache/conftool/dbconfig/20220527-145135-ladsgroup.json [14:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [14:53:55] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10zeljkofilipin) 
[14:59:22] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Aklapper) Unassigning (if I understand correctly); this is already tagged with #Cloud-Services [14:59:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P28636 and previous config saved to /var/cache/conftool/dbconfig/20220527-145931-ladsgroup.json [14:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:49] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Aklapper) a:05ArielGlenn→03None [15:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28637 and previous config saved to /var/cache/conftool/dbconfig/20220527-150640-ladsgroup.json [15:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:32] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:12:18] (03PS1) 10Ladsgroup: dbproxy: Repool the old batch, Depool the new one [puppet] - 10https://gerrit.wikimedia.org/r/800693 (https://phabricator.wikimedia.org/T60674) [15:12:27] (03PS2) 10Ladsgroup: dbproxy: Repool the old batch, Depool the new one [puppet] - 10https://gerrit.wikimedia.org/r/800693 (https://phabricator.wikimedia.org/T60674) [15:13:01] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] dbproxy: Repool the old batch, Depool the new one [puppet] - 10https://gerrit.wikimedia.org/r/800693 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [15:14:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T60674)', diff
saved to https://phabricator.wikimedia.org/P28638 and previous config saved to /var/cache/conftool/dbconfig/20220527-151436-ladsgroup.json [15:14:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:14:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:44] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28639 and previous config saved to /var/cache/conftool/dbconfig/20220527-151444-ladsgroup.json [15:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:16] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28640 and previous config saved to /var/cache/conftool/dbconfig/20220527-152145-ladsgroup.json [15:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28641 and previous config saved to /var/cache/conftool/dbconfig/20220527-152355-ladsgroup.json [15:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:01] T60674: Drop 
page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:26:51] (03Abandoned) 10Zabe: Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/798817 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [15:30:48] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:31:18] (03PS1) 10Ladsgroup: dbproxy: Repool clouddb101(3|4|5|6) [puppet] - 10https://gerrit.wikimedia.org/r/800748 (https://phabricator.wikimedia.org/T60674) [15:31:45] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] dbproxy: Repool clouddb101(3|4|5|6) [puppet] - 10https://gerrit.wikimedia.org/r/800748 (https://phabricator.wikimedia.org/T60674) (owner: 10Ladsgroup) [15:33:34] (03CR) 10Dzahn: "It was only meant to capture once the files that currently only exist on gitlab1004 but not on another server. It was not about a long-ter" [puppet] - 10https://gerrit.wikimedia.org/r/800358 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [15:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298560)', diff saved to https://phabricator.wikimedia.org/P28642 and previous config saved to /var/cache/conftool/dbconfig/20220527-153650-ladsgroup.json [15:36:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:36:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:57] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [15:36:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P28643 and previous config saved to /var/cache/conftool/dbconfig/20220527-153658-ladsgroup.json [15:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28644 and previous config saved to /var/cache/conftool/dbconfig/20220527-153900-ladsgroup.json [15:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:41:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:04] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb2002-dev is not behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/795365 (owner: 10Majavah) [15:46:24] (03PS3) 10Jbond: WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [15:50:36] (03CR) 10CI reject: [V: 04-1] WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [15:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [15:50:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:50] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Depooling db1131 (T309311)', diff saved to https://phabricator.wikimedia.org/P28645 and previous config saved to /var/cache/conftool/dbconfig/20220527-155049-ladsgroup.json [15:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:59] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:53:11] (03PS1) 10Andrew Bogott: Revert "cloudweb2002-dev is not behind LVS" [puppet] - 10https://gerrit.wikimedia.org/r/800695 [15:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28646 and previous config saved to /var/cache/conftool/dbconfig/20220527-155405-ladsgroup.json [15:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloudweb2002-dev is not behind LVS" [puppet] - 10https://gerrit.wikimedia.org/r/800695 (owner: 10Andrew Bogott) [15:55:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T309311)', diff saved to https://phabricator.wikimedia.org/P28647 and previous config saved to /var/cache/conftool/dbconfig/20220527-155510-ladsgroup.json [15:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) (owner: 10Ori) [16:02:47] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: disable creation of new proxies under .wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/800735 (https://phabricator.wikimedia.org/T305391) (owner: 10Andrew Bogott) [16:03:43] (03PS1) 10Ahmon Dancy: Turn mw_releases into a list [puppet] - 10https://gerrit.wikimedia.org/r/800758 
(https://phabricator.wikimedia.org/T299648) [16:04:49] (03CR) 10Ahmon Dancy: [C: 03+1] Turn mw_releases into a list [puppet] - 10https://gerrit.wikimedia.org/r/800758 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [16:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28648 and previous config saved to /var/cache/conftool/dbconfig/20220527-160910-ladsgroup.json [16:09:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:09:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:17] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28649 and previous config saved to /var/cache/conftool/dbconfig/20220527-161015-ladsgroup.json [16:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) a:05Andrew→03Jclark-ctr I just noticed that this is still assigned to me! I don't think there any action items l...
[16:16:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:16:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:16:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [16:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [16:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:30] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: bug T305391 [16:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:36] T305391: Disable creation of new web proxies under .wmflabs.org - https://phabricator.wikimedia.org/T305391 [16:21:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:21:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:21:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:05] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T60674)', diff saved to https://phabricator.wikimedia.org/P28650 and previous config saved to /var/cache/conftool/dbconfig/20220527-162204-ladsgroup.json [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:18] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:23:42] (03PS4) 10Jbond: WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [16:24:09] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: bug T305391 (duration: 05m 39s) [16:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:15] T305391: Disable creation of new web proxies under .wmflabs.org - https://phabricator.wikimedia.org/T305391 [16:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28651 and previous config saved to /var/cache/conftool/dbconfig/20220527-162520-ladsgroup.json [16:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:18] (03PS1) 10Krinkle: Follow-up I8d62aedb: Fix .rotation mixin [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800696 [16:28:35] (03CR) 10CI reject: [V: 04-1] WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [16:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T60674)', diff saved to https://phabricator.wikimedia.org/P28652 and previous config saved to /var/cache/conftool/dbconfig/20220527-163025-ladsgroup.json [16:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:32] 
T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:40:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T309311)', diff saved to https://phabricator.wikimedia.org/P28653 and previous config saved to /var/cache/conftool/dbconfig/20220527-164026-ladsgroup.json [16:40:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:40:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28654 and previous config saved to /var/cache/conftool/dbconfig/20220527-164034-ladsgroup.json [16:40:37] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28655 and previous config saved to /var/cache/conftool/dbconfig/20220527-164530-ladsgroup.json [16:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28656 and previous config saved to /var/cache/conftool/dbconfig/20220527-165117-ladsgroup.json [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:51:23] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:00:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28657 and previous config saved to /var/cache/conftool/dbconfig/20220527-170035-ladsgroup.json [17:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P28658 and previous config saved to /var/cache/conftool/dbconfig/20220527-170622-ladsgroup.json [17:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:04] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10phuedx) >>! In T306181#7914450, @akosiaris wrote: > I notice that things that take like 1s... 
[17:12:58] (03PS5) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [17:15:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298560)', diff saved to https://phabricator.wikimedia.org/P28659 and previous config saved to /var/cache/conftool/dbconfig/20220527-171537-ladsgroup.json [17:15:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T60674)', diff saved to https://phabricator.wikimedia.org/P28660 and previous config saved to /var/cache/conftool/dbconfig/20220527-171541-ladsgroup.json [17:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:15:43] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [17:15:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28661 and previous config saved to /var/cache/conftool/dbconfig/20220527-171548-ladsgroup.json [17:15:49] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to 
https://phabricator.wikimedia.org/P28662 and previous config saved to /var/cache/conftool/dbconfig/20220527-172127-ladsgroup.json [17:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28663 and previous config saved to /var/cache/conftool/dbconfig/20220527-172444-ladsgroup.json [17:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:51] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:24:59] (03PS1) 10Majavah: dynamicproxy: add zones endpoint [puppet] - 10https://gerrit.wikimedia.org/r/800775 [17:25:14] (03PS1) 10Ladsgroup: db_maint_mapper_sal: Add Category:MariaDB to the report [software] - 10https://gerrit.wikimedia.org/r/800776 [17:26:30] (03PS2) 10Ladsgroup: db_maint_mapper_sal: Add Category:MariaDB to the report [software] - 10https://gerrit.wikimedia.org/r/800776 [17:29:04] (03CR) 10Andrew Bogott: dynamicproxy: add zones endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800775 (owner: 10Majavah) [17:30:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P28664 and previous config saved to /var/cache/conftool/dbconfig/20220527-173042-ladsgroup.json [17:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28665 and previous config saved to /var/cache/conftool/dbconfig/20220527-173632-ladsgroup.json [17:36:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:36:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:40] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:36:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28666 and previous config saved to /var/cache/conftool/dbconfig/20220527-173641-ladsgroup.json [17:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28667 and previous config saved to /var/cache/conftool/dbconfig/20220527-173949-ladsgroup.json [17:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:16] topranks: I saw changes that move DNS entries out of netbox and into the DNS repo (for gitlab-new etc). I need the exact same thing but for gerrit. Previously tried to add in netbox but it wasn't the correct way. Though have done a ton of DNS changes before netbox was around. What is the correct approach, could I check with you possibly? [17:41:03] it's the same special case where there are public IPs, not behind LVS, secondary service IP vs server IP ... [17:42:23] mutante: ok yeah, on the wider point of what should be left in Netbox, and what the valid reasons for doing it outside that are I'm not that familiar. [17:42:26] ah, sorry, forgot about the timezone for a moment. It's Friday night already. I'll ask the same thing in Phabricator or next week. 
[17:42:35] But seems reasonable given your description [17:42:39] no probs, I have a few mins [17:42:48] gitlab is following what gerrit did previously [17:43:26] Basic thing is to first update Netbox, clear the hostname from the "DNS" field of the IP address in question [17:43:44] so.. if I were to add names directly in the DNS repo and do nothing in netbox.. basically just like I would have done it before netbox existed.. and just authdns-update and don't use any cookbook.. then is it ok now? [17:43:52] And best to add a description starting "Keep manual DNS" to it instead [17:43:55] Like these: [17:43:56] or all that but run the cookbook afterwards? [17:43:56] Keep manual DNS: [17:44:09] https://netbox.wikimedia.org/search/?q=gitlab-replica.wikimedia.org [17:44:16] No you need to remove the DNS hostname from the IP in netbox [17:44:19] oh, so they ARE actually still in netbox [17:44:28] then moving them out of netbox wasn't what I thought it was [17:44:38] Without doing that, and running the sre.dns.netbox cookbook, there will be double entries [17:44:50] Which will mean CI will fail for your manual patch of the zonefile. [17:44:57] jelto was following wikitech let me see if I can find. [17:45:00] But basic approach is: [17:45:07] 1) Upload patch with manual entries [17:45:25] 2) Remove the entries from Netbox as I described above [17:45:42] 3) Run CI again on gerrit, should get nice green tick [17:45:47] 4) Merge [17:45:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P28668 and previous config saved to /var/cache/conftool/dbconfig/20220527-174547-ladsgroup.json [17:45:52] 5) Run authdns update [17:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:10] but what if they don't exist in netbox at all yet [17:46:26] I had deleted my previous entries because they were not correct [17:47:09] I had followed docs for "special case" on wikitech.
but that was a different type of special cases afaict [17:47:11] Ah sorry [17:47:21] If they don't exist in Netbox you can just follow the old manual process [17:47:42] I was talking about a situation an existing hostname was moving from netbox to manual [17:47:57] If it doesn't exist just edit the zonefile and submit patch via gerrit [17:48:00] :)) that's great. That makes it easier [17:48:36] previously I had tried the other approach with the "reserved in netbox" entries but that was not for this type of service IP [17:48:50] cool. I'm out the door now but if nobody else looks at it I can review Monday am put me on as a reviewer. [17:48:54] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Milimetric) > I'm very intrigued @Milimetric about your comment about reinstrumenting pageviews in a declarative way (that sounds like it could help with some of our work ar... [17:49:09] The IP should be present in Netbox - this isn't a DNS thing - but just so it's documented / not used for anything else. [17:50:00] eh, ok, I got a bit confused still about it being in netbox or not. 
let me do that next week with a review [17:50:08] enjoy the weekend [17:51:55] ok cool same to you :) [17:52:07] thanks, cya [17:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28669 and previous config saved to /var/cache/conftool/dbconfig/20220527-175455-ladsgroup.json [17:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28670 and previous config saved to /var/cache/conftool/dbconfig/20220527-175819-ladsgroup.json [17:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:26] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:59:37] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Dzahn) [18:00:50] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Dzahn) checked off boxes (L3 signed, NDA, has existing shell access, etc). Will need approval from group approver (Tyler). 
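Note on the DNS discussion above (17:43 to 17:49): the two cases described there, a hostname moving from Netbox to manual management versus a hostname that was never in Netbox, can be summarized in a small sketch. This is only an illustration of the reasoning in the chat, not the real tooling: the function names, the `(hostname, ip)` record shape, and the `example.org` names are all assumptions made up for this note, and nothing here talks to Netbox, Gerrit, or the DNS repo.

```python
# Hypothetical sketch: these helpers and record shapes are illustrative
# assumptions, not part of Netbox, the DNS repo, or any cookbook.

def duplicate_names(manual_records, netbox_records):
    """Return hostnames defined both manually and in Netbox-generated data.

    Per the chat, such double entries are what make CI fail on the
    manual zonefile patch.
    """
    manual = {name for name, _ip in manual_records}
    generated = {name for name, _ip in netbox_records}
    return sorted(manual & generated)


def migration_steps(name, netbox_records):
    """List the steps mentioned in the chat for managing `name` manually."""
    in_netbox_dns = any(n == name for n, _ip in netbox_records)
    if not in_netbox_dns:
        # The name was never generated from Netbox: old manual process.
        return ["edit the zonefile", "submit the patch via Gerrit"]
    return [
        "upload patch with manual entries",
        "clear the hostname from the IP's DNS field in Netbox "
        "and add a 'Keep manual DNS' description",
        "run the sre.dns.netbox cookbook",
        "re-run CI on the patch",
        "merge",
        "run authdns-update",
    ]


# Made-up example data (documentation IP range, example.org names).
manual = [("gerrit-new.example.org", "192.0.2.10")]
netbox = [
    ("gerrit-new.example.org", "192.0.2.10"),
    ("other.example.org", "192.0.2.11"),
]

print(duplicate_names(manual, netbox))
for step in migration_steps("gerrit-new.example.org", netbox):
    print("-", step)
```

As noted at 17:49:09, the IP itself is still expected to be present in Netbox in both cases, so that it is documented and not reused; the sketch only models the DNS name.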
[18:00:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298560)', diff saved to https://phabricator.wikimedia.org/P28671 and previous config saved to /var/cache/conftool/dbconfig/20220527-180052-ladsgroup.json [18:00:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:00:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:00] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [18:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:06] (03CR) 10Jcrespo: [C: 03+1] db_maint_mapper_sal: Add Category:MariaDB to the report [software] - 10https://gerrit.wikimedia.org/r/800776 (owner: 10Ladsgroup) [18:10:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28672 and previous config saved to /var/cache/conftool/dbconfig/20220527-181000-ladsgroup.json [18:10:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:10:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:06] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:33] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:10:45] (03CR) 10Ladsgroup: [C: 03+2] db_maint_mapper_sal: Add Category:MariaDB to the report [software] - 10https://gerrit.wikimedia.org/r/800776 (owner: 10Ladsgroup) [18:13:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28673 and previous config saved to /var/cache/conftool/dbconfig/20220527-181324-ladsgroup.json [18:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:22] 10SRE, 10Analytics, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10Milimetric) @Tsevener is right, and that's the access that @RhinosF1 pointed to. @Dmantena: unfortunately, due to how authentication and authorization works more broadly at wmf, th... 
[18:14:25] (03Merged) 10jenkins-bot: db_maint_mapper_sal: Add Category:MariaDB to the report [software] - 10https://gerrit.wikimedia.org/r/800776 (owner: 10Ladsgroup) [18:14:35] 10SRE, 10Data-Engineering, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10Milimetric) [18:16:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:16:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T60674)', diff saved to https://phabricator.wikimedia.org/P28674 and previous config saved to /var/cache/conftool/dbconfig/20220527-181650-ladsgroup.json [18:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:57] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:20:35] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:25:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T60674)', diff saved to https://phabricator.wikimedia.org/P28675 and previous config saved to /var/cache/conftool/dbconfig/20220527-182523-ladsgroup.json [18:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:28:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28676 and 
previous config saved to /var/cache/conftool/dbconfig/20220527-182829-ladsgroup.json [18:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:11] (03CR) 10Majavah: dynamicproxy: add zones endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800775 (owner: 10Majavah) [18:40:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28677 and previous config saved to /var/cache/conftool/dbconfig/20220527-184028-ladsgroup.json [18:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28678 and previous config saved to /var/cache/conftool/dbconfig/20220527-184334-ladsgroup.json [18:43:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:43:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:43:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [18:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [18:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:33] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:49:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28679 and previous config saved to /var/cache/conftool/dbconfig/20220527-184938-ladsgroup.json [18:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:46] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:51:04] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1002/35587/" [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [18:51:10] (03PS8) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [18:51:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr) name rack Unit Port CableID an-presto1006 e1 29 29 20220068 an-presto1007 e1 31 31 20220061 an-presto1008 e2 31 31 20220066 an-pre... 
[18:52:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr) [18:53:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:53:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:53:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28680 and previous config saved to /var/cache/conftool/dbconfig/20220527-185533-ladsgroup.json [18:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:28] (03Abandoned) 10Stang: zhwikiquote: Add logo variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792973 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [18:57:40] 10SRE, 10Data-Engineering, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10Dzahn) thanks @Milimetric.that makes sense. it was just out of habit to still use that tag.
gotcha for next time [19:03:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:03:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:19] (03PS1) 10Andrew Bogott: cloudweb2002-dev is not behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/800787 [19:03:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28682 and previous config saved to /var/cache/conftool/dbconfig/20220527-190320-ladsgroup.json [19:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:29] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [19:04:02] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb2002-dev is not behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/800787 (owner: 10Andrew Bogott) [19:06:25] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:08:40] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10Dzahn) a:03thcipriani (if this needs an additional sponsor I can be that) [19:10:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28683 and previous config saved to /var/cache/conftool/dbconfig/20220527-191015-ladsgroup.json [19:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:23] T309311: Make user_editcount unsigned in production - 
https://phabricator.wikimedia.org/T309311 [19:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T60674)', diff saved to https://phabricator.wikimedia.org/P28684 and previous config saved to /var/cache/conftool/dbconfig/20220527-191039-ladsgroup.json [19:10:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:10:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:45] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:10:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28685 and previous config saved to /var/cache/conftool/dbconfig/20220527-191047-ladsgroup.json [19:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:43] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:15:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28686 and previous config saved to /var/cache/conftool/dbconfig/20220527-191508-ladsgroup.json [19:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:07] (03CR) 10Dzahn: [C: 03+2] parsoid::testing: add an auto_restart service for nginx [puppet] - 10https://gerrit.wikimedia.org/r/800241 (owner: 10Dzahn) [19:18:29] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1113:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28687 and previous config saved to /var/cache/conftool/dbconfig/20220527-191829-ladsgroup.json [19:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:36] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:20:28] 10SRE, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [19:22:37] (03PS45) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [19:25:07] (03CR) 10CI reject: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28688 and previous config saved to /var/cache/conftool/dbconfig/20220527-192521-ladsgroup.json [19:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:30:09] 10SRE, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [19:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28689 and previous config saved to /var/cache/conftool/dbconfig/20220527-193013-ladsgroup.json [19:30:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28690 and previous config saved to /var/cache/conftool/dbconfig/20220527-193334-ladsgroup.json [19:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:07] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:40:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28691 and previous config saved to /var/cache/conftool/dbconfig/20220527-194026-ladsgroup.json [19:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:33] (03PS1) 10Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) [19:44:48] (03CR) 10CI reject: [V: 04-1] Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [19:45:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28692 and previous config saved to /var/cache/conftool/dbconfig/20220527-194518-ladsgroup.json [19:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:34] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) (owner: 
10Stang) [19:48:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28693 and previous config saved to /var/cache/conftool/dbconfig/20220527-194839-ladsgroup.json [19:48:40] (03PS6) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [19:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:38] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [19:51:19] (03CR) 10Cwhite: [C: 03+2] aptrepo: add opensearch2 thirdparty component [puppet] - 10https://gerrit.wikimedia.org/r/800294 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [19:55:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28694 and previous config saved to /var/cache/conftool/dbconfig/20220527-195531-ladsgroup.json [19:55:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:55:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:38] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [19:55:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28695 and previous config saved to /var/cache/conftool/dbconfig/20220527-195539-ladsgroup.json [19:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:52] (03PS1) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [19:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:30] (03CR) 10CI reject: [V: 04-1] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:00:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T309311)', diff saved to https://phabricator.wikimedia.org/P28696 and previous config saved to /var/cache/conftool/dbconfig/20220527-200023-ladsgroup.json [20:00:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [20:00:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [20:00:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:30] (03PS2) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:00:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T309311)', diff saved to https://phabricator.wikimedia.org/P28697 and previous config saved to 
/var/cache/conftool/dbconfig/20220527-200037-ladsgroup.json [20:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:53] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [20:01:07] (03CR) 10CI reject: [V: 04-1] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:02:33] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:02:42] (03CR) 10Stang: "Will wait for the dependent patch got merged. Also don't forget addition of $wmgSiteLogoVariantFallback for these four newly added sites." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:03:24] (03PS3) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:03:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T60674)', diff saved to https://phabricator.wikimedia.org/P28698 and previous config saved to /var/cache/conftool/dbconfig/20220527-200344-ladsgroup.json [20:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:51] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:03:59] (03CR) 10CI reject: [V: 04-1] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:04:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T309311)', diff saved to https://phabricator.wikimedia.org/P28699 and previous config saved to /var/cache/conftool/dbconfig/20220527-200453-ladsgroup.json [20:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:09] (03PS1) 10Andrew Bogott: Add fake db passwords for OpenStack Heato [labs/private] - 10https://gerrit.wikimedia.org/r/800796 (https://phabricator.wikimedia.org/T309407) [20:07:11] (03PS2) 10Andrew Bogott: Add fake db passwords for OpenStack Heat [labs/private] - 10https://gerrit.wikimedia.org/r/800796 (https://phabricator.wikimedia.org/T309407) [20:14:21] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add fake db passwords for OpenStack Heat [labs/private] - 10https://gerrit.wikimedia.org/r/800796 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:14:27] (03PS4) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 
10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:15:03] (03CR) 10CI reject: [V: 04-1] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:19:58] (03PS5) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:20:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P28700 and previous config saved to /var/cache/conftool/dbconfig/20220527-201959-ladsgroup.json [20:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:35] (03CR) 10CI reject: [V: 04-1] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:21:34] (03PS6) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:22:54] (03PS7) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:24:48] (03PS46) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [20:25:02] (03PS8) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [20:27:36] (03CR) 10CI reject: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:35:05] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P28701 and previous config saved to /var/cache/conftool/dbconfig/20220527-203504-ladsgroup.json [20:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:15] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:41:53] (03PS47) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [20:43:41] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:55] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) p:05Triage→03High [20:44:55] (03CR) 10CI reject: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:45:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28702 and previous config saved to /var/cache/conftool/dbconfig/20220527-204505-ladsgroup.json [20:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:13] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [20:50:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T309311)', 
diff saved to https://phabricator.wikimedia.org/P28703 and previous config saved to /var/cache/conftool/dbconfig/20220527-205009-ladsgroup.json [20:50:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [20:50:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [20:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:16] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [20:50:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T309311)', diff saved to https://phabricator.wikimedia.org/P28704 and previous config saved to /var/cache/conftool/dbconfig/20220527-205017-ladsgroup.json [20:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:58] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (10dancy) I rebooted using the horizon UI. 
[20:52:11] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) 05Open→03Resolved a:03dancy @dancy rebooted `deployment-deploy03` and it is now accessible [20:54:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T309311)', diff saved to https://phabricator.wikimedia.org/P28705 and previous config saved to /var/cache/conftool/dbconfig/20220527-205434-ladsgroup.json [20:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28706 and previous config saved to /var/cache/conftool/dbconfig/20220527-210010-ladsgroup.json [21:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:26] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) (owner: 10Ori) [21:09:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P28707 and previous config saved to /var/cache/conftool/dbconfig/20220527-210939-ladsgroup.json [21:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:02] (03PS48) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [21:15:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28708 and previous config saved to /var/cache/conftool/dbconfig/20220527-211515-ladsgroup.json [21:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:33] (03CR) 10CI reject: [V: 04-1] Create REST api 
service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P28709 and previous config saved to /var/cache/conftool/dbconfig/20220527-212444-ladsgroup.json [21:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:26] (03PS49) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [21:30:01] (03CR) 10CI reject: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28710 and previous config saved to /var/cache/conftool/dbconfig/20220527-213020-ladsgroup.json [21:30:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:30:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:26] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28711 and previous config saved to /var/cache/conftool/dbconfig/20220527-213028-ladsgroup.json [21:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:30] (03PS9) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [21:32:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298560)', diff saved to https://phabricator.wikimedia.org/P28712 and previous config saved to /var/cache/conftool/dbconfig/20220527-213255-ladsgroup.json [21:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:02] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [21:33:07] (03CR) 10CI reject: [V: 04-1] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [21:33:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:35:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:36:22] (03PS10) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [21:37:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:38:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, 
dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T309311)', diff saved to https://phabricator.wikimedia.org/P28713 and previous config saved to /var/cache/conftool/dbconfig/20220527-213949-ladsgroup.json [21:39:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [21:39:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [21:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:56] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:39:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T309311)', diff saved to https://phabricator.wikimedia.org/P28714 and previous config saved to /var/cache/conftool/dbconfig/20220527-213957-ladsgroup.json [21:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:40] (03PS50) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [21:41:50] (03PS11) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [21:42:32] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) 05Resolved→03Open a:05dancy→03TheresNoTime Issue repeated, looking at it now [21:43:21] (03CR) 10CI reject: [V: 04-1] Rough 
in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [21:43:49] (03PS12) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [21:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T309311)', diff saved to https://phabricator.wikimedia.org/P28715 and previous config saved to /var/cache/conftool/dbconfig/20220527-214414-ladsgroup.json [21:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:45] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:45:11] (03CR) 10CI reject: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:48:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28716 and previous config saved to /var/cache/conftool/dbconfig/20220527-214800-ladsgroup.json [21:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:35] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 46.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:51:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:51:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 
208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:51:49] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 75.78 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:52:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28717 and previous config saved to /var/cache/conftool/dbconfig/20220527-215213-ladsgroup.json [21:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:19] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:55:39] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:56:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P28718 and previous config saved to /var/cache/conftool/dbconfig/20220527-215919-ladsgroup.json [21:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28719 and previous config saved to /var/cache/conftool/dbconfig/20220527-220305-ladsgroup.json [22:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:05:45] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) p:05High→03Triage a:05TheresNoTime→03None [22:07:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28720 and previous config saved to /var/cache/conftool/dbconfig/20220527-220718-ladsgroup.json [22:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P28721 and previous config saved to /var/cache/conftool/dbconfig/20220527-221424-ladsgroup.json [22:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298560)', diff saved to https://phabricator.wikimedia.org/P28722 and previous config saved to /var/cache/conftool/dbconfig/20220527-221810-ladsgroup.json [22:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:18] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [22:18:45] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:22:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28723 and previous config saved to /var/cache/conftool/dbconfig/20220527-222223-ladsgroup.json [22:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T309311)', diff saved to 
https://phabricator.wikimedia.org/P28724 and previous config saved to /var/cache/conftool/dbconfig/20220527-222929-ladsgroup.json [22:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:37] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [22:33:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:36:39] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) While running a step of `beta-update-databases-eqiad`, we go OOM and unresponsive: ` PID USER PR NI VIRT RES SHR S %CPU %MEM... 
[22:37:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T309311)', diff saved to https://phabricator.wikimedia.org/P28725 and previous config saved to /var/cache/conftool/dbconfig/20220527-223728-ladsgroup.json [22:37:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:37:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:36] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [22:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:45] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:20] (03PS2) 10Jforrester: Follow-up I8d62aedb: Fix .rotation mixin [core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800696 (owner: 10Krinkle) [22:41:39] (03CR) 10Jforrester: "Re-cherry-picked now it's merged so we get the nice git hash in the blame." 
[core] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/800696 (owner: 10Krinkle) [22:55:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:55:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:55:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [22:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [22:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:06] (03PS13) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [22:58:08] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) a:03TheresNoTime [22:59:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:00:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [23:00:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [23:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1182 (T309311)', diff saved to https://phabricator.wikimedia.org/P28726 and previous config saved to /var/cache/conftool/dbconfig/20220527-230040-ladsgroup.json [23:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:48] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [23:00:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:02:37] (03PS14) 10Andrew Bogott: Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) [23:06:17] (03CR) 10Andrew Bogott: [C: 03+2] Rough in manifest and config for OpenStack Heat [puppet] - 10https://gerrit.wikimedia.org/r/800794 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [23:09:49] (03PS1) 10Andrew Bogott: Pass in codfw1dev-specific rabbit pass to heat profile [puppet] - 10https://gerrit.wikimedia.org/r/800810 (https://phabricator.wikimedia.org/T309407) [23:13:04] (03CR) 10Andrew Bogott: [C: 03+2] Pass in codfw1dev-specific rabbit pass to heat profile [puppet] - 10https://gerrit.wikimedia.org/r/800810 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [23:16:57] (03PS1) 10Andrew Bogott: Add initial (mostly empty) policy.yaml for OpenStack heat [puppet] - 10https://gerrit.wikimedia.org/r/800811 (https://phabricator.wikimedia.org/T309407) [23:17:47] (03CR) 10CI reject: [V: 04-1] Add initial (mostly empty) policy.yaml for OpenStack heat [puppet] - 10https://gerrit.wikimedia.org/r/800811 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [23:21:49] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 crashed twice 
- https://phabricator.wikimedia.org/T309413 (10Zabe) FTR, it seems like beta-update-databases-eqiad was running out of memory while trying to perform the migration added in https://gerrit.wikimedia.org/r/c/med... [23:26:53] (03PS2) 10Andrew Bogott: Add initial (mostly empty) policy.yaml for OpenStack heat [puppet] - 10https://gerrit.wikimedia.org/r/800811 (https://phabricator.wikimedia.org/T309407) [23:29:00] (03CR) 10Andrew Bogott: [C: 03+2] Add initial (mostly empty) policy.yaml for OpenStack heat [puppet] - 10https://gerrit.wikimedia.org/r/800811 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [23:36:10] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) [23:38:12] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) [23:40:05] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) [23:41:53] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T309311)', diff saved to https://phabricator.wikimedia.org/P28727 and previous config saved to /var/cache/conftool/dbconfig/20220527-234427-ladsgroup.json [23:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:38] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [23:55:26] (03PS1) 10Andrew Bogott: Add transport_url to heat.conf 
[puppet] - 10https://gerrit.wikimedia.org/r/800823 (https://phabricator.wikimedia.org/T309407) [23:57:40] (03CR) 10Andrew Bogott: [C: 03+2] Add transport_url to heat.conf [puppet] - 10https://gerrit.wikimedia.org/r/800823 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [23:59:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28728 and previous config saved to /var/cache/conftool/dbconfig/20220527-235932-ladsgroup.json [23:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log