[00:00:04] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T0000). [00:00:04] No Gerrit patches in the queue for this window AFAICS. [00:00:21] indeed, nothing to do [00:01:08] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:03:53] (Juniper alarm active) resolved: Alert for device lsw3-codfw.mgmt.codfw.wmnet - Juniper alarm active got acknowledged - https://alerts.wikimedia.org [00:05:21] (PS1) 4nn1l2: jawikivoyage: Change talk namespace names from トーク to ノート [mediawiki-config] - https://gerrit.wikimedia.org/r/761497 (https://phabricator.wikimedia.org/T262155) [00:05:43] SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) T287348#7699428 [00:06:40] Is B&C window open? [00:10:45] RoanKattouw and Urbanecm: Are you available?
[00:10:57] jouncebot: now [00:10:57] For the next 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T0000) [00:11:20] nn1l2: you added your patch 10 minutes after the empty window started, so up to the deployers if they come back for this [00:11:40] yes, I know [00:11:51] no complaints [00:12:00] cool cool [00:12:08] Yes I'm here [00:12:09] I can reschedule it if need be [00:12:10] I can deploy [00:12:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:12:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:52] (CR) Catrope: [C: +2] jawikivoyage: Change talk namespace names from トーク to ノート [mediawiki-config] - https://gerrit.wikimedia.org/r/761497 (https://phabricator.wikimedia.org/T262155) (owner: 4nn1l2) [00:13:55] (Merged) jenkins-bot: jawikivoyage: Change talk namespace names from トーク to ノート [mediawiki-config] - https://gerrit.wikimedia.org/r/761497 (https://phabricator.wikimedia.org/T262155) (owner: 4nn1l2) [00:14:48] nn1l2: Your change is on mwdebug1002, please test [00:14:56] ok [00:17:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:29] LGTM [00:18:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:18:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:18] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:37] RoanKattouw: you can sync, thanks [00:19:17] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761497|jawikivoyage: Change talk namespace names from トーク to ノート (T262155)]] (duration: 00m 54s) [00:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:21] T262155: Request for settings about namespaces on ja.wikivoyage - https://phabricator.wikimedia.org/T262155 [00:19:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:06] Deployed [00:20:12] Thanks! [00:20:13] I ran namespaceDupes.php and it didn't find anything [00:24:03] RoanKattouw: I forgot Module namespace. Are you available if I submit another patch in at most 5 mins? Sorry :( [00:24:13] Sure! [00:24:36] It's 4:24pm here so I'm still at work :) [00:31:58] win 5 [00:32:02] (PS1) 4nn1l2: jawikivoyage: Change module talk namespace from トーク to ノート [mediawiki-config] - https://gerrit.wikimedia.org/r/761501 (https://phabricator.wikimedia.org/T262155) [00:33:01] (CR) Catrope: [C: +2] jawikivoyage: Change module talk namespace from トーク to ノート [mediawiki-config] - https://gerrit.wikimedia.org/r/761501 (https://phabricator.wikimedia.org/T262155) (owner: 4nn1l2) [00:33:44] (Merged) jenkins-bot: jawikivoyage: Change module talk namespace from トーク to ノート [mediawiki-config] - https://gerrit.wikimedia.org/r/761501 (https://phabricator.wikimedia.org/T262155) (owner: 4nn1l2) [00:34:07] Added [00:34:50] nn1l2: Ready for testing on mwdebug1002 [00:35:04] ok [00:36:06] RoanKattouw: Good to go [00:37:24] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761501|jawikivoyage: Change module talk namespace from トーク to ノート (T262155)]] (duration: 00m 50s) [00:37:28] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:29] T262155: Request for settings about namespaces on ja.wikivoyage - https://phabricator.wikimedia.org/T262155 [00:38:45] Ran namespaceDupes again and it found nothing again, thankfully [00:39:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:35] Thanks! [00:40:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:40:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:54] OOI, why are we changing these locally on a wiki? [00:42:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:06] (CR) Cwhite: "This is a good idea." [puppet] - https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: Herron) [00:47:42] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:17] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/761403 (https://phabricator.wikimedia.org/T299147) (owner: Herron) [00:50:53] (CR) Cwhite: [C: +1] "De-duplication here seems reasonable to me."
[puppet] - https://gerrit.wikimedia.org/r/761285 (owner: Filippo Giunchedi) [00:51:55] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/761279 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T0100). [01:15:06] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:19:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [01:19:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [01:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298554)', diff saved to https://phabricator.wikimedia.org/P20439 and previous config saved to /var/cache/conftool/dbconfig/20220210-011920-ladsgroup.json [01:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:25] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [01:26:30] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:55:08] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING:
Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edi [01:55:08] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:28:32] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:30:16] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298554)', diff saved to https://phabricator.wikimedia.org/P20440 and previous config saved to /var/cache/conftool/dbconfig/20220210-023749-ladsgroup.json [02:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:55] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [02:51:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:52:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20441 and previous config saved to /var/cache/conftool/dbconfig/20220210-025253-ladsgroup.json [02:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [03:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20442 and previous config saved to /var/cache/conftool/dbconfig/20220210-030758-ladsgroup.json [03:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298554)', diff saved to https://phabricator.wikimedia.org/P20443 and previous config saved to /var/cache/conftool/dbconfig/20220210-032303-ladsgroup.json [03:23:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [03:23:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [03:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298554)', diff saved to https://phabricator.wikimedia.org/P20444 and previous config saved to /var/cache/conftool/dbconfig/20220210-032310-ladsgroup.json [03:24:08] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [03:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:28] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 166 probes of 660 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:34:52] RECOVERY - IPv6 ping 
to esams on ripe-atlas-esams IPv6 is OK: OK - failed 61 probes of 660 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:40:44] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:59:52] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:27:48] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edi [04:27:48] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:30:06] (PS1) Jforrester: Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/761408 (https://phabricator.wikimedia.org/T301433) [04:30:52] (PS1) Jforrester: Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/761409 (https://phabricator.wikimedia.org/T301433) [04:42:08] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:54:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298554)', diff saved to https://phabricator.wikimedia.org/P20445 and previous config saved to
/var/cache/conftool/dbconfig/20220210-045442-ladsgroup.json [04:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:48] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [04:56:52] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:59:14] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:09:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20446 and previous config saved to /var/cache/conftool/dbconfig/20220210-050946-ladsgroup.json [05:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20447 and previous config saved to /var/cache/conftool/dbconfig/20220210-052451-ladsgroup.json [05:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298554)', diff saved to https://phabricator.wikimedia.org/P20448 and previous config saved to /var/cache/conftool/dbconfig/20220210-053956-ladsgroup.json [05:39:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [05:39:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [05:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:01] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [05:40:04] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1135 (T298554)', diff saved to https://phabricator.wikimedia.org/P20449 and previous config saved to /var/cache/conftool/dbconfig/20220210-054003-ladsgroup.json [05:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20450 and previous config saved to /var/cache/conftool/dbconfig/20220210-054045-root.json [05:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:49] (PS1) Marostegui: add_tl_target_id_T300775.py: Increase downtime [software/schema-changes] - https://gerrit.wikimedia.org/r/761525 (https://phabricator.wikimedia.org/T300775) [05:42:08] (CR) Marostegui: [V: +2 C: +2] add_tl_target_id_T300775.py: Increase downtime [software/schema-changes] - https://gerrit.wikimedia.org/r/761525 (https://phabricator.wikimedia.org/T300775) (owner: Marostegui) [05:48:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:49:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:49:08] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked group from s1 eqiad
T263127', diff saved to https://phabricator.wikimedia.org/P20451 and previous config saved to /var/cache/conftool/dbconfig/20220210-054911-marostegui.json [05:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:16] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [05:50:48] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:51:32] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:51:49] I'll look in a few minutes at lists1001 [05:52:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:52:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:52:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:53:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:53:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:53:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [05:53:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [05:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [05:53:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [05:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T300382)', diff saved to https://phabricator.wikimedia.org/P20452 and previous config saved to /var/cache/conftool/dbconfig/20220210-055400-marostegui.json [05:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:04] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [05:54:33] (PS1) Marostegui:
change_ir_value_T300382.py: Fixes [software/schema-changes] - https://gerrit.wikimedia.org/r/761526 (https://phabricator.wikimedia.org/T300382) [05:54:48] (CR) Marostegui: [V: +2 C: +2] change_ir_value_T300382.py: Fixes [software/schema-changes] - https://gerrit.wikimedia.org/r/761526 (https://phabricator.wikimedia.org/T300382) (owner: Marostegui) [05:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300382)', diff saved to https://phabricator.wikimedia.org/P20453 and previous config saved to /var/cache/conftool/dbconfig/20220210-055507-marostegui.json [05:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20454 and previous config saved to /var/cache/conftool/dbconfig/20220210-055548-root.json [05:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:28] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-04-26 08:09:10 +0000 (expires in 75 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:57:58] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 26 Apr 2022 08:09:10 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:58:26] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 26 Apr 2022 08:09:10 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:01:30] !log Drop tendril database from db1115 T297605 [06:01:33] it recovered by itself, nice [06:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:34] T297605: Shutdown Tendril and dbtree - https://phabricator.wikimedia.org/T297605 [06:02:35] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:07:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1115.eqiad.wmnet with OS bullseye [06:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20455 and previous config saved to /var/cache/conftool/dbconfig/20220210-061012-marostegui.json [06:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20456 and previous config saved to /var/cache/conftool/dbconfig/20220210-061052-root.json [06:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:23] (PS1) Marostegui: netboot.cfg: Add db1115 to RAID1 partman [puppet] - https://gerrit.wikimedia.org/r/761527 (https://phabricator.wikimedia.org/T297605) [06:13:00] (CR) Marostegui: [C: +2] netboot.cfg: Add db1115 to RAID1 partman [puppet] - https://gerrit.wikimedia.org/r/761527 (https://phabricator.wikimedia.org/T297605) (owner: Marostegui) [06:13:09] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1115.eqiad.wmnet with OS bullseye [06:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:43] SRE,
SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (AndyRussG) Thanks again @jcrespo! I tried to find a Phab task about the original Google Search Console setup, to see who might have been involved in that, but I couldn't find one. T... [06:18:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1115.eqiad.wmnet with OS bullseye [06:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:12] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1115.eqiad.wmnet with OS bullseye [06:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:16] (PS1) Marostegui: netboot.cfg: Use reuse-raid1-2dev.cfg [puppet] - https://gerrit.wikimedia.org/r/761528 (https://phabricator.wikimedia.org/T297605) [06:24:02] (CR) Marostegui: [C: +2] netboot.cfg: Use reuse-raid1-2dev.cfg [puppet] - https://gerrit.wikimedia.org/r/761528 (https://phabricator.wikimedia.org/T297605) (owner: Marostegui) [06:25:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20457 and previous config saved to /var/cache/conftool/dbconfig/20220210-062517-marostegui.json [06:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20458 and previous config saved to /var/cache/conftool/dbconfig/20220210-062556-root.json [06:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:16] (CR) Razzi: Add cookbooks for running maintain-views (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi) [06:28:13] !log
marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1115.eqiad.wmnet with OS bullseye [06:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:22] (CR) Elukey: Add cookbooks for running maintain-views (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: Razzi) [06:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300382)', diff saved to https://phabricator.wikimedia.org/P20459 and previous config saved to /var/cache/conftool/dbconfig/20220210-064021-marostegui.json [06:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:40:27] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [06:40:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:40:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [06:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [06:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [06:40:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with
reason: Maintenance [06:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20460 and previous config saved to /var/cache/conftool/dbconfig/20220210-064049-marostegui.json [06:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20461 and previous config saved to /var/cache/conftool/dbconfig/20220210-064059-root.json [06:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:41:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300775)', diff saved to https://phabricator.wikimedia.org/P20462 and previous config saved to /var/cache/conftool/dbconfig/20220210-064149-marostegui.json [06:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:53] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:41:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20463 and previous config saved to 
/var/cache/conftool/dbconfig/20220210-064156-marostegui.json [06:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300775)', diff saved to https://phabricator.wikimedia.org/P20464 and previous config saved to /var/cache/conftool/dbconfig/20220210-064411-marostegui.json [06:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:43] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2006.codfw.wmnet with OS bullseye [06:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:23] (03CR) 10Krinkle: [C: 03+1] "LGTM, ready to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761441 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [06:56:38] (03PS1) 10Marostegui: site.pp: Remove tendrilr references [puppet] - 10https://gerrit.wikimedia.org/r/761529 (https://phabricator.wikimedia.org/T297605) [06:57:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20465 and previous config saved to /var/cache/conftool/dbconfig/20220210-065701-marostegui.json [06:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:11] (03PS2) 10Marostegui: site.pp: Remove tendril references [puppet] - 10https://gerrit.wikimedia.org/r/761529 (https://phabricator.wikimedia.org/T297605) [06:57:59] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove tendril references [puppet] - 10https://gerrit.wikimedia.org/r/761529 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [06:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298554)', diff saved to https://phabricator.wikimedia.org/P20466 and previous config saved to /var/cache/conftool/dbconfig/20220210-065842-ladsgroup.json [06:58:46] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:47] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [06:59:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P20467 and previous config saved to /var/cache/conftool/dbconfig/20220210-065916-marostegui.json [06:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [07:06:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1115.eqiad.wmnet with OS bullseye [07:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:35] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20468 and previous config saved to /var/cache/conftool/dbconfig/20220210-071206-marostegui.json [07:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20469 and previous config saved to /var/cache/conftool/dbconfig/20220210-071347-ladsgroup.json [07:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:51] (03PS1) 10ArielGlenn: Add a fixup script that does bulk noop jobs across wikis [dumps] - 
10https://gerrit.wikimedia.org/r/761532 (https://phabricator.wikimedia.org/T301373) [07:14:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P20470 and previous config saved to /var/cache/conftool/dbconfig/20220210-071421-marostegui.json [07:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2006.codfw.wmnet with OS bullseye [07:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:01] (03PS1) 10Marostegui: netboot.cfg: db1115 back to non format [puppet] - 10https://gerrit.wikimedia.org/r/761534 (https://phabricator.wikimedia.org/T297605) [07:21:09] (03CR) 10Marostegui: [C: 03+2] netboot.cfg: db1115 back to non format [puppet] - 10https://gerrit.wikimedia.org/r/761534 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [07:22:20] (03PS1) 10Marostegui: Revert "prometheus: Temporarily switch to db2093" [puppet] - 10https://gerrit.wikimedia.org/r/761410 [07:23:04] (03CR) 10Marostegui: [C: 03+2] Revert "prometheus: Temporarily switch to db2093" [puppet] - 10https://gerrit.wikimedia.org/r/761410 (owner: 10Marostegui) [07:27:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20471 and previous config saved to /var/cache/conftool/dbconfig/20220210-072711-marostegui.json [07:27:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:27:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:16] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf 
wikis - https://phabricator.wikimedia.org/T300382 [07:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T300382)', diff saved to https://phabricator.wikimedia.org/P20472 and previous config saved to /var/cache/conftool/dbconfig/20220210-072718-marostegui.json [07:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300382)', diff saved to https://phabricator.wikimedia.org/P20473 and previous config saved to /var/cache/conftool/dbconfig/20220210-072826-marostegui.json [07:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20474 and previous config saved to /var/cache/conftool/dbconfig/20220210-072852-ladsgroup.json [07:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300775)', diff saved to https://phabricator.wikimedia.org/P20475 and previous config saved to /var/cache/conftool/dbconfig/20220210-072925-marostegui.json [07:29:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:29:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:30] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:29:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300775)', diff saved to https://phabricator.wikimedia.org/P20476 and previous config saved to /var/cache/conftool/dbconfig/20220210-072933-marostegui.json [07:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:01] (03PS1) 10Marostegui: valid_section.pp: Remove tendril [puppet] - 10https://gerrit.wikimedia.org/r/761535 (https://phabricator.wikimedia.org/T297605) [07:31:19] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:02] (03CR) 10Marostegui: [C: 03+2] valid_section.pp: Remove tendril [puppet] - 10https://gerrit.wikimedia.org/r/761535 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [07:36:48] (03CR) 10Filippo Giunchedi: [C: 03+1] add new prometheus hosts to labs-in[4,6] [homer/public] - 10https://gerrit.wikimedia.org/r/761435 (https://phabricator.wikimedia.org/T301376) (owner: 10Herron) [07:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20477 and previous config saved to /var/cache/conftool/dbconfig/20220210-074331-marostegui.json [07:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298554)', diff saved to https://phabricator.wikimedia.org/P20478 and previous config saved to /var/cache/conftool/dbconfig/20220210-074356-ladsgroup.json [07:43:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [07:44:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [07:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:01] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [07:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298554)', diff saved to https://phabricator.wikimedia.org/P20479 and previous config saved to /var/cache/conftool/dbconfig/20220210-074404-ladsgroup.json [07:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:24] (03CR) 10Filippo Giunchedi: "After I3c855c8 by John this is a noop as expected!" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [07:49:15] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: inhibit warnings when a match critical alert is firing [puppet] - 10https://gerrit.wikimedia.org/r/761285 (owner: 10Filippo Giunchedi) [07:49:57] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add 'tcp' probe type [puppet] - 10https://gerrit.wikimedia.org/r/761279 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [07:52:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [07:55:38] (03CR) 10Filippo Giunchedi: "In addition to Cole's question, this LGTM if we're shipping all default modules by default (which I believe is the case)" [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [07:56:49] (03PS1) 10Marostegui: tendril.sql.erb: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761572 (https://phabricator.wikimedia.org/T297605) 
[07:57:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [07:57:19] (03CR) 10Filippo Giunchedi: [C: 03+1] watchrat: route donate.wm.o alerts to fr-ircmail [puppet] - 10https://gerrit.wikimedia.org/r/761403 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [07:57:29] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite: whisper_cleanup: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/751471 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [07:57:35] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: whisper_cleanup: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/751471 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [07:57:43] (03CR) 10Marostegui: [C: 03+2] tendril.sql.erb: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761572 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [07:58:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20480 and previous config saved to /var/cache/conftool/dbconfig/20220210-075836-marostegui.json [07:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:42] (03PS1) 10Marostegui: tendril/maintenance.pp: Files ensure to be absent [puppet] - 10https://gerrit.wikimedia.org/r/761573 (https://phabricator.wikimedia.org/T297605) [08:01:57] (03PS1) 10ArielGlenn: fix up flow dumps config for deployment-prep cluster [puppet] - 10https://gerrit.wikimedia.org/r/761574 (https://phabricator.wikimedia.org/T300760) [08:02:53] (03CR) 10jenkins-bot: [V: 04-1] tendril/maintenance.pp: Files ensure to be absent [puppet] - 10https://gerrit.wikimedia.org/r/761573 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:04:17] (03PS2) 10Marostegui: tendril/maintenance.pp: Files ensure to be absent [puppet] -
10https://gerrit.wikimedia.org/r/761573 (https://phabricator.wikimedia.org/T297605) [08:08:05] (03PS1) 10Marostegui: Revert "valid_section.pp: Remove tendril" [puppet] - 10https://gerrit.wikimedia.org/r/761411 [08:10:07] (03CR) 10Marostegui: [C: 03+2] Revert "valid_section.pp: Remove tendril" [puppet] - 10https://gerrit.wikimedia.org/r/761411 (owner: 10Marostegui) [08:10:52] (03CR) 10Muehlenhoff: [C: 03+2] Set cookie_secure: On for superset [puppet] - 10https://gerrit.wikimedia.org/r/761388 (owner: 10Muehlenhoff) [08:13:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300382)', diff saved to https://phabricator.wikimedia.org/P20481 and previous config saved to /var/cache/conftool/dbconfig/20220210-081340-marostegui.json [08:13:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:13:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:13:46] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [08:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T300382)', diff saved to 
https://phabricator.wikimedia.org/P20482 and previous config saved to /var/cache/conftool/dbconfig/20220210-081354-marostegui.json [08:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300382)', diff saved to https://phabricator.wikimedia.org/P20483 and previous config saved to /var/cache/conftool/dbconfig/20220210-081501-marostegui.json [08:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:12] (03PS1) 10Marostegui: db_inventory.my.cnf: Remove tendril specific options [puppet] - 10https://gerrit.wikimedia.org/r/761575 (https://phabricator.wikimedia.org/T297605) [08:18:16] (03PS2) 10ArielGlenn: fix up flow dumps config for deployment-prep cluster [puppet] - 10https://gerrit.wikimedia.org/r/761574 (https://phabricator.wikimedia.org/T300760) [08:20:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33679/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:20:34] (03CR) 10ArielGlenn: [C: 03+2] fix up flow dumps config for deployment-prep cluster [puppet] - 10https://gerrit.wikimedia.org/r/761574 (https://phabricator.wikimedia.org/T300760) (owner: 10ArielGlenn) [08:20:55] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33681/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:21:11] (03PS2) 10Marostegui: db_inventory.my.cnf: Remove tendril specific options [puppet] - 10https://gerrit.wikimedia.org/r/761575 (https://phabricator.wikimedia.org/T297605) [08:21:36] (03CR) 10Filippo 
Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33682/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:22:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33684/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:22:54] (03CR) 10Marostegui: "This https://puppet-compiler.wmflabs.org/pcc-worker1003/33683/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/761575 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:23:06] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33685/console" [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:23:10] (03CR) 10Marostegui: [C: 03+2] db_inventory.my.cnf: Remove tendril specific options [puppet] - 10https://gerrit.wikimedia.org/r/761575 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:27:40] (03PS1) 10Marostegui: production.sql.erb: Remove tendril user [puppet] - 10https://gerrit.wikimedia.org/r/761577 (https://phabricator.wikimedia.org/T297605) [08:27:51] (03CR) 10Muehlenhoff: [C: 03+2] turnilo: Set cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761389 (owner: 10Muehlenhoff) [08:29:50] (03CR) 10Marostegui: [C: 03+2] production.sql.erb: Remove tendril user [puppet] - 10https://gerrit.wikimedia.org/r/761577 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20484 and previous config saved to 
/var/cache/conftool/dbconfig/20220210-083006-marostegui.json [08:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:55] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 70 probes of 659 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:40:31] (03PS2) 10Muehlenhoff: profile::idp::client::httpd::site: Default cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761390 [08:42:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/761390 (owner: 10Muehlenhoff) [08:44:13] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 56 probes of 659 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:44:47] (03PS1) 10Marostegui: change_mw_mysql_pass.sh: Tendril is gone [software] - 10https://gerrit.wikimedia.org/r/761578 (https://phabricator.wikimedia.org/T297605) [08:45:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20485 and previous config saved to /var/cache/conftool/dbconfig/20220210-084511-marostegui.json [08:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:39] (03CR) 10Marostegui: [C: 03+2] change_mw_mysql_pass.sh: Tendril is gone [software] - 10https://gerrit.wikimedia.org/r/761578 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:46:07] (03Merged) 10jenkins-bot: change_mw_mysql_pass.sh: Tendril is gone [software] - 10https://gerrit.wikimedia.org/r/761578 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:48:09] (03PS3) 10Muehlenhoff: profile::idp::client::httpd::site: Default 
cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761390 [08:48:33] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/761390 (owner: 10Muehlenhoff) [08:52:08] (03CR) 10ArielGlenn: [C: 03+2] Add a fixup script that does bulk noop jobs across wikis [dumps] - 10https://gerrit.wikimedia.org/r/761532 (https://phabricator.wikimedia.org/T301373) (owner: 10ArielGlenn) [08:52:32] (03Merged) 10jenkins-bot: Add a fixup script that does bulk noop jobs across wikis [dumps] - 10https://gerrit.wikimedia.org/r/761532 (https://phabricator.wikimedia.org/T301373) (owner: 10ArielGlenn) [08:53:56] (03CR) 10Jgiannelos: [C: 03+1] tegola: fix label cut on place_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/761481 (https://phabricator.wikimedia.org/T228612) (owner: 10MSantos) [09:00:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300382)', diff saved to https://phabricator.wikimedia.org/P20486 and previous config saved to /var/cache/conftool/dbconfig/20220210-090016-marostegui.json [09:00:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:00:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:21] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [09:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20487 and 
previous config saved to /var/cache/conftool/dbconfig/20220210-090023-marostegui.json [09:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20488 and previous config saved to /var/cache/conftool/dbconfig/20220210-090129-marostegui.json [09:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:36] 10SRE, 10ops-eqiad: Allocate new cabs for WMCS in rows E/F Eqiad - https://phabricator.wikimedia.org/T301414 (10Peachey88) [09:04:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298554)', diff saved to https://phabricator.wikimedia.org/P20489 and previous config saved to /var/cache/conftool/dbconfig/20220210-090415-ladsgroup.json [09:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:20] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [09:19:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20491 and previous config saved to /var/cache/conftool/dbconfig/20220210-091920-ladsgroup.json [09:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:05] (03CR) 10Muehlenhoff: [C: 03+2] profile::idp::client::httpd::site: Default cookie_secure to On [puppet] - 10https://gerrit.wikimedia.org/r/761390 (owner: 10Muehlenhoff) [09:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges group from s1 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P20492 and previous config saved to /var/cache/conftool/dbconfig/20220210-092727-marostegui.json [09:27:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:32] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [09:30:12] !log Remove watchdog@10.% user from db2071 T301442 [09:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:16] T301442: Audit and remove watchdog user - https://phabricator.wikimedia.org/T301442 [09:31:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P20493 and previous config saved to /var/cache/conftool/dbconfig/20220210-093141-marostegui.json [09:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20494 and previous config saved to /var/cache/conftool/dbconfig/20220210-093425-ladsgroup.json [09:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:56] (03PS1) 10Elukey: Add ml-serve2006 to the ml-serve-codfw k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/761584 (https://phabricator.wikimedia.org/T300744) [09:41:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33687/console" [puppet] - 10https://gerrit.wikimedia.org/r/761584 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [09:42:18] (03PS1) 10Marostegui: production.pp: Remove tendril grants [puppet] - 10https://gerrit.wikimedia.org/r/761585 (https://phabricator.wikimedia.org/T297605) [09:43:16] !log update pcc facts [09:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20495 and previous config saved to 
/var/cache/conftool/dbconfig/20220210-094647-marostegui.json [09:46:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:46:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:53] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [09:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T300382)', diff saved to https://phabricator.wikimedia.org/P20496 and previous config saved to /var/cache/conftool/dbconfig/20220210-094655-marostegui.json [09:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300382)', diff saved to https://phabricator.wikimedia.org/P20497 and previous config saved to /var/cache/conftool/dbconfig/20220210-094802-marostegui.json [09:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298554)', diff saved to https://phabricator.wikimedia.org/P20498 and previous config saved to /var/cache/conftool/dbconfig/20220210-094929-ladsgroup.json [09:49:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [09:49:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [09:49:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:35] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [09:49:35] (03PS1) 10Kosta Harlan: [WIP] linkrecommendation: Use json output format for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/761586 (https://phabricator.wikimedia.org/T296334) [09:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:00] (03PS2) 10Kosta Harlan: [WIP] linkrecommendation: Use json output format for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/761586 (https://phabricator.wikimedia.org/T296334) [09:52:53] (03CR) 10Volans: [C: 03+1] "LGTM, all includes are correct and exists in the Netbox generated repo." [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [09:55:01] (03PS2) 10Elukey: Add ml-serve2006 to the ml-serve-codfw k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/761584 (https://phabricator.wikimedia.org/T300744) [09:55:03] (03PS1) 10Elukey: profile::kubernetes::node: avoid iptables alternatives for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/761587 (https://phabricator.wikimedia.org/T300744) [09:55:25] (03CR) 10Volans: "Actually, on second check, I think we're missing to include 2 files for the cross connections with the core routers:" [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [09:55:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33689/console" [puppet] - 10https://gerrit.wikimedia.org/r/761584 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [09:59:37] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Reverse DNS zones includes for drmrs - 
https://phabricator.wikimedia.org/T301447 (10Volans) [09:59:43] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Reverse DNS zones includes for drmrs - https://phabricator.wikimedia.org/T301447 (10Volans) p:05Triage→03Medium [10:03:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20499 and previous config saved to /var/cache/conftool/dbconfig/20220210-100307-marostegui.json [10:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:14] (03PS3) 10Kosta Harlan: linkrecommendation: Use json output format for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/761586 (https://phabricator.wikimedia.org/T296334) [10:12:15] (03CR) 10Giuseppe Lavagetto: "Thanks for taking the time to review this patch." [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [10:12:51] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Use json output format for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/761586 (https://phabricator.wikimedia.org/T296334) (owner: 10Kosta Harlan) [10:16:31] (03PS8) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [10:16:34] (03PS7) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [10:16:35] (03Merged) 10jenkins-bot: linkrecommendation: Use json output format for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/761586 (https://phabricator.wikimedia.org/T296334) (owner: 10Kosta Harlan) [10:18:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20500 and previous config saved to /var/cache/conftool/dbconfig/20220210-101812-marostegui.json [10:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:33:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300382)', diff saved to https://phabricator.wikimedia.org/P20501 and previous config saved to /var/cache/conftool/dbconfig/20220210-103317-marostegui.json [10:33:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:33:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:22] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [10:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T300382)', diff saved to https://phabricator.wikimedia.org/P20502 and previous config saved to /var/cache/conftool/dbconfig/20220210-103324-marostegui.json [10:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:30] (03CR) 10Ladsgroup: [C: 03+1] production.pp: Remove tendril grants [puppet] - 10https://gerrit.wikimedia.org/r/761585 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [10:39:19] (03CR) 10Ladsgroup: [C: 03+1] tendril/maintenance.pp: Files ensure to be absent [puppet] - 10https://gerrit.wikimedia.org/r/761573 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [10:39:27] (03CR) 10Marostegui: [C: 03+2] tendril/maintenance.pp: Files ensure to be absent [puppet] - 10https://gerrit.wikimedia.org/r/761573 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [10:39:41] (03CR) 10Marostegui: [C: 03+2] production.pp: Remove tendril grants [puppet] - 10https://gerrit.wikimedia.org/r/761585 
(https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [10:41:58] (03Abandoned) 10Arturo Borrero Gonzalez: [DONT MERGE] cloud-in4: drop ACL entry for WMF wikis [homer/public] - 10https://gerrit.wikimedia.org/r/656886 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [10:42:05] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10fgiunchedi) [10:42:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] add new prometheus hosts to labs-in[4,6] [homer/public] - 10https://gerrit.wikimedia.org/r/761435 (https://phabricator.wikimedia.org/T301376) (owner: 10Herron) [10:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20503 and previous config saved to /var/cache/conftool/dbconfig/20220210-104208-root.json [10:42:10] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/761435 (https://phabricator.wikimedia.org/T301376) (owner: 10Herron) [10:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:39] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10akosiaris) LGTM. Docs for creating a Ganeti VM are at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM, feel free to proceed. [10:43:18] arturo: I saw you merged the ACL change, are you deploying the change too ? 
[10:43:29] (03PS1) 10JMeybohm: Add tcp-notls probe to k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/761590 (https://phabricator.wikimedia.org/T300740) [10:43:57] godog: yes [10:44:02] !log deploying https://gerrit.wikimedia.org/r/c/operations/homer/public/+/761435 to core routers [10:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:11] (03CR) 10Filippo Giunchedi: [C: 03+1] Add tcp-notls probe to k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/761590 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [10:44:14] godog: doing so as we speak [10:44:28] arturo: ack, thank you! [10:44:56] (03CR) 10JMeybohm: [C: 03+2] Add tcp-notls probe to k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/761590 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [10:46:00] !log installing ruby2.5 security updates [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:10] jouncebot: nowandnext [10:48:11] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [10:48:11] In 0 hour(s) and 11 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1100) [10:48:18] awesome [10:48:27] (03CR) 10Ladsgroup: [C: 03+2] Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761408 (https://phabricator.wikimedia.org/T301433) (owner: 10Jforrester) [10:48:31] (03CR) 10Ladsgroup: [C: 03+2] Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761409 (https://phabricator.wikimedia.org/T301433) (owner: 10Jforrester) [10:51:48] (03Merged) 10jenkins-bot: Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761408 
(https://phabricator.wikimedia.org/T301433) (owner: 10Jforrester) [10:51:52] (03CR) 10jerkins-bot: [V: 04-1] Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761409 (https://phabricator.wikimedia.org/T301433) (owner: 10Jforrester) [10:54:23] (03PS1) 10Filippo Giunchedi: hieradata: decom prometheus[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/761591 (https://phabricator.wikimedia.org/T296199) [10:55:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: move prometheus_nodes to WMCS role-based hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [10:55:23] (03CR) 10Ladsgroup: [C: 03+2] "." [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761409 (https://phabricator.wikimedia.org/T301433) (owner: 10Jforrester) [10:56:02] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: move prometheus_nodes to WMCS role-based hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/761294 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [10:57:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20505 and previous config saved to /var/cache/conftool/dbconfig/20220210-105713-root.json [10:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:37] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.21/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:761408|Short circut updating stats when the page is not reviewable (T301433)]] (duration: 00m 50s) [10:58:39] (03Merged) 10jenkins-bot: 
Short circut updating stats when the page is not reviewable [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761409 (https://phabricator.wikimedia.org/T301433) (owner: 10Jforrester) [10:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:42] T301433: Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases. - https://phabricator.wikimedia.org/T301433 [10:58:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:58:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298554)', diff saved to https://phabricator.wikimedia.org/P20506 and previous config saved to /var/cache/conftool/dbconfig/20220210-105853-ladsgroup.json [10:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:57] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [10:58:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:59:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1100) [11:00:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:43] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:01:16] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:761409|Short circut updating stats when the page is not reviewable (T301433)]] (duration: 00m 49s) [11:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1021.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [11:03:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1021.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [11:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:54] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [11:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:57] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [11:03:58] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [11:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:33] 10SRE, 
10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1021 [11:04:45] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [11:05:01] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [11:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:06:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:22] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on external [11:06:22] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on internal [11:06:25] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply on staging [11:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:07:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:38] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:07:59] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on external [11:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:33] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on internal [11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:43] (03PS2) 10Jbond: 2.1.0: prepare for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/761051 [11:08:48] (03CR) 10Jbond: [C: 03+2] populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [11:08:51] (03CR) 10Jbond: [C: 03+2] 2.1.0: prepare for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/761051 (owner: 10Jbond) [11:09:37] (03PS1) 10Kosta Harlan: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/761595 (https://phabricator.wikimedia.org/T296334) [11:09:45] (03PS7) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [11:09:57] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on external [11:09:57] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on internal [11:09:58] (03Merged) 10jenkins-bot: populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [11:09:59] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:00] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply on staging [11:10:02] (03Merged) 10jenkins-bot: 2.1.0: prepare for release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/761051 (owner: 10Jbond) [11:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:12] (03PS8) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [11:10:37] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/761595 (https://phabricator.wikimedia.org/T296334) (owner: 10Kosta Harlan) [11:11:06] (03PS2) 10Hnowlan: restbase: remove restbase2010 [puppet] - 10https://gerrit.wikimedia.org/r/761006 (https://phabricator.wikimedia.org/T295375) [11:11:49] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on external [11:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20507 and previous config saved to /var/cache/conftool/dbconfig/20220210-111217-root.json [11:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:23] (03CR) 10jerkins-bot: [V: 04-1] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [11:14:23] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on internal [11:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:32] 
(03Merged) 10jenkins-bot: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/761595 (https://phabricator.wikimedia.org/T296334) (owner: 10Kosta Harlan) [11:14:32] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [11:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [11:14:36] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [11:14:37] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on staging [11:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:21] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [11:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:24] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [11:15:25] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [11:15:26] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on staging [11:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:10] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on internal [11:16:10] !log kharlan@deploy1002 helmfile [eqiad] START 
helmfile.d/services/linkrecommendation: apply on external [11:16:13] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply on staging [11:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:26] (03PS1) 10Marostegui: maintenance.pp: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761597 (https://phabricator.wikimedia.org/T297605) [11:16:36] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [11:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:39] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [11:16:40] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [11:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:36] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [11:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:41] (03CR) 10Klausman: [C: 03+1] Add ml-serve2006 to the ml-serve-codfw k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/761584 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:18:14] (03CR) 10Klausman: [C: 03+1] profile::kubernetes::node: avoid iptables alternatives for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/761587 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:18:23] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply on internal [11:18:23] !log kharlan@deploy1002 helmfile [eqiad] START 
helmfile.d/services/linkrecommendation: apply on external [11:18:25] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply on staging [11:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:48] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on external [11:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:16] (03CR) 10Ladsgroup: [C: 03+1] maintenance.pp: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761597 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [11:19:33] (03CR) 10Marostegui: [C: 03+2] maintenance.pp: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761597 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [11:19:45] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync on internal [11:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:14] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on internal [11:20:14] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply on external [11:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:18] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply on staging [11:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:20:30] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20508 and previous config saved to /var/cache/conftool/dbconfig/20220210-112034-marostegui.json [11:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:39] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [11:20:42] (03CR) 10Hnowlan: [C: 03+2] restbase: remove restbase2010 [puppet] - 10https://gerrit.wikimedia.org/r/761006 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [11:21:06] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on external [11:21:08] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync on internal [11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20509 and previous config saved to /var/cache/conftool/dbconfig/20220210-112147-marostegui.json [11:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:28] (03CR) 10JMeybohm: [C: 03+1] "This is indeed a bit confusing currently!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [11:25:09] (03PS1) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [11:26:05] (03CR) 10MSantos: [C: 03+2] tegola: fix label cut on place_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/761481 (https://phabricator.wikimedia.org/T228612) (owner: 10MSantos) [11:26:17] (03CR) 10jerkins-bot: [V: 04-1] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [11:26:26] (03PS1) 10Majavah: prod: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) [11:27:04] (03CR) 10Majavah: "I'm planning on deploying this early next week, unless you have objections." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [11:27:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20510 and previous config saved to /var/cache/conftool/dbconfig/20220210-112720-root.json [11:27:22] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10MoritzMuehlenhoff) >>! In T300985#7698166, @jhathaway wrote: >>>! In T300985#7696575, @MoritzMuehlenhoff wrote: >> Good catch! It seems a little mysterious though that this prob... 
[11:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:36] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2010.codfw.wmnet [11:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:50] (03PS1) 10ArielGlenn: don't allow api jobs to have the same name [dumps] - 10https://gerrit.wikimedia.org/r/761600 (https://phabricator.wikimedia.org/T301373) [11:29:10] (03CR) 10jerkins-bot: [V: 04-1] don't allow api jobs to have the same name [dumps] - 10https://gerrit.wikimedia.org/r/761600 (https://phabricator.wikimedia.org/T301373) (owner: 10ArielGlenn) [11:29:40] (03Merged) 10jenkins-bot: tegola: fix label cut on place_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/761481 (https://phabricator.wikimedia.org/T228612) (owner: 10MSantos) [11:30:30] (03CR) 104nn1l2: "I removed my -1 vote per T291737#7700244 maily because I don't want block it. I still believe localization should be done in the MediaWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [11:31:19] 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Jelto) 05Resolved→03Open It seems gitlab-runner metrics exporter for trusted runner have issues now. The auto-detected address of these runners changed to IPv6 as well and exporter... 
[11:32:07] (03PS1) 10Ladsgroup: DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761413 (https://phabricator.wikimedia.org/T301309) [11:32:09] (03CR) 104nn1l2: Change / add some namespaces and aliases on arywiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [11:32:20] (03PS1) 10Ladsgroup: DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761414 (https://phabricator.wikimedia.org/T301309) [11:32:21] (03PS2) 10ArielGlenn: don't allow api jobs to have the same name [dumps] - 10https://gerrit.wikimedia.org/r/761600 (https://phabricator.wikimedia.org/T301373) [11:33:35] (03PS9) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [11:35:55] (03PS2) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [11:36:07] (03CR) 10Ladsgroup: [C: 03+1] prod: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [11:36:29] (03CR) 10jerkins-bot: [V: 04-1] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [11:36:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P20511 and previous config saved to /var/cache/conftool/dbconfig/20220210-113651-marostegui.json [11:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:36:58] (03CR) 10Zabe: [C: 03+1] prod: WRITE_NEW for CentralAuth hidden level migration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [11:37:16] (03CR) 10Jbond: [C: 03+2] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [11:37:20] (03PS9) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [11:37:22] (03PS8) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [11:38:41] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [11:40:38] (03PS10) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [11:40:40] (03PS9) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [11:40:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase2010.codfw.wmnet [11:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:59] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by hnowlan@cumin1001 for... 
[11:41:08] (03PS1) 10Filippo Giunchedi: team-sre: catch WMCS Prometheus scrape failures [alerts] - 10https://gerrit.wikimedia.org/r/761604 (https://phabricator.wikimedia.org/T301376) [11:42:00] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [11:42:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20512 and previous config saved to /var/cache/conftool/dbconfig/20220210-114224-root.json [11:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:22] (03PS3) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [11:43:50] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2009.codfw.wmnet [11:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:30] (03CR) 10jerkins-bot: [V: 04-1] DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761413 (https://phabricator.wikimedia.org/T301309) (owner: 10Ladsgroup) [11:51:52] (03PS1) 104nn1l2: banwikisource: Fix logo size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761605 (https://phabricator.wikimedia.org/T296459) [11:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P20513 and previous config saved to /var/cache/conftool/dbconfig/20220210-115156-marostegui.json [11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase2009.codfw.wmnet 
[11:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:06] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by hnowlan@cumin1001 for... [11:54:08] (03PS11) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [11:54:10] (03PS10) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [11:55:44] (03PS16) 10D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) [11:57:29] (03PS1) 10Majavah: P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS [puppet] - 10https://gerrit.wikimedia.org/r/761606 [12:00:05] Amir1, Lucas_WMDE, and apergos: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1200). [12:00:05] xSavitar, zabe, and nn1l2: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:09] hi [12:00:11] o/ [12:00:14] hello. there are 4 patches in the window [12:00:15] o/ [12:00:17] no trainees have signed up [12:00:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33690/console" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [12:00:22] o/ [12:00:25] o/ [12:00:28] I see 3 patches without any code review (sadness!) [12:00:48] who among you patch owners can self deploy and who needs an assist? 
[12:00:59] I had been told that code review is not a hard requirement [12:01:19] I need a deployer [12:01:19] (all four patches look to my not entirely experienced eyes to be "not risky" patches) [12:01:36] nn1l2: noted! xSavitar, zabe ? [12:01:42] apergos: Not risky, yes! [12:01:45] me too [12:02:03] I can self deploy, but I want someone else to do it then I'll test :) [12:02:11] xSavitar: noted! [12:02:19] zabe, are you self deploy or not? [12:02:27] apergos: I need someone to do it for me [12:02:30] ok! [12:02:51] who among us wants to be the hands on deployer of the day? Lucas_WMDE, taavi, Amir1, me? [12:02:57] anything works for me [12:03:10] you spoke up first so you win the lottery :-D [12:03:15] lol [12:03:25] xSavitar: starting with your patch [12:03:28] (03CR) 10Majavah: [C: 03+2] Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) (owner: 10D3r1ck01) [12:03:32] all of these are config patches, do any of you patch owners have a preference for going first/last outside the order of the,.... [12:03:36] taavi: Okay! [12:03:40] well, nm, it's happening already :-) [12:04:08] (03Merged) 10jenkins-bot: Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) (owner: 10D3r1ck01) [12:04:30] xSavitar: your patch is available for testing on mwdebug1001 [12:04:38] Let me test now... [12:04:47] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10BTullis) Thanks @akosiaris. @razzi - it occurs to me that we need to create these machines outside of the analytics VLAN in order to make use of the LVS load-balancing. I dont see an issu... 
[12:05:10] nn1l2: (and everyone else) the problem is that deployers here running the window are not really on the hook to be code reviewers as well; thus someone else should have done it, otherwise we are really talking about the equivalent of self merge, which is frowned upon for mw (including config as far as I know) [12:05:26] so I would urge people for the future to get that handled for future patches. [12:05:48] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10BTullis) [12:06:20] this is different than an actual backport where the patch has been through review to make it into an extension or core main branch already. [12:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20514 and previous config saved to /var/cache/conftool/dbconfig/20220210-120701-marostegui.json [12:07:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:07:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:07] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [12:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:07:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:07:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:07:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20515 and previous config saved to /var/cache/conftool/dbconfig/20220210-120729-marostegui.json [12:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:08] (03CR) 10Volans: "Thanks for working on this! Looks mostly good, some minor nits/suggestions inline." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [12:08:54] It was my habit in my first days as you can see at https://phabricator.wikimedia.org/T296154#7532197 and https://phabricator.wikimedia.org/T296154#7539819, but I saw that other people don't wait for code review and schedule their patched directly. 
I asked here if that is OK and I was told yes, so I developed a new habit :) [12:09:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:09:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20516 and previous config saved to /var/cache/conftool/dbconfig/20220210-120941-marostegui.json [12:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:35] All in all, deployers should reach a consensus among themselves and then inform us and we will obey [12:10:43] yes, we should, heh [12:10:56] maybe I'll punt it to releng and ask them to decide something [12:10:59] for routine config changes (logo changes, namespaces, user rights, ...) I don't have an issue with having to review them when deploying as I'm familiar with the site req process, but I guess not every deployer is [12:11:18] I myself can never self-service due to legal reasons, WMF wont accept my NDA because of my place of residence [12:11:23] nope, and some config changes aren't those basic ones [12:11:43] that really sucks,. nn1l2 [12:11:54] I mean, there are deployers here who are happy to assist regardless [12:12:06] but it sucks that you can't get those rights because of government b.s. [12:13:20] yes, sucks! bullshit Iranian-American feud [12:13:21] xSavitar: hey, how's testing going? [12:13:33] taavi: Still on it. 
The wiki account needs confirmation for the form to work correctly [12:13:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298554)', diff saved to https://phabricator.wikimedia.org/P20517 and previous config saved to /var/cache/conftool/dbconfig/20220210-121336-ladsgroup.json [12:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:41] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [12:13:51] Give me a few mins and I'll be done. I'm also coordinating with my manager that is checking the account: https://meta.wikimedia.org/wiki/User:Chapthorgs [12:14:31] ok, cool [12:16:01] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/761611 (owner: 10L10n-bot) [12:18:25] !log echo "https://query.wikidata.org/" | mwscript purgeList.php # T301457 [12:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:30] T301457: Query service down - https://phabricator.wikimedia.org/T301457 [12:21:21] xSavitar: hi! any updates? [12:21:33] We are concluding, just a few more mins please [12:21:37] please :) [12:23:43] !log installing pillow security updates [12:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:18] taavi: Yes, everything is working now correctly. [12:24:21] \o/ [12:24:24] Thank you very much for the patience :) [12:24:33] great! 
syncing [12:24:38] Please do :) [12:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P20518 and previous config saved to /var/cache/conftool/dbconfig/20220210-122446-marostegui.json [12:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:34] !log taavi@deploy1002 Synchronized wmf-config/MetaContactPages.php: Config: [[gerrit:748120|Define a contact form for Chapter/Thorg application status (T298024)]] (duration: 00m 50s) [12:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:38] T298024: Create contact page form on Meta-Wiki for Chapter/Thematic org status application - https://phabricator.wikimedia.org/T298024 [12:25:44] zabe: yours are up next! any ordering requirements? [12:26:04] (03PS4) 10Majavah: MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [12:26:13] I guess in the order they're in gerrit and on the wikitech list? [12:26:15] taavi: I see the form live. Thank you very much! 
[12:26:21] yes [12:26:24] xSavitar: you're welcome [12:26:32] (03CR) 10Majavah: [C: 03+2] MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [12:27:23] (03Merged) 10jenkins-bot: MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [12:27:46] ok, first patch is now available for testing on mwdebug1001 [12:28:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20519 and previous config saved to /var/cache/conftool/dbconfig/20220210-122841-ladsgroup.json [12:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:59] taavi: lgtm, ombuds.wm.o is now redirecting to ombudsmen.wm.o [12:29:42] ack [12:30:17] (03PS3) 10Majavah: InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756735 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [12:30:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:24] !log taavi@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:756734|MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org (T273323)]] (duration: 00m 49s) [12:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:29] T273323: Rename private "ombudsmenwiki" to "ombudswiki" and change the logo - https://phabricator.wikimedia.org/T273323 [12:30:37] (03CR) 10Majavah: [C: 03+2] InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/756735 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [12:31:21] (03Merged) 10jenkins-bot: InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756735 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [12:31:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:31:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:41] and the second one is on mwdebug1001 too [12:32:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:58] and now the wiki is showing up at ombuds.wm.o and ombudsmen.wm.o redirects [12:33:00] lgtm [12:33:05] awesome! 
[12:33:12] syncing [12:33:59] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756735|InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org (T273323)]] (duration: 00m 49s) [12:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:17] (03PS2) 10Majavah: banwikisource: Fix logo size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761605 (https://phabricator.wikimedia.org/T296459) (owner: 104nn1l2) [12:34:23] nn1l2: yours is up next, finally [12:34:30] thanks [12:34:48] (03CR) 10Majavah: [C: 03+2] banwikisource: Fix logo size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761605 (https://phabricator.wikimedia.org/T296459) (owner: 104nn1l2) [12:35:28] (03Merged) 10jenkins-bot: banwikisource: Fix logo size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761605 (https://phabricator.wikimedia.org/T296459) (owner: 104nn1l2) [12:35:55] nn1l2: pulled to mwdebug1001, can you test it? [12:36:02] ok [12:36:28] ... 
and now it's actually on mwdebug1001 [12:36:31] forgot a git rebase [12:36:36] LGTM [12:37:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:02] !log taavi@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:761605|banwikisource: Fix logo size (T296459)]] (duration: 00m 50s) [12:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:06] T296459: Requesting logo change for ban.wikisource.org - https://phabricator.wikimedia.org/T296459 [12:38:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:38:57] !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:761605|banwikisource: Fix logo size (T296459)]] (duration: 00m 49s) [12:38:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:46] !log purge banwikisource logos via purgeList.php T296459 [12:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:51] !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:761605|banwikisource: Fix logo size (T296459)]] (duration: 00m 49s) [12:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P20520 and previous config saved to /var/cache/conftool/dbconfig/20220210-123951-marostegui.json [12:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] I think that's everything then? [12:40:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:40:20] there's no new patches snuck in so yeah this would be the end of the window [12:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:43] !log UTC morning deploys done [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:54] see y'all next time! [12:41:08] thanks to everyone for showing up and doing their thing [12:41:14] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: decom prometheus[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/761591 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [12:41:20] (03PS2) 10Filippo Giunchedi: hieradata: decom prometheus[12]003 [puppet] - 10https://gerrit.wikimedia.org/r/761591 (https://phabricator.wikimedia.org/T296199) [12:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20521 and previous config saved to /var/cache/conftool/dbconfig/20220210-124346-ladsgroup.json [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:46:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33691/console" [puppet] - 10https://gerrit.wikimedia.org/r/761587 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:47:29] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kubernetes::node: avoid iptables alternatives for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/761587 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:47:32] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:47:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:51] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add ml-serve2006 to the ml-serve-codfw k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/761584 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:48:49] !log printf '%s\n' 'https://query.wikidata.org/index.html' 'https://query.wikidata.org/embed.html' | mwscript purgeList.php # T301457 just in case [12:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:54] T301457: Query service down - https://phabricator.wikimedia.org/T301457 [12:49:51] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus2003.codfw.wmnet [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:56] (03CR) 10Abijeet Patro: [V: 03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/761611 (owner: 10L10n-bot) [12:50:03] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: node: join: add support for nodeset query syntax [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761618 (https://phabricator.wikimedia.org/T298948) [12:50:09] (03CR) 10ArielGlenn: [C: 03+2] don't allow api jobs to have the same name [dumps] - 10https://gerrit.wikimedia.org/r/761600 (https://phabricator.wikimedia.org/T301373) (owner: 10ArielGlenn) [12:50:22] !log installing apr security updates [12:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:50] (03Merged) 10jenkins-bot: don't allow api jobs to have the same name [dumps] - 10https://gerrit.wikimedia.org/r/761600 (https://phabricator.wikimedia.org/T301373) (owner: 10ArielGlenn) [12:50:58] (03PS1) 10Elukey: Add ml-serve2006 to the ml-serve-codfw k8s cluster's BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761619 [12:54:12] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:54:14] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: node: join: add support for nodeset query syntax [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761618 (https://phabricator.wikimedia.org/T298948) [12:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20522 and previous config saved to /var/cache/conftool/dbconfig/20220210-125456-marostegui.json [12:54:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet 
with reason: Maintenance [12:54:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:01] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [12:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300382)', diff saved to https://phabricator.wikimedia.org/P20523 and previous config saved to /var/cache/conftool/dbconfig/20220210-125503-marostegui.json [12:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:51] (03PS1) 10Elukey: profile::rsyslog::kubernetes: add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/761620 (https://phabricator.wikimedia.org/T300744) [12:58:47] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts prometheus2003.codfw.wmnet [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298554)', diff saved to https://phabricator.wikimedia.org/P20524 and previous config saved to /var/cache/conftool/dbconfig/20220210-125850-ladsgroup.json [12:58:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:58:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:55] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298554 [12:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:04] (03CR) 10Elukey: [C: 03+2] profile::rsyslog::kubernetes: add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/761620 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:59:16] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus2003.codfw.wmnet [12:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:17] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: node: join: add support for nodeset query syntax [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761618 (https://phabricator.wikimedia.org/T298948) [13:05:14] PROBLEM - Check systemd state on kubernetes1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300382)', diff saved to https://phabricator.wikimedia.org/P20525 and previous config saved to /var/cache/conftool/dbconfig/20220210-130818-marostegui.json [13:08:21] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus2003.codfw.wmnet [13:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:25] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [13:08:26] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on 
English Wikipedia responds with unexpected value at path /protection = Missing keys: [ed [13:08:26] ] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:08:27] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [13:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:17] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus1003.eqiad.wmnet [13:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:30] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:02] RECOVERY - Check systemd state on kubernetes1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:50] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:13:26] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [13:13:42] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [13:13:56] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [13:14:06] this is a new node under testing, but it should be donwtimed [13:14:27] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [13:15:27] (KubernetesRsyslogDown) firing: 
rsyslog on ml-serve2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [13:15:43] elukey: yeah the downtime cookbook doesn't support setting alertmanager silences, it is WIP [13:16:31] godog: I used the icinga-downtime on alert1001, I just realized that it is not enough.. how should I add downtime? (never done it) [13:16:35] via alert manager's ui? [13:17:15] elukey: yeah that's correct [13:17:44] check out https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements and let me know if sth isn't clear [13:18:01] also I just noticed that 'instance' for k8s has FQDNs not hostnames ;_; [13:19:16] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [13:19:45] (03CR) 10Jbond: "lgtm but see inline for questions" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:20:24] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) >>! In T293209#7675048, @fgiunchedi wrote: >>>! In T293209#7670485, @Volans wrote: >> I had a chat with @jbond about this yesterday, putting... 
[13:21:19] added thanks :) [13:21:34] sure np [13:21:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: node: join: add support for nodeset query syntax [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761618 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [13:22:49] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus1003.eqiad.wmnet [13:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:14] (03PS2) 10Majavah: P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS [puppet] - 10https://gerrit.wikimedia.org/r/761606 [13:23:22] 10ops-codfw, 10decommission-hardware: decommission prometheus2003.codfw.wmnet - https://phabricator.wikimedia.org/T301465 (10fgiunchedi) [13:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P20526 and previous config saved to /var/cache/conftool/dbconfig/20220210-132323-marostegui.json [13:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:15] 10ops-eqiad, 10decommission-hardware: decommission prometheus1003.eqiad.wmnet - https://phabricator.wikimedia.org/T301466 (10fgiunchedi) [13:24:36] (03CR) 10Majavah: P:openstack::cumin::target: redefine Ferm $CUMIN_MASTERS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:25:43] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:27:21] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: silence cronjob emails [puppet] - 10https://gerrit.wikimedia.org/r/761625 [13:28:03] (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated-tests: silence cronjob emails [puppet] - 
10https://gerrit.wikimedia.org/r/761625 [13:28:37] !log installing lxml security updates [13:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:11] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: silence cronjob emails [puppet] - 10https://gerrit.wikimedia.org/r/761625 (owner: 10Arturo Borrero Gonzalez) [13:30:15] (03CR) 10Majavah: "PCC, failures seem unexpected: https://puppet-compiler.wmflabs.org/pcc-worker1001/33695/" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:32:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33696/console" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:34:55] (03PS1) 10KartikMistry: Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) [13:34:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33697/console" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:35:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33698/console" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:37:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33699/console" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [13:38:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to 
https://phabricator.wikimedia.org/P20527 and previous config saved to /var/cache/conftool/dbconfig/20220210-133827-marostegui.json [13:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:18] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet-catalog-compiler: compilation result randomly places servers in the 'failed' section - https://phabricator.wikimedia.org/T224977 (10jbond) Still seem to be having issues here, in the following deployment-deploy03.deployment-prep.e... [13:42:45] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, and 2 others: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10lmata) a:03lmata Pulling this in as it's to reflect its actual progress. [13:50:23] !log installing apache security updates on otrs1001/ticket.wikimedia.org [13:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:33] (03CR) 10Jforrester: "recheck" [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761413 (https://phabricator.wikimedia.org/T301309) (owner: 10Ladsgroup) [13:53:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300382)', diff saved to https://phabricator.wikimedia.org/P20529 and previous config saved to /var/cache/conftool/dbconfig/20220210-135332-marostegui.json [13:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:38] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [13:53:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [13:53:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [13:53:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on 10 hosts with reason: Maintenance [13:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [13:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:54:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20530 and previous config saved to /var/cache/conftool/dbconfig/20220210-135411-marostegui.json [13:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:31] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10dcausse) For some production jobs we still use the proxy to access: - MW APIs (all our sites) - ores.wikimedia.org For de... 
[13:58:21] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:34] !log installing apache security updates on phab1001/phabricator.wikimedia.org [14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:04:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298554)', diff saved to https://phabricator.wikimedia.org/P20531 and previous config saved to /var/cache/conftool/dbconfig/20220210-140500-ladsgroup.json [14:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:05] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:05:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20532 and previous config saved to /var/cache/conftool/dbconfig/20220210-140525-marostegui.json [14:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:30] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [14:05:47] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) > MW APIs (all our sites) BTW, the 
proper way to access MW APIs from within our networks is to use e.g. https://... [14:08:40] (03CR) 10JMeybohm: [C: 03+1] Add ml-serve2006 to the ml-serve-codfw k8s cluster's BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761619 (owner: 10Elukey) [14:08:59] (03CR) 10Elukey: [C: 03+2] Add ml-serve2006 to the ml-serve-codfw k8s cluster's BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761619 (owner: 10Elukey) [14:10:31] !log `elukey@cumin1001:~$ homer 'cr*codfw*' commit "Add ml-serve2006 to the k8s ml-serve-codfw cluster's neighbors"` [14:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:23] (03PS1) 10Jbond: puppet_compiler: add filter [puppet] - 10https://gerrit.wikimedia.org/r/761633 [14:14:22] (03PS1) 10Ladsgroup: db2138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/761634 (https://phabricator.wikimedia.org/T300510) [14:14:57] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve2006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [14:15:53] (03PS2) 10Ladsgroup: db2138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/761634 (https://phabricator.wikimedia.org/T300510) [14:16:13] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/761634 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [14:16:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:23] (03CR) 10Jbond: [C: 03+2] puppet_compiler: add filter 
[puppet] - 10https://gerrit.wikimedia.org/r/761633 (owner: 10Jbond) [14:19:38] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve2005.codfw.wmnet [14:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:43] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve2006.codfw.wmnet [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:47] (03PS1) 10Zabe: Remove ombudsmen.wikimedia.org from mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/761635 (https://phabricator.wikimedia.org/T273323) [14:20:21] (03CR) 10Volans: "Looks mostly ok to me, replies inline and a couple of comments on the test file." [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [14:20:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P20533 and previous config saved to /var/cache/conftool/dbconfig/20220210-142030-marostegui.json [14:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] (03PS1) 10Elukey: Add bgp/confd conf for ml-serve2006 [puppet] - 10https://gerrit.wikimedia.org/r/761636 [14:22:46] (03CR) 10Elukey: [C: 03+2] Add bgp/confd conf for ml-serve2006 [puppet] - 10https://gerrit.wikimedia.org/r/761636 (owner: 10Elukey) [14:23:12] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve2006.codfw.wmnet [14:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:49] 10SRE, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10Michaelcochez) [14:25:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 84, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:26:21] 
RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 117, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:29:27] (KubernetesCalicoDown) resolved: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:31:18] (03PS1) 10Majavah: P:wmcs::services::ntp: filter out self on peers list [puppet] - 10https://gerrit.wikimedia.org/r/761637 [14:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P20534 and previous config saved to /var/cache/conftool/dbconfig/20220210-143535-marostegui.json [14:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:43] (JobUnavailable) firing: (5) Reduced availability for job gitlab in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:37:42] (03PS1) 10Jbond: pcc_uploader: ensure we update the puppet repo before refreshing [puppet] - 10https://gerrit.wikimedia.org/r/761638 [14:39:06] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10akosiaris) >>! In T300977#7700725, @Ottomata wrote: >> MW APIs (all our sites) > BTW, the proper way to access MW APIs fro... [14:39:25] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM but will leave for someone from wmcs to +1/merge" [puppet] - 10https://gerrit.wikimedia.org/r/761606 (owner: 10Majavah) [14:39:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I don't love the idea of multiple logging for the various releases in an environment, but I'm ok with it for now tbh." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [14:40:53] (03PS8) 10Giuseppe Lavagetto: apache: Replace zero.wikipedia.org vhost alias with redirect [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [14:41:58] (03CR) 10Jbond: [C: 03+2] pcc_uploader: ensure we update the puppet repo before refreshing [puppet] - 10https://gerrit.wikimedia.org/r/761638 (owner: 10Jbond) [14:44:03] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) > avoiding a SPOF (there aren't that many web proxies nor is it a highly available setup cause there isn't any n... [14:44:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apache: Replace zero.wikipedia.org vhost alias with redirect [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [14:47:08] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) > This has bitten me before when I used to use the webproxy internally. Don't do it! :) Its worth mentioning that... 
[14:48:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [14:48:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [14:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138 (T300510)', diff saved to https://phabricator.wikimedia.org/P20535 and previous config saved to /var/cache/conftool/dbconfig/20220210-144913-ladsgroup.json [14:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:18] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [14:50:10] (03PS1) 10Jbond: pcc_facts_processor: use f-string [puppet] - 10https://gerrit.wikimedia.org/r/761640 [14:50:28] (03PS1) 10Giuseppe Lavagetto: httpbb: fix redirect match [puppet] - 10https://gerrit.wikimedia.org/r/761642 [14:50:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300382)', diff saved to https://phabricator.wikimedia.org/P20536 and previous config saved to /var/cache/conftool/dbconfig/20220210-145040-marostegui.json [14:50:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:50:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:46] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [14:50:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300382)', diff saved to https://phabricator.wikimedia.org/P20537 
and previous config saved to /var/cache/conftool/dbconfig/20220210-145047-marostegui.json [14:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:56] (03CR) 10Jbond: [C: 03+2] pcc_facts_processor: use f-string [puppet] - 10https://gerrit.wikimedia.org/r/761640 (owner: 10Jbond) [14:51:20] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpbb: fix redirect match [puppet] - 10https://gerrit.wikimedia.org/r/761642 (owner: 10Giuseppe Lavagetto) [14:51:38] (03CR) 10JMeybohm: [C: 04-1] "I'm able to produce errors like the following when running all task over all charts:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [14:52:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: Update tests to reflect rename from ombudsmen to ombuds [puppet] - 10https://gerrit.wikimedia.org/r/761009 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [14:53:05] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:39] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:55:12] jouncebot: nowandnext [14:55:13] No deployments scheduled for the next 2 hour(s) and 4 minute(s) [14:55:13] In 2 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1700) [14:55:18] oof, nice [14:55:36] (03CR) 10Ladsgroup: [C: 03+2] DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761413 (https://phabricator.wikimedia.org/T301309) 
(owner: 10Ladsgroup) [14:55:39] (03CR) 10Ladsgroup: [C: 03+2] DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761414 (https://phabricator.wikimedia.org/T301309) (owner: 10Ladsgroup) [14:56:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2138.codfw.wmnet with OS bullseye [14:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:49] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:38] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:00] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:52] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: remove tests that fail under k8s [puppet] - 10https://gerrit.wikimedia.org/r/755529 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [15:02:14] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, 10User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10AUgolnikova-WMF) When trying to access superset with my wikitech account, I get "service access denied due to missing p... 
[15:02:56] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, 10User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10AUgolnikova-WMF) 05Resolved→03Open [15:03:05] (03PS1) 10Jbond: pcc_facts_processor: fully qualify git command [puppet] - 10https://gerrit.wikimedia.org/r/761643 [15:03:09] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33700/console" [puppet] - 10https://gerrit.wikimedia.org/r/761011 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [15:03:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] pcc_facts_processor: fully qualify git command [puppet] - 10https://gerrit.wikimedia.org/r/761643 (owner: 10Jbond) [15:03:53] (03PS1) 10Giuseppe Lavagetto: httpbb: also remove nonexistent from the test_suite declaration [puppet] - 10https://gerrit.wikimedia.org/r/761644 [15:06:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: also remove nonexistent from the test_suite declaration [puppet] - 10https://gerrit.wikimedia.org/r/761644 (owner: 10Giuseppe Lavagetto) [15:06:34] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] Add redirect from ombudsmen.wm.o to ombuds.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/761011 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [15:06:43] (03PS3) 10Giuseppe Lavagetto: Add redirect from ombudsmen.wm.o to ombuds.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/761011 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [15:06:47] (03PS1) 10Jbond: pcc_facts_processor: need to split the command [puppet] - 10https://gerrit.wikimedia.org/r/761645 [15:06:57] (03CR) 10Jbond: [V: 03+2 C: 03+2] pcc_facts_processor: need to split the command [puppet] - 10https://gerrit.wikimedia.org/r/761645 (owner: 10Jbond) [15:10:30] 10SRE, 10WMDE-Technical-Wishes-Maintenance, 10serviceops: Migrate kartotherian production service to node12 - 
https://phabricator.wikimedia.org/T301475 (10awight) [15:10:38] 10SRE, 10WMDE-Technical-Wishes-Maintenance, 10serviceops: Migrate geoshapes production service to node12 - https://phabricator.wikimedia.org/T301476 (10awight) [15:10:43] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:32] (03Merged) 10jenkins-bot: DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761413 (https://phabricator.wikimedia.org/T301309) (owner: 10Ladsgroup) [15:11:38] (03Merged) 10jenkins-bot: DerivedPageDataUpdater: Set ParserOutput when it's passed to it [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761414 (https://phabricator.wikimedia.org/T301309) (owner: 10Ladsgroup) [15:12:14] (03CR) 10JMeybohm: [C: 04-1] Refactor Rakefile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [15:12:19] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10awight) [15:12:31] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:43] (03CR) 10Bking: "Per IRC conversation, abandoning this patch. Will work around puppet failures with cloud-init for now." 
[labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [15:12:48] (03Abandoned) 10Bking: deployment-prep: add search team SSH pub keys [labs/private] - 10https://gerrit.wikimedia.org/r/761465 (https://phabricator.wikimedia.org/T301408) (owner: 10Bking) [15:14:31] 10SRE, 10WMDE-Technical-Wishes-Maintenance, 10serviceops: Migrate kartotherian production service to node12 - https://phabricator.wikimedia.org/T301475 (10awight) [15:15:43] (JobUnavailable) firing: (5) Reduced availability for job gitlab in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:16:45] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:35] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:43] (03PS1) 10Jelto: gitlab_runner: export metrics on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/761649 (https://phabricator.wikimedia.org/T300816) [15:19:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298554)', diff saved to https://phabricator.wikimedia.org/P20538 and previous config saved to /var/cache/conftool/dbconfig/20220210-151903-ladsgroup.json [15:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:09] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [15:19:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:27] !log oblivian@deploy1002 
helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:31] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) Hahah, maybe what we should do is excludelist the internal domains in the webproxy! [15:20:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:20:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply on pinkunicorn [15:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:20] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33701/console" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:21:43] 10Puppet, 10Infrastructure-Foundations, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10ovasileva) [15:21:57] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:22:33] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33702/console" [puppet] - 10https://gerrit.wikimedia.org/r/761649 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [15:23:53] (03CR) 10JMeybohm: [C: 04-1] Refactor Rakefile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [15:25:05] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: export metrics on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/761649 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [15:25:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:26:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:05] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408 (10bking) Closing, as we've found a workaround to enable access. 
[15:27:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:25] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408 (10bking) 05Open→03Resolved [15:29:41] (03CR) 10Herron: [C: 03+1] "Good idea, was thinking about how to monitor this too" [alerts] - 10https://gerrit.wikimedia.org/r/761604 (https://phabricator.wikimedia.org/T301376) (owner: 10Filippo Giunchedi) [15:31:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2138.codfw.wmnet with OS bullseye [15:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:38] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.21/includes/Storage/DerivedPageDataUpdater.php: Backport: [[gerrit:761414|DerivedPageDataUpdater: Set ParserOutput when it's passed to it (T301309)]] (duration: 00m 53s) [15:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:43] T301309: Refreshlinks job is parsing pages twice - https://phabricator.wikimedia.org/T301309 [15:34:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20541 and previous config saved to /var/cache/conftool/dbconfig/20220210-153408-ladsgroup.json [15:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:43] (JobUnavailable) firing: (3) Reduced availability for job gitlab in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:36:41] (03PS3) 10Herron: watchrat: route alerts to irc and noc@ [puppet] - 10https://gerrit.wikimedia.org/r/761064 
(https://phabricator.wikimedia.org/T299147) [15:39:24] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.20/includes/Storage/DerivedPageDataUpdater.php: Backport: [[gerrit:761413|DerivedPageDataUpdater: Set ParserOutput when it's passed to it (T301309)]] (duration: 00m 50s) [15:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:29] T301309: Refreshlinks job is parsing pages twice - https://phabricator.wikimedia.org/T301309 [15:40:17] (03CR) 10Herron: [C: 03+2] watchrat: route alerts to irc and noc@ [puppet] - 10https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [15:41:45] (03PS3) 10Herron: watchrat: route donate.wm.o alerts to fr-ircmail [puppet] - 10https://gerrit.wikimedia.org/r/761403 (https://phabricator.wikimedia.org/T299147) [15:43:29] (03CR) 10Herron: [C: 03+2] watchrat: route donate.wm.o alerts to fr-ircmail [puppet] - 10https://gerrit.wikimedia.org/r/761403 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [15:44:09] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10mpopov) A couple of questions/comments: >>! In T300977#7700842, @jbond wrote: > Its worth mentioning that when i took a... 
[15:49:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20542 and previous config saved to /var/cache/conftool/dbconfig/20220210-154913-ladsgroup.json [15:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:04] (JobUnavailable) firing: (4) Reduced availability for job gitlab in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [15:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300382)', diff saved to https://phabricator.wikimedia.org/P20543 and previous config saved to /var/cache/conftool/dbconfig/20220210-155106-marostegui.json [15:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:12] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [15:51:21] (03PS12) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [15:53:11] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:53:28] (03PS1) 10Jelto: gitlab_runner: make exporter_listen_address depending on ::realm [puppet] - 10https://gerrit.wikimedia.org/r/761658 (https://phabricator.wikimedia.org/T300816) [15:56:12] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33703/console" [puppet] - 10https://gerrit.wikimedia.org/r/761658 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [16:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20544 and previous config saved to 
/var/cache/conftool/dbconfig/20220210-160003-ladsgroup.json [16:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:08] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [16:00:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T300510)', diff saved to https://phabricator.wikimedia.org/P20545 and previous config saved to /var/cache/conftool/dbconfig/20220210-160046-ladsgroup.json [16:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:00] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: make exporter_listen_address depending on ::realm [puppet] - 10https://gerrit.wikimedia.org/r/761658 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [16:01:10] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10EBernhardson) > Oh and ORES is also available under https://ores.discovery.wmnet (and it's the exact same service!) Doesn... [16:01:32] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10mpopov) > **First**: How difficult & how much overhead would it be to make the proxy redirect requests made to internal do... 
[16:01:59] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:03:25] (03CR) 10Giuseppe Lavagetto: Refactor Rakefile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [16:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298554)', diff saved to https://phabricator.wikimedia.org/P20546 and previous config saved to /var/cache/conftool/dbconfig/20220210-160417-ladsgroup.json [16:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:25] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:05:07] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) > Is the intention to allow us to talk to prod in a more general fashion then? I think so, see the parent ticket... 
[16:05:44] jouncebot: now [16:05:44] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [16:06:10] I'll deploy a quick mw-config change then [16:06:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P20547 and previous config saved to /var/cache/conftool/dbconfig/20220210-160611-marostegui.json [16:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:39] (03PS2) 10Ppchelko: Add PHP array default settings loader benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761433 (https://phabricator.wikimedia.org/T300129) [16:06:45] (03CR) 10Ppchelko: [C: 03+2] Add PHP array default settings loader benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761433 (https://phabricator.wikimedia.org/T300129) (owner: 10Ppchelko) [16:07:25] (03Merged) 10jenkins-bot: Add PHP array default settings loader benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761433 (https://phabricator.wikimedia.org/T300129) (owner: 10Ppchelko) [16:07:46] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) [16:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:32] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) (duration: 00m 46s) [16:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:27] !log ppchelko@deploy1002 Synchronized w/tmp_settings_bench.php: Config: gerrit 761433 settings benchmark - measure new static php array config load (duration: 00m 49s) [16:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:57] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) [16:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:49] 10SRE, 
10serviceops, 10GitLab (Infrastructure), 10Patch-For-Review: gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Jelto) 05Open→03Resolved Metrics of trusted runners are fixed. GitLab seems to automagically parse the runners address from the request/register flow. With IP... [16:12:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:13:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:17] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) (duration: 04m 19s) [16:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:44] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [16:14:44] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) [16:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:01] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) ganeti1021 has been updated [16:15:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:15:05] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article 
on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P20548 and previous config saved to /var/cache/conftool/dbconfig/20220210-162115-marostegui.json [16:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:33] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) (duration: 07m 49s) [16:22:35] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) [16:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:46] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@66d6cad]: (no justification provided) (duration: 00m 11s) [16:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:26] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10Dzahn) I am not familiar with how the Google Tools were setup. I just did that as an access request because I happened to be on clinic duty that week and clicked a lot of times on "... [16:29:49] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) > First: How difficult & how much overhead would it be to make the proxy redirect requests made to internal domains... 
[16:33:16] (03PS2) 10Zabe: Remove ombudsmen.wikimedia.org from mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/761635 (https://phabricator.wikimedia.org/T273323) [16:34:29] (03PS2) 10Cathal Mooney: Adding includes for Netbox-generated zone files for new eqiad subnets [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) [16:35:05] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Reverse DNS zones includes for drmrs - https://phabricator.wikimedia.org/T301447 (10cmooney) Added to patch for new Eqiad includes: https://gerrit.wikimedia.org/r/c/operations/dns/+/761473 [16:35:38] (03PS1) 10Dzahn: DHCP: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761661 [16:35:54] (03CR) 10Jbond: gitlab_runner: execute gitlab-runner as non-root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:36:12] (03PS1) 10Dzahn: site: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761662 [16:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300382)', diff saved to https://phabricator.wikimedia.org/P20549 and previous config saved to /var/cache/conftool/dbconfig/20220210-163620-marostegui.json [16:36:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:36:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:36:26] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [16:36:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300382)', diff saved to https://phabricator.wikimedia.org/P20550 and previous config saved to /var/cache/conftool/dbconfig/20220210-163633-marostegui.json [16:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:02] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@5b6ba8e]: (no justification provided) [16:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:11] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@5b6ba8e]: (no justification provided) (duration: 00m 08s) [16:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300382)', diff saved to https://phabricator.wikimedia.org/P20551 and previous config saved to /var/cache/conftool/dbconfig/20220210-163746-marostegui.json [16:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:43] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10Cmjohnson) @MoritzMuehlenhoff Try installing now, I re-arranged the disks, I'd imagine an install should go without an issue now and then we can re-evaluate for a failed disk. Currently, the h/w logs do no... 
[16:47:03] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:47:23] (03PS3) 10Cathal Mooney: Adding includes for Netbox-generated zone files for new eqiad subnets [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) [16:47:45] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:26] !log otto@deploy1002 Started deploy [airflow-dags/analytics@5b6ba8e]: (no justification provided) [16:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:12] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@5b6ba8e]: (no justification provided) (duration: 01m 46s) [16:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:21] !log otto@deploy1002 Started deploy [airflow-dags/analytics@5b6ba8e]: (no justification provided) [16:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:31] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@5b6ba8e]: (no justification provided) (duration: 00m 10s) [16:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:39] (03CR) 10Cathal Mooney: [C: 03+2] Adding includes for Netbox-generated zone files for new eqiad subnets (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/761473 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:52:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) elastic1089 e1 u21 elastic1090 e1 u22 elastic1091 e2 u21 elastic1092 e2 u22 elastic1093 e3 u21 elastic1094 e3 u22 [16:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1158', diff saved to https://phabricator.wikimedia.org/P20552 and previous config saved to /var/cache/conftool/dbconfig/20220210-165250-marostegui.json [16:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:41] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10Jclark-ctr) [16:54:23] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:54:32] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10Jclark-ctr) updated task for new cage Rack E4,F4 will be dedicated for WMCS [16:55:23] (03PS1) 10Hnowlan: changeprop-jobqueue: increase concurrency for backlogged jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/761664 (https://phabricator.wikimedia.org/T300914) [16:56:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbmonitor1002.wikimedia.org [16:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:41] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: increase concurrency for backlogged jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/761664 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [17:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1700). [17:00:05] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:11] (03CR) 10JMeybohm: [C: 03+1] Refactor Rakefile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [17:00:16] o/ [17:00:19] hey, looking [17:00:42] aha, the sequel :) looks good, let me run PCC real quick and then we'll do the same maneuver as yesterday [17:00:46] er, as Tuesday [17:01:08] !log etherpad going down for maintenance [17:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:37] mutante: break a leg! :D [17:02:00] (03PS2) 10Dzahn: site: add etherpad role to etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) [17:02:05] thanks;) [17:02:16] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33705/console" [puppet] - 10https://gerrit.wikimedia.org/r/761635 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [17:03:06] (03CR) 10Dzahn: [C: 03+2] site: add etherpad role to etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [17:03:46] !log rzl@cumin2001:~$ sudo cumin A:mw "disable-puppet T273323" [17:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:52] T273323: Rename private "ombudsmenwiki" to "ombudswiki" and change the logo - https://phabricator.wikimedia.org/T273323 [17:03:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbmonitor1002.wikimedia.org [17:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:06] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) Update diagram for step 2 {F34947662} [17:05:10] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Remove ombudsmen.wikimedia.org from mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/761635 
(https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [17:05:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1009.eqiad.wmnet with OS stretch [17:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:24] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1009.eqiad.wmnet with OS stretch [17:05:48] (03PS1) 10Cwhite: add cwhite to root-authorized-keys [labs/private] - 10https://gerrit.wikimedia.org/r/761668 [17:05:49] (03PS1) 10Dzahn: site: move etherpad1002 back to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/761669 (https://phabricator.wikimedia.org/T300568) [17:06:15] (03PS2) 10Dzahn: switch etherpad.discovery.wmnet to etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/758561 (https://phabricator.wikimedia.org/T300568) [17:06:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1010.eqiad.wmnet with OS stretch [17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1010.eqiad.wmnet with OS stretch [17:06:35] PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:06:43] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: increase concurrency for backlogged jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/761664 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) 
[17:06:43] ^ expected [17:07:31] zabe: done at mwdebug1001, have a look? [17:07:39] yea, well, not really expected but in a way yes:) [17:07:48] looking [17:07:51] downtimed all the other alerts but this prometheus based stuff is separate [17:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P20553 and previous config saved to /var/cache/conftool/dbconfig/20220210-170755-marostegui.json [17:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:14] mutante: oh, my mistake sorry :) [17:08:55] ACKNOWLEDGEMENT - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 daniel_zahn scheduled maintenance https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:08:58] rzl: lgtm, redirect is still working [17:08:59] ACKNOWLEDGEMENT - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 daniel_zahn scheduled maintenance https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:10:05] zabe: rad, re-enabling puppet then [17:10:14] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] "Merging so Cole can look at a codfw1dev puppetmaster issue" [labs/private] - 10https://gerrit.wikimedia.org/r/761668 (owner: 10Cwhite) [17:10:25] ACKNOWLEDGEMENT - etherpad_lite_process_running on etherpad1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js daniel_zahn process name changed in new version https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [17:10:38] process is running but under a new name :) [17:10:50] !log rzl@cumin2001:~$ sudo cumin A:mw "enable-puppet T273323" [17:10:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:54] T273323: Rename private "ombudsmenwiki" to "ombudswiki" and change the logo - https://phabricator.wikimedia.org/T273323 [17:11:09] thanks for your help again :) [17:11:14] thank you! [17:11:17] (03CR) 10Dzahn: [C: 03+2] switch etherpad.discovery.wmnet to etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/758561 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [17:11:18] (03Merged) 10jenkins-bot: changeprop-jobqueue: increase concurrency for backlogged jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/761664 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [17:11:59] btw is there a specific reason you took cumin2001 this time? [17:12:06] (03PS1) 10Ottomata: airflow - Set up research instance and deployment of airflow-dags [puppet] - 10https://gerrit.wikimedia.org/r/761670 (https://phabricator.wikimedia.org/T295380) [17:12:13] (03PS1) 10Ladsgroup: First sweep of clean up of tendril [puppet] - 10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) [17:12:20] ha, good eye :) not particularly, they're both good [17:12:43] I just moved from the east coast to the west coast of the U.S. 
a few months ago, so cumin2001 is a little more pleasant to type on -- lower latency [17:12:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1011.eqiad.wmnet with OS stretch [17:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:48] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1011.eqiad.wmnet with OS stretch [17:12:50] but my muscle memory still fills in 1001 sometimes :) [17:13:13] ah ;) [17:13:43] (03PS1) 10Dzahn: etherpad: fix process monitoring after version upgrade, node->nodejs [puppet] - 10https://gerrit.wikimedia.org/r/761672 (https://phabricator.wikimedia.org/T300568) [17:14:16] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync on production [17:14:18] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [17:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:26] (03CR) 10Dzahn: [C: 03+2] etherpad: fix process monitoring after version upgrade, node->nodejs [puppet] - 10https://gerrit.wikimedia.org/r/761672 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [17:14:39] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33706/console" [puppet] - 10https://gerrit.wikimedia.org/r/761670 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [17:14:44] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production [17:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:09] !log hnowlan@deploy1002 
helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync on production [17:15:10] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [17:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:26] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on production [17:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:42] (03PS2) 10Dzahn: site: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761662 (https://phabricator.wikimedia.org/T300568) [17:16:45] <_joe_> musikanimal: are you aware of the fact etherpad is down? [17:16:53] <_joe_> err mutante not musikanimal sorry [17:16:57] (03PS2) 10Dzahn: DHCP: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761661 (https://phabricator.wikimedia.org/T300568) [17:18:44] https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/HT5NKPKNWSW5RMIW3UPIF7ITSATIQMJL/ [17:19:11] _joe_: ^ [17:19:20] (03CR) 10Ladsgroup: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/33708/db2093.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [17:19:33] <_joe_> zabe: I am aware of the migration [17:19:47] <_joe_> that's why I was asking mutante if he knew it wasn't working right now [17:19:53] ok [17:19:58] sorry [17:20:06] _joe_: yes, I am aware of it [17:20:25] <_joe_> ack good, sorry :) [17:20:30] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:21:45] ^ sorry, I don't know how to schedule downtime on that. I did handle Icinga. 
[17:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300382)', diff saved to https://phabricator.wikimedia.org/P20555 and previous config saved to /var/cache/conftool/dbconfig/20220210-172300-marostegui.json [17:23:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [17:23:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [17:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:05] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [17:23:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300382)', diff saved to https://phabricator.wikimedia.org/P20556 and previous config saved to /var/cache/conftool/dbconfig/20220210-172307-marostegui.json [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:38] (03PS1) 10Ladsgroup: db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/761675 (https://phabricator.wikimedia.org/T300510) [17:26:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:26:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [17:26:31] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [17:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298554)', diff saved to https://phabricator.wikimedia.org/P20557 and previous config saved to /var/cache/conftool/dbconfig/20220210-172635-ladsgroup.json [17:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:40] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [17:28:10] (03PS1) 10Hnowlan: changeprop-jobqueue: increase CPU allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/761677 (https://phabricator.wikimedia.org/T300914) [17:28:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1009.eqiad.wmnet with OS stretch [17:28:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:55] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1009.eqiad.wmnet with OS stretch completed: - ms-fe1009... 
[17:30:15] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: increase CPU allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/761677 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [17:31:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1010.eqiad.wmnet with OS stretch [17:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:29] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1010.eqiad.wmnet with OS stretch completed: - ms-fe1010... [17:31:57] (03PS2) 10Ladsgroup: db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/761675 (https://phabricator.wikimedia.org/T300510) [17:32:01] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/761675 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [17:34:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) [17:35:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) 05Open→03Resolved This has been completed [17:36:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1011.eqiad.wmnet with OS stretch [17:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:30] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 
for host ms-fe1011.eqiad.wmnet with OS stretch completed: - ms-fe1011... [17:37:19] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [17:38:25] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [17:39:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2088.codfw.wmnet with reason: Maintenance [17:39:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2088.codfw.wmnet with reason: Maintenance [17:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2088:3311 (T300510)', diff saved to https://phabricator.wikimedia.org/P20558 and previous config saved to /var/cache/conftool/dbconfig/20220210-173932-ladsgroup.json [17:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:38] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [17:39:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2088:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20559 and previous config saved to /var/cache/conftool/dbconfig/20220210-173957-ladsgroup.json [17:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:09] (03PS1) 10Dzahn: ssl: update cert for etherpad.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/761680 (https://phabricator.wikimedia.org/T300568) [17:40:28] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Cmjohnson) [17:41:07] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -noout -text -in etherpad.discovery.wmnet.crt | grep DNS:" [puppet] - 10https://gerrit.wikimedia.org/r/761680 
(https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [17:41:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2088.codfw.wmnet with OS bullseye [17:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:22] (03PS1) 10Cmjohnson: updating site.pp and netboot.cfg for new restbase103[123] servers [puppet] - 10https://gerrit.wikimedia.org/r/761681 (https://phabricator.wikimedia.org/T294372) [17:44:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298554)', diff saved to https://phabricator.wikimedia.org/P20560 and previous config saved to /var/cache/conftool/dbconfig/20220210-174438-ladsgroup.json [17:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:42] (03PS1) 10AOkoth: admin: add jnuche to users [puppet] - 10https://gerrit.wikimedia.org/r/761682 (https://phabricator.wikimedia.org/T301241) [17:44:43] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [17:46:22] (03PS1) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to overload nfs server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [17:46:43] 10SRE, 10SRE Observability (FY2021/2022-Q3): DX App Synthetic Monitoring App - watchmouse alert flapping due to CA expiration - https://phabricator.wikimedia.org/T292603 (10herron) 05In progress→03Resolved These checks have been disabled, please see T299147 for additional details [17:47:13] (03PS2) 10Cmjohnson: updating site.pp and netboot.cfg for new restbase103[123] servers [puppet] - 10https://gerrit.wikimedia.org/r/761681 (https://phabricator.wikimedia.org/T294372) [17:48:36] (03CR) 10Andrew Bogott: "PCC results (no-op as intended) https://puppet-compiler.wmflabs.org/pcc-worker1002/33709/" [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew 
Bogott) [17:49:55] (03PS2) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [17:50:25] (03PS2) 10AOkoth: admin: add jnuche to users [puppet] - 10https://gerrit.wikimedia.org/r/761682 (https://phabricator.wikimedia.org/T301241) [17:50:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [17:50:30] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:50:46] (03CR) 10Cmjohnson: [C: 03+2] updating site.pp and netboot.cfg for new restbase103[123] servers [puppet] - 10https://gerrit.wikimedia.org/r/761681 (https://phabricator.wikimedia.org/T294372) (owner: 10Cmjohnson) [17:51:06] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@4312bc3] (eqiad): Update kartotherian-package to dd11f2d [17:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [17:52:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, and 2 others: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10Cmjohnson) [17:54:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Cmjohnson) @MatthewVernon @fgiunchedi @wiki_willy ms-fe1009-1011 are yours, ms-fe1012 is in the 
new cage and still be used for testing at the moment, once r... [17:54:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300775)', diff saved to https://phabricator.wikimedia.org/P20561 and previous config saved to /var/cache/conftool/dbconfig/20220210-175450-marostegui.json [17:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:56] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [17:54:59] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated tests: introduce testcase for different executable paths [puppet] - 10https://gerrit.wikimedia.org/r/761684 (https://phabricator.wikimedia.org/T284767) [17:55:30] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:57:05] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@4312bc3] (eqiad): Update kartotherian-package to dd11f2d (duration: 05m 59s) [17:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:07] (03CR) 10RLazarus: [C: 03+1] admin: add jnuche to users [puppet] - 10https://gerrit.wikimedia.org/r/761682 (https://phabricator.wikimedia.org/T301241) (owner: 10AOkoth) [17:58:45] (03PS3) 10AOkoth: admin: add jnuche to users [puppet] - 10https://gerrit.wikimedia.org/r/761682 (https://phabricator.wikimedia.org/T301241) [17:59:06] (03PS3) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [17:59:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20562 and previous config saved to /var/cache/conftool/dbconfig/20220210-175942-ladsgroup.json 
[17:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:46] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1800). [18:00:57] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@4312bc3] (eqiad): Update kartotherian-package to dd11f2d [18:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:33] (03PS4) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [18:02:10] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:03:26] (03CR) 10AOkoth: [C: 03+2] admin: add jnuche to users [puppet] - 10https://gerrit.wikimedia.org/r/761682 (https://phabricator.wikimedia.org/T301241) (owner: 10AOkoth) [18:04:41] (03PS5) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [18:05:29] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:06:32] (03PS6) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 
(https://phabricator.wikimedia.org/T301280) [18:06:55] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@4312bc3] (eqiad): Update kartotherian-package to dd11f2d (duration: 05m 58s) [18:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:10] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:07:20] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@5699db7] (eqiad): Remove unused kartotherian-layermixer reference [18:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:35] (03PS7) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [18:09:13] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:09:52] (03PS8) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [18:09:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P20563 and previous config saved to /var/cache/conftool/dbconfig/20220210-180955-marostegui.json [18:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2088.codfw.wmnet with OS bullseye [18:10:30] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 
10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:09] (03PS9) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) [18:12:12] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@5699db7] (eqiad): Remove unused kartotherian-layermixer reference (duration: 04m 52s) [18:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:48] (03PS2) 10Hnowlan: changeprop-jobqueue: increase CPU and memory allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/761677 (https://phabricator.wikimedia.org/T300914) [18:13:04] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@bf5fb8e] (eqiad): Remove unused kartotherian-postgres reference [18:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:18] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@bf5fb8e] (eqiad): Remove unused kartotherian-postgres reference (duration: 00m 14s) [18:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:28] (03CR) 10Andrew Bogott: profile::wmcs::nfsclient: allow hiera to override nfs server fqdn and path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:14:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20564 and previous config saved to /var/cache/conftool/dbconfig/20220210-181447-ladsgroup.json [18:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:07] (03CR) 10Andrew Bogott: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33717/" [puppet] - 
10https://gerrit.wikimedia.org/r/761683 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:16:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1031.eqiad.wmnet with OS buster [18:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [18:17:03] (03PS1) 10Ladsgroup: Revert "db2138: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761417 [18:17:20] (03PS2) 10Ladsgroup: Revert "db2138: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761417 [18:17:24] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db2138: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761417 (owner: 10Ladsgroup) [18:17:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS buster [18:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... 
[18:18:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1033.eqiad.wmnet with OS buster [18:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [18:18:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:22:16] (03PS1) 10Dzahn: Revert "site: add etherpad role to etherpad1003" [puppet] - 10https://gerrit.wikimedia.org/r/761418 [18:22:53] (03PS1) 10Ladsgroup: ContentHandler: Avoding saving in ParserCache in search index jobs [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761419 (https://phabricator.wikimedia.org/T285993) [18:23:03] (03PS2) 10Dzahn: Revert "site: add etherpad role to etherpad1003" [puppet] - 10https://gerrit.wikimedia.org/r/761418 [18:23:14] (03PS1) 10Ladsgroup: ContentHandler: Avoding saving in ParserCache in search index jobs [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761420 (https://phabricator.wikimedia.org/T285993) [18:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300382)', diff saved to https://phabricator.wikimedia.org/P20565 and previous config saved to 
/var/cache/conftool/dbconfig/20220210-182326-marostegui.json [18:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:36] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [18:23:42] jouncebot: nowandnext [18:23:42] For the next 0 hour(s) and 36 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1800) [18:23:42] In 0 hour(s) and 36 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1900) [18:23:50] cool [18:23:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Reverse DNS zones includes for drmrs - https://phabricator.wikimedia.org/T301447 (10cmooney) 05Open→03Resolved a:03cmooney Working ok after merge: ` cmooney@wikilap:~/repos/random_wmf/netbox_scripts$ dig +noall +answer -x 2620:0:860:fe0a::1 @n... [18:23:53] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10cmooney) [18:23:56] (03CR) 10Ladsgroup: [C: 03+2] ContentHandler: Avoding saving in ParserCache in search index jobs [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761420 (https://phabricator.wikimedia.org/T285993) (owner: 10Ladsgroup) [18:23:59] (03CR) 10Ladsgroup: [C: 03+2] ContentHandler: Avoding saving in ParserCache in search index jobs [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761419 (https://phabricator.wikimedia.org/T285993) (owner: 10Ladsgroup) [18:24:12] (03PS1) 10Dzahn: Revert "switch etherpad.discovery.wmnet to etherpad1003" [dns] - 10https://gerrit.wikimedia.org/r/761421 [18:24:55] (03PS1) 10Andrew Bogott: profile::wmcs::nfsclient: fix hiera key name for nfs server mount path [puppet] - 10https://gerrit.wikimedia.org/r/761692 (https://phabricator.wikimedia.org/T301280) [18:25:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P20566 and previous config saved to /var/cache/conftool/dbconfig/20220210-182500-marostegui.json [18:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:17] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@a5be8ac] (eqiad): Remove references to cassandra `storage_id` [18:25:17] (03CR) 10Dzahn: [C: 03+2] Revert "site: add etherpad role to etherpad1003" [puppet] - 10https://gerrit.wikimedia.org/r/761418 (owner: 10Dzahn) [18:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:30] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:25:31] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@a5be8ac] (eqiad): Remove references to cassandra `storage_id` (duration: 00m 15s) [18:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2088:3311 (T300510)', diff saved to https://phabricator.wikimedia.org/P20567 and previous config saved to /var/cache/conftool/dbconfig/20220210-182547-ladsgroup.json [18:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:52] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [18:25:52] (03CR) 10Dzahn: [C: 03+2] Revert "switch etherpad.discovery.wmnet to etherpad1003" [dns] - 10https://gerrit.wikimedia.org/r/761421 (owner: 10Dzahn) [18:26:04] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [18:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:11] (03PS1) 10Dzahn: Revert "etherpad: fix process monitoring after version upgrade, node->nodejs" [puppet] - 
10https://gerrit.wikimedia.org/r/761422 [18:27:31] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@a5be8ac] (eqiad): Remove references to cassandra `storage_id` [18:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:28] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::nfsclient: fix hiera key name for nfs server mount path [puppet] - 10https://gerrit.wikimedia.org/r/761692 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [18:28:32] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@a5be8ac] (eqiad): Remove references to cassandra `storage_id` (duration: 01m 01s) [18:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:43] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298554)', diff saved to https://phabricator.wikimedia.org/P20568 and previous config saved to /var/cache/conftool/dbconfig/20220210-182952-ladsgroup.json [18:29:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [18:29:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [18:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:57] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298554)', diff saved to https://phabricator.wikimedia.org/P20569 and previous config saved to /var/cache/conftool/dbconfig/20220210-182959-ladsgroup.json [18:30:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2088:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20570 and previous config saved to /var/cache/conftool/dbconfig/20220210-183107-ladsgroup.json [18:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:12] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [18:35:12] (03PS1) 10Jgiannelos: mobileapps: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/761695 [18:38:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P20571 and previous config saved to /var/cache/conftool/dbconfig/20220210-183831-marostegui.json [18:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:48] (03Merged) 10jenkins-bot: ContentHandler: Avoding saving in ParserCache in search index jobs [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/761420 (https://phabricator.wikimedia.org/T285993) (owner: 10Ladsgroup) [18:39:54] (03PS2) 10BBlack: Remove cp4031 from cluster data [puppet] - 10https://gerrit.wikimedia.org/r/761012 (https://phabricator.wikimedia.org/T301269) [18:39:56] (03PS1) 10BBlack: lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) [18:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300775)', diff saved to https://phabricator.wikimedia.org/P20572 and previous config saved to /var/cache/conftool/dbconfig/20220210-184004-marostegui.json [18:40:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:40:08] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:10] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [18:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20573 and previous config saved to /var/cache/conftool/dbconfig/20220210-184012-marostegui.json [18:40:13] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.20/includes/content/ContentHandler.php: Backport: [[gerrit:761420|ContentHandler: Avoding saving in ParserCache in search index jobs (T285993)]] (duration: 00m 50s) [18:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:17] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/761695 (owner: 10Jgiannelos) [18:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:24] T285993: [SPIKE] Estimate growth in demand for Parser Cache storage - https://phabricator.wikimedia.org/T285993 [18:40:58] (03CR) 10jerkins-bot: [V: 04-1] lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [18:41:11] (03PS2) 10BBlack: lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) [18:41:29] (03Merged) 10jenkins-bot: ContentHandler: Avoding saving in ParserCache in search index jobs [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/761419 (https://phabricator.wikimedia.org/T285993) (owner: 
10Ladsgroup) [18:41:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:54] (03CR) 10jerkins-bot: [V: 04-1] lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [18:42:35] (03PS3) 10BBlack: lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) [18:42:39] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.21/includes/content/ContentHandler.php: Backport: [[gerrit:761419|ContentHandler: Avoding saving in ParserCache in search index jobs (T285993)]] (duration: 00m 50s) [18:42:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:42:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:09] (03PS2) 10Ssingh: wikimedia-dns.org: add AAAA records for Wikidough [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) [18:43:53] !log lvs1013 - stopping puppet+pybal for move to lvs1017, high-traffic1 traffic fails over to lvs1020 for now - T301142 [18:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:57] T301142: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [18:43:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:21] (03Merged) 10jenkins-bot: 
mobileapps: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/761695 (owner: 10Jgiannelos) [18:44:35] (03CR) 10Ssingh: "Added PTR records, which I had forgotten about in the previous patchset :)" [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [18:44:57] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply on staging [18:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:04] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply on staging [18:45:05] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1033.eqiad.wmnet with OS buster [18:45:07] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply on production [18:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:09] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1032.eqiad.wmnet with OS buster [18:45:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... 
[18:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:12] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1031.eqiad.wmnet with OS buster [18:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... [18:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... 
[18:45:27] (03PS2) 10Ssingh: dnsdist: update AAAA records for check.wikimedia-dns.org [puppet] - 10https://gerrit.wikimedia.org/r/761362 (https://phabricator.wikimedia.org/T301165) [18:45:52] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: sync on staging [18:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop-jobqueue: increase CPU and memory allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/761677 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [18:46:03] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply on production [18:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:06] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply on staging [18:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:11] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:46:43] (03CR) 10Ssingh: [C: 03+2] dnsdist: update AAAA records for check.wikimedia-dns.org [puppet] - 10https://gerrit.wikimedia.org/r/761362 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [18:47:08] (03PS1) 10Volans: requests: fix timeout [software/pywmflib] - 10https://gerrit.wikimedia.org/r/761698 [18:47:23] (03PS4) 10BBlack: lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) [18:47:35] (03PS1) 10Ladsgroup: Revert "db2088: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761423 [18:47:44] (03PS2) 10Ladsgroup: Revert "db2088: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761423 [18:48:27] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:48:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db2088: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761423 (owner: 10Ladsgroup) [18:49:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:35] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync on production [18:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:53] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply on production [18:49:55] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply on staging [18:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:20] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync on production [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P20574 and previous config saved to 
/var/cache/conftool/dbconfig/20220210-185336-marostegui.json [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:42] !log restart all mjolnir daemons on search-loader1001 and 2001 to purge old cached node lists [18:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:07] (03CR) 10BBlack: [C: 03+2] lvs1017 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/761697 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [18:54:17] PROBLEM - Host lvs1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:57:28] (03CR) 10Ottomata: [V: 03+1 C: 03+2] airflow - Set up research instance and deployment of airflow-dags [puppet] - 10https://gerrit.wikimedia.org/r/761670 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [18:58:28] (03PS1) 10Jbond: utils: pcc-facts-upload [puppet] - 10https://gerrit.wikimedia.org/r/761699 [18:58:34] (03PS1) 104nn1l2: urwiki: Add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761700 (https://phabricator.wikimedia.org/T301491) [18:59:31] RECOVERY - Host lvs1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:59:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298554)', diff saved to https://phabricator.wikimedia.org/P20575 and previous config saved to /var/cache/conftool/dbconfig/20220210-185956-ladsgroup.json [19:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:01] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:00:05] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1900). [19:00:05] Ideophagous and zabe: A patch you scheduled for UTC evening backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:19] o/ [19:00:40] I can deploy today [19:01:11] I don't see Ideophagous, so let's start with Zabe's patch [19:01:25] !log otto@deploy1002 Started deploy [airflow-dags/research@b871faf]: (no justification provided) [19:01:26] hi [19:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:32] (03PS1) 10Andrew Bogott: define puppetmaster::web_frontend: add comments for my future self [puppet] - 10https://gerrit.wikimedia.org/r/761702 [19:01:44] (03PS3) 10Andrew Bogott: nfs add_server: create service address with prefix rather than volume name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/761429 [19:01:47] oh, just in time [19:01:51] hello Ideophagous. Are you around? [19:01:52] !log otto@deploy1002 Finished deploy [airflow-dags/research@b871faf]: (no justification provided) (duration: 00m 27s) [19:01:55] yes [19:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:05] RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:02:13] (03PS1) 10Jbond: P:puppet_compiler::puppetdb: delete cronjob [puppet] - 10https://gerrit.wikimedia.org/r/761703 [19:02:14] Ideophagous: it seems it's your first time deploying a patch via B&C, is that right? [19:02:23] indeed, it's my first time [19:02:33] (03CR) 10Jbond: [C: 03+2] utils: pcc-facts-upload [puppet] - 10https://gerrit.wikimedia.org/r/761699 (owner: 10Jbond) [19:02:45] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler::puppetdb: delete cronjob [puppet] - 10https://gerrit.wikimedia.org/r/761703 (owner: 10Jbond) [19:02:50] Ideophagous: do you have the gadget from https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage installed please? 
[19:02:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppet_compiler::puppetdb: delete cronjob [puppet] - 10https://gerrit.wikimedia.org/r/761703 (owner: 10Jbond) [19:03:01] yes, I added the plugin [19:03:05] !log otto@deploy1002 Started deploy [airflow-dags/research@b871faf]: (no justification provided) [19:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:08] !log otto@deploy1002 Finished deploy [airflow-dags/research@b871faf]: (no justification provided) (duration: 00m 03s) [19:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:15] (03PS5) 10Urbanecm: Change / add some namespaces and aliases on arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [19:03:29] (03PS2) 10Andrew Bogott: puppetmaster::web_frontend: add comments for my future self [puppet] - 10https://gerrit.wikimedia.org/r/761702 [19:03:43] should I activate the plugin? [19:03:49] Ideophagous: not yet, I'm reviewing your patch [19:03:53] OK [19:04:04] (03CR) 10Cwhite: [C: 03+1] puppetmaster::web_frontend: add comments for my future self [puppet] - 10https://gerrit.wikimedia.org/r/761702 (owner: 10Andrew Bogott) [19:04:09] I don't understand why there are lines like `NS_TALK => 'مداكرة', // T291737`. What are they supposed to do Ideophagous please? [19:04:10] T291737: Request adding and updating namespaces on arywiki - https://phabricator.wikimedia.org/T291737 [19:04:57] (03CR) 10Andrew Bogott: [C: 03+2] puppetmaster::web_frontend: add comments for my future self [puppet] - 10https://gerrit.wikimedia.org/r/761702 (owner: 10Andrew Bogott) [19:05:16] it's to specify the namespaces [19:05:38] which namespaces? Do you want to change the translation of NS_TALK? 
[19:05:56] yes, you can check the request here: https://phabricator.wikimedia.org/T291737#7700244 [19:06:18] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) [19:06:31] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) 05In progress→03Resolved [19:06:33] there are also two new namespaces that were added [19:06:33] PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0.5 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:06:44] Draft and Portal [19:07:00] Ideophagous: I see. Unfortunately, in that case, I can't deploy your patch today as-is. Changes of namespace translations should be done in mediawiki itself, rather than in our configuration. [19:07:10] I can guide you through that after i finish the other patch [19:07:17] alright [19:07:23] only adding the two new namespaces should be done in config [19:07:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:07:44] so, I shall split it into two patches, right? 
[19:07:57] (03CR) 10Urbanecm: [C: 03+2] Migrate $wmfStandardAutoPromote to $wmgStandardAutoPromote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761441 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:08:24] Ideophagous: there will be a config patch that will add two namespaces (Draft and Portal) [19:08:35] and another patch in mediawiki/core (not operations/mediawiki-config) that will change the translations [19:08:37] (03Merged) 10jenkins-bot: Migrate $wmfStandardAutoPromote to $wmgStandardAutoPromote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761441 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300382)', diff saved to https://phabricator.wikimedia.org/P20576 and previous config saved to /var/cache/conftool/dbconfig/20220210-190840-marostegui.json [19:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:46] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [19:08:51] OK [19:08:56] Ideophagous: you will need to change this file https://github.com/wikimedia/mediawiki/blob/master/languages/messages/MessagesAry.php [19:09:01] (which is the source of the translations) [19:10:02] zabe: pulled to mwdebug1001 [19:10:07] can you spot check please? [19:11:04] !log lvs1017 rebooting for sanity-check after prod config - T301142 [19:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:09] T301142: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [19:11:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:36] urbanecm: flaggedrevs wikis still seem to work, so I think we are good to go [19:11:42] let's do it! 
[19:11:51] zabe: as before, please keep an eye on the logs for a while, just in case [19:12:17] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@828a428] (eqiad): Configure geoshapes postgres max conns [19:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:12:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:01] !log urbanecm@deploy1002 Synchronized wmf-config/flaggedrevs.php: 72f3b31: Migrate $wmfStandardAutoPromote to $wmgStandardAutoPromote (T45956) (duration: 00m 49s) [19:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:06] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [19:13:07] zabe: synced! [19:13:29] thanks :) [19:13:34] any time :) [19:13:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:13:46] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@828a428] (eqiad): Configure geoshapes postgres max conns (duration: 01m 29s) [19:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:51] Ideophagous: please feel free to ping me if i can help you with making the new set of patches. 
[19:15:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P20577 and previous config saved to /var/cache/conftool/dbconfig/20220210-191501-ladsgroup.json [19:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:26] urbanecm: is there a special way to clone the repository? git remote add doesn't seem to work [19:15:40] Ideophagous: you will need to clone that repository with git clone [19:15:47] it's a different repository from the config one [19:15:52] with a different set of files, etc. [19:15:58] OK [19:16:59] cloning now [19:18:23] should I amend the config patch to keep only the new namespaces? [19:18:29] Ideophagous: yes [19:18:29] (03PS1) 10BBlack: lvs1017: fix public1-b/c interface ordering [puppet] - 10https://gerrit.wikimedia.org/r/761704 (https://phabricator.wikimedia.org/T301142) [19:18:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:19:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:07] (03CR) 10BBlack: [C: 03+2] lvs1017: fix public1-b/c interface ordering [puppet] - 10https://gerrit.wikimedia.org/r/761704 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [19:20:41] should the aliases be removed from config too? 
[19:21:54] you'd better keep aliases [19:22:04] PROBLEM - Host lvs1017 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:10] Ideophagous: depends what the aliases are for [19:22:12] RECOVERY - Host lvs1017 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [19:22:26] are they the old translations? In that case, keep the old translation in the Messages file (https://github.com/wikimedia/mediawiki/blob/master/languages/messages/MessagesAry.php). [19:23:21] Yep, those are the old translations, I will remove them from the config patch then [19:23:24] okay [19:23:32] but please keep them in the MessagesAry.php file [19:23:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:21] (well, add them, as they're not there currently as everything falls back to Arabic) [19:24:47] you'll need to add https://github.com/wikimedia/mediawiki/blob/master/languages/messages/MessagesAr.php#L105-L128 to MessagesAry.php [19:25:37] !log lvs1017 reboot again for clean network config - T301142 [19:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:42] T301142: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [19:27:46] PROBLEM - Host lvs1017 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:44] RECOVERY - Host lvs1017 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [19:30:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P20578 and previous config saved to /var/cache/conftool/dbconfig/20220210-193005-ladsgroup.json [19:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:30] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:30:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:31:44] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:31:47] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [19:31:50] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:31:50] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:31:50] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:31:52] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:31:52] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:31:54] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish 
[19:31:56] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:32:04] looking [19:32:08] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:32:14] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:14] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:18] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:20] the wikis seem down [19:32:21] dinner but back [19:32:21] got paged and it sounds like this might [19:32:24] jouncebot: now [19:32:24] For the next 0 hour(s) and 27 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T1900) [19:32:24] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: testlb6_443: Serve [19:32:24] 9.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:32:25] "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" [19:32:35] bblack: ongoing work related? [19:32:37] ^ [19:32:53] receiving user reports [19:32:54] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:32:54] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:32:54] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:32:56] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:32:57] mutante: appserver latency is up so my first guess isn't a traffic-layer issue [19:33:06] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:06] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:10] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:10] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:10] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:10] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:10] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5010 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:22] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:22] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:22] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:33:25] that is all just eqsin though [19:33:36] yeah also true [19:33:38] starting a status doc, I'll take IC [19:33:39] Is B&C still underway? [19:33:40] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.734 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:40] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.168 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:44] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.238 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:46] nn1l2: no, we have an incident [19:33:46] nn1l2: no, we're in an incident [19:33:51] oopsie? 
[19:34:18] ok, so the last action right before that was: 'Repooling after maintenance db1179' [19:34:36] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:34:55] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [19:34:56] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [19:35:03] Please ping me if you continue this window. I'm around. [19:35:11] nn1l2: we will very likely not [19:35:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: 3.25e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [19:35:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: 3.67e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [19:35:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: 3.605e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [19:35:24] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [19:35:26] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:26] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:30] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:35:30] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:30] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:30] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:34] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:36] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:40]  [19:35:50] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:35:54] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5001 is CRITICAL: 3.604e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [19:35:55] (LogstashKafkaConsumerLag) resolved: 
Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:35:56] ready to help, do we have any idea of the cause [19:36:07] I don't think it's related to our eqiad lvs work [19:36:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 3.776e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [19:36:14] at first glance, anyways [19:36:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 3.864e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [19:36:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 3.855e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [19:36:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 3.859e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [19:36:20] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5013 is CRITICAL: 4.052e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [19:36:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5005 is CRITICAL: 3.976e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [19:36:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:36:34] I didn't think it would be but I also didn't think you'd wanna be pushing network changes mid outage [19:36:37] bblack: are these alerts for eqsin expected? [19:36:39] hence my ping in other channel [19:36:45] Hi. [19:36:53] someone created T301505 [19:36:53] T301505: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 [19:36:54] Seeing a lot of "upstream connect error or disconnect/reset before headers. reset reason: overflow" [19:36:54] ShakespeareFan00, there is currently a known incident [19:36:54] this is what I get when I run git review -R for the amended config patch: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10926: character maps to <undefined> [19:36:56] ShakespeareFan00: known incident before you ask :) [19:37:15] That's what I thought [19:37:23] I am not seeing anything on a traceroute [19:37:48] bblack: mutante points out that all the traffic alerts are in eqsin, does that ring any bells for you? [19:38:02] rzl: no [19:38:07] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [19:38:07] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.528 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:09] we're only doing funny things in eqiad right now [19:38:11] Getting ping times of about 17ms to 91.198.174.192 ( text-lb.esams.wikimedia.org ) [19:38:13] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [19:38:22] urbanecm: sorry, forgot to tag you [19:38:34] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 261.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [19:38:50] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:54] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:38:56] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:56] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:38:58] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 
second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:58] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:58] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:58] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:38:58] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:00] Is this outage due to a planned deployment? [19:39:02] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:02] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:02] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:02] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:02] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:04] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.450 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [19:39:06] while a lot of eqsin cps are mad, and seem to be recovering now [19:39:06] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:06] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:10] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:11] i didnt see any icinga errors for the network links there [19:39:18] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5001 is OK: (C)5000 gt (W)3000 gt 342.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [19:39:22] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.452 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:24] i didnt do anything was just browsing icinga [19:39:27] * volans here [19:39:28] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.448 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:39:40] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 384 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [19:39:46] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 552.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [19:39:46] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 457.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [19:39:48] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 194 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [19:39:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5013 is OK: (C)5000 gt (W)3000 gt 429.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [19:39:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5005 is OK: (C)5000 gt (W)3000 gt 362.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [19:39:56] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [19:40:12] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:12] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:12] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5016 is OK: HTTP OK: HTTP/1.1 200 
OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:14] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:24] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:24] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:24] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 457.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [19:40:26] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5002 is OK: (C)5000 gt (W)3000 gt 264 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [19:40:26] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:26] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:26] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:26] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:26] RECOVERY 
- Varnish HTTP text-frontend - port 3123 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:30] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:40:38] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:38] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:38] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:38] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:40:38] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:41:54] so our cr3 router had librenms too long to poll alert at 1939 gmt [19:41:57] in eqsin [19:42:48] this is the same router that has a bad optic causing interface errors but not sure if thats related [19:43:27] usage on the main interfaces on that router seem fine, levels dropping but that's usual for this time of night over there. [19:44:11] health looking ok too (cpu/mem etc.) 
[19:44:15] robh: topranks --> _security [19:44:20] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:45:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298554)', diff saved to https://phabricator.wikimedia.org/P20579 and previous config saved to /var/cache/conftool/dbconfig/20220210-194510-ladsgroup.json [19:45:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [19:45:13] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10Ladsgroup) [19:45:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [19:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:16] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298554)', diff saved to https://phabricator.wikimedia.org/P20580 and previous config saved to /var/cache/conftool/dbconfig/20220210-194518-ladsgroup.json [19:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:26] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1003 is CRITICAL: connect to address 10.64.32.181 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [19:46:32] PROBLEM - etherpad_lite_process_running on etherpad1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/nodejs 
/usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [19:48:16] ^ etherpad alerts are unrelated [19:48:44] we're still working on troubleshooting this, and please continue to hold off with the B&C window [19:49:03] everyone who reported problems, thank you, and we expect it should be resolved -- speak up if you're still seeing issues please :) [19:49:32] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1262522 MB (15% inode=74%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [19:53:27] urbanecm: I ran git review -R on the amended commit, but I can't see the change in gerrit [19:53:35] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1262522 MB (15% inode=74%): andrew bogott Michael investigating - The acknowledgement expires at: 2022-02-11 23:52:55. https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [19:53:42] Ideophagous: can we move to -dev please, [19:53:43] ? [19:53:44] #wikimedia-dev [19:54:01] sure [19:54:21] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [20:00:04] jeena and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220210T2000). [20:00:22] I am assuming we are holding off on train for a bit? 
[20:01:03] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10Ladsgroup) [20:01:04] jeena: we're wrapping up incident response so I think you're clear to go ahead [20:01:27] note we didn't get around to that last B&C window, I'm agnostic as to which should go first but deployers should coordinate :) [20:01:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 55.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:01:48] (or at least I think we didn't) [20:01:51] ^ that's expected [20:01:54] thanks rzl [20:02:04] jeena: you're clear to go from B&C standpoint :) [20:02:19] thanks urbanecm :) [20:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298554)', diff saved to https://phabricator.wikimedia.org/P20581 and previous config saved to /var/cache/conftool/dbconfig/20220210-200304-ladsgroup.json [20:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:09] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:03:43] (03PS1) 10Ideophagous: adding Portal and Draft namespaces to arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 [20:04:47] (03PS1) 10BBlack: Add lvs1017 to pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761709 (https://phabricator.wikimedia.org/T301142) [20:04:58] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup We think this is resolved now. If you still can't access the wikis, please let us know. 
[20:05:18] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/761709 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [20:05:30] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10Dzahn) We think this is resolved now. If you still can't access the wikis, please let us know. [20:06:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Jclark-ctr) [20:06:24] (03PS1) 10Jeena Huneidi: all wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761710 [20:06:26] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761710 (owner: 10Jeena Huneidi) [20:06:52] (03CR) 10BBlack: [C: 03+2] Add lvs1017 to pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761709 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [20:07:13] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761710 (owner: 10Jeena Huneidi) [20:08:01] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 74.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:08:22] (03CR) 10Ideophagous: "I've updated the patch to include only the new namespaces" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (owner: 10Ideophagous) [20:08:31] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.21 refs T300197 [20:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:37] T300197: 1.38.0-wmf.21 deployment 
blockers - https://phabricator.wikimedia.org/T300197 [20:10:35] ACKNOWLEDGEMENT - etherpad.wikimedia.org HTTP on etherpad1003 is CRITICAL: connect to address 10.64.32.181 and port 9001: Connection refused daniel_zahn known issue still being debugged, currently not serving traffic https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [20:10:35] ACKNOWLEDGEMENT - etherpad_lite_process_running on etherpad1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/nodejs /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js daniel_zahn known issue still being debugged, currently not serving traffic https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [20:12:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) [20:14:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:59] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:15:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:15:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to 
https://phabricator.wikimedia.org/P20582 and previous config saved to /var/cache/conftool/dbconfig/20220210-201808-ladsgroup.json [20:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:53] (03CR) 104nn1l2: "Please follow https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines when writing commit messages. Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (owner: 10Ideophagous) [20:22:01] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Jclark-ctr) [20:31:04] (03PS2) 10Ideophagous: adding Portal and Draft namespaces to arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (https://phabricator.wikimedia.org/T291737) [20:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P20583 and previous config saved to /var/cache/conftool/dbconfig/20220210-203313-ladsgroup.json [20:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:32] Can a simple user like me edit the commit message of changes uploaded by other users such as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761708/ [20:36:28] nn1l2: yes, if you don't see the option to do it, you probably have to be added to some user group [20:36:35] nn1l2: I believe so? https://www.mediawiki.org/wiki/Gerrit/Tutorial#Editing_via_the_web-interface [20:36:48] let's see if i can find anything about it [20:37:22] It gives me the following error: An error occurred [20:37:22] Error 403 (Forbidden): modifying commit message not permitted [20:37:22] Endpoint: /changes/*~*/message [20:37:50] Which usergroup? I would like to be added to that usergroup. Thanks [20:37:58] nn1l2: via https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761708/2//COMMIT_MSG,edit#1 ? 
[20:38:17] you need to be in the trusted contribs usergroup [20:38:36] ah [20:38:42] i found https://www.mediawiki.org/wiki/Technical_contributor_onboarding which says "To amend commits created by others, you have to be added to Trusted-Contributors group in Gerrit. See phab:T249413." [20:38:42] T249413: Add Google Summer of Code students to Trusted-Contributors group in Gerrit - https://phabricator.wikimedia.org/T249413 [20:38:57] but that doesn't seem like proper documentation [20:39:27] https://gerrit.wikimedia.org/r/admin/groups/2021f25e7515187a81d51f8fe14dd6f25617cce0 [20:39:30] "Members of this group can amend changes submitted by someone else. The group is viral in that current members can add new members to the group. See T238651." [20:39:31] T238651: Discussion about Trusted-Contributors Gerrit group - https://phabricator.wikimedia.org/T238651 [20:39:33] I see the edit button, and I can play with the text, but my changes don't get saved [20:39:48] so i guess any of us can probably add you there? let me try [20:39:51] I get this error always: An error occurred [20:39:52] Error 403 (Forbidden): edit not permitted [20:39:52] Endpoint: /changes/*~*/edit/* [20:40:16] (03Abandoned) 10Ideophagous: Change / add some namespaces and aliases on arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:40:29] nn1l2: can you try now? i think i added you [20:40:46] (as seen on https://gerrit.wikimedia.org/r/admin/groups/2021f25e7515187a81d51f8fe14dd6f25617cce0,audit-log ) [20:40:49] (03PS3) 104nn1l2: arywiki: Add Portal and Draft namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:41:12] Thanks! 
now it works: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761708/ :) [20:41:54] :D [20:42:19] (03PS1) 10Herron: watchrat: add shop.wm.o to url list [puppet] - 10https://gerrit.wikimedia.org/r/761715 (https://phabricator.wikimedia.org/T299147) [20:42:32] (03CR) 104nn1l2: [C: 03+1] "It looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:42:34] (03PS2) 10Herron: watchrat: add shop.wm.o to url list [puppet] - 10https://gerrit.wikimedia.org/r/761715 (https://phabricator.wikimedia.org/T299147) [20:44:05] (03CR) 10Herron: [C: 03+2] watchrat: add shop.wm.o to url list [puppet] - 10https://gerrit.wikimedia.org/r/761715 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [20:48:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298554)', diff saved to https://phabricator.wikimedia.org/P20584 and previous config saved to /var/cache/conftool/dbconfig/20220210-204818-ladsgroup.json [20:48:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [20:48:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [20:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:48:23] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:48:29] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298554)', diff saved to https://phabricator.wikimedia.org/P20585 and previous config saved to /var/cache/conftool/dbconfig/20220210-204831-ladsgroup.json [20:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:12] (03PS1) 10Zabe: Remove otrs-wiki.wikimedia.org from mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/761717 (https://phabricator.wikimedia.org/T280400) [20:53:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [20:57:30] (03PS1) 10Zabe: test [puppet] - 10https://gerrit.wikimedia.org/r/761718 [21:08:31] !log lvs1017 - bringing pybal online with real routing, flips high-traffic (text-cluster) traffic from lvs1020 -> lvs1017 [21:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298554)', diff saved to https://phabricator.wikimedia.org/P20586 and previous config saved to /var/cache/conftool/dbconfig/20220210-210839-ladsgroup.json [21:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:45] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [21:09:07] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:09:11] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID 
= 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [21:09:13] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:12:21] (03PS1) 10Ryan Kemper: elastic: fix cirrus settings false negative [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T218932) [21:12:44] (03PS2) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T218932) [21:12:59] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T218932) (owner: 10Ryan Kemper) [21:13:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [21:15:08] (03PS3) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) [21:16:11] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:16:28] !log cr1-eqiad - manual config, static fallback for high-traffic1 to lvs1017 [21:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:02] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:18:21] PROBLEM - Check systemd state on etherpad1003 is CRITICAL: CRITICAL - degraded: The following units failed: etherpad-lite.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:55] (ProbeHttpFailed) 
firing: (3) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [21:22:12] ^ is this watchrat thing expected? [21:23:15] (03PS1) 10BBlack: lvs1016: clean up unused hieradata [puppet] - 10https://gerrit.wikimedia.org/r/761725 (https://phabricator.wikimedia.org/T301142) [21:23:17] (03PS1) 10BBlack: lvs1013: deconfigure towards spare::system [puppet] - 10https://gerrit.wikimedia.org/r/761726 (https://phabricator.wikimedia.org/T301142) [21:23:34] its not expected but not serious either, something up with the probe success on a recently added check [21:23:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P20587 and previous config saved to /var/cache/conftool/dbconfig/20220210-212344-ladsgroup.json [21:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:55] (ProbeHttpFailed) firing: (5) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [21:26:08] (03PS4) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) [21:26:28] (03CR) 10BBlack: [C: 03+2] lvs1016: clean up unused hieradata [puppet] - 10https://gerrit.wikimedia.org/r/761725 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [21:26:48] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:27:33] (03CR) 10BBlack: [C: 03+2] lvs1013: deconfigure towards spare::system [puppet] - 
10https://gerrit.wikimedia.org/r/761726 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [21:34:11] (03PS5) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) [21:34:37] (03PS1) 10Dzahn: etherpad: allow setting listening IP in Hiera, use IPv6 on etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) [21:34:48] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:35:32] (03PS2) 10Dzahn: etherpad: allow setting listening IP in Hiera, use IPv6 on etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) [21:35:50] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33727/console" [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:36:55] (03PS3) 10Dzahn: etherpad: allow setting listening IP in Hiera, use IPv6 on etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) [21:38:28] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/761721 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:38:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P20588 and previous config saved to /var/cache/conftool/dbconfig/20220210-213849-ladsgroup.json [21:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:40] (03PS1) 10BBlack: Remove lvs1013 from pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761728 
(https://phabricator.wikimedia.org/T301142) [21:46:53] (03PS4) 10Dzahn: etherpad: allow setting listening IP in Hiera, use IPv6 on etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) [21:50:13] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33729/etherpad1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [21:51:51] (03CR) 10Dzahn: "noop on etherpad1002" [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [21:53:42] (03CR) 10BBlack: [C: 03+2] Remove lvs1013 from pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/761728 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [21:53:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298554)', diff saved to https://phabricator.wikimedia.org/P20589 and previous config saved to /var/cache/conftool/dbconfig/20220210-215354-ladsgroup.json [21:53:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:53:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:59] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [21:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:23] (03PS1) 10Dzahn: Revert "Revert "site: add etherpad role to etherpad1003"" [puppet] - 10https://gerrit.wikimedia.org/r/761425 [21:59:17] (03PS2) 10Dzahn: Revert "Revert "site: add etherpad role to etherpad1003"" [puppet] - 
10https://gerrit.wikimedia.org/r/761425 [21:59:36] hmmm I managed to clear the icinga alerts for cr[12]-eqiad BGP/pybal, but seeing no update here for the recov [22:00:01] maybe they were silenced or something [22:04:03] (03PS1) 10Herron: Revert "watchrat: add shop.wm.o to url list" [puppet] - 10https://gerrit.wikimedia.org/r/761746 [22:04:09] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS buster [22:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:17] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1013.eqiad.wmnet with OS buster [22:05:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "watchrat: add shop.wm.o to url list" [puppet] - 10https://gerrit.wikimedia.org/r/761746 (owner: 10Herron) [22:05:46] (03PS2) 10Herron: Revert "watchrat: add shop.wm.o to url list" [puppet] - 10https://gerrit.wikimedia.org/r/761746 [22:06:10] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10BBlack) [22:07:06] (03CR) 10Herron: [C: 03+2] Revert "watchrat: add shop.wm.o to url list" [puppet] - 10https://gerrit.wikimedia.org/r/761746 (owner: 10Herron) [22:07:49] the name watchrat made me laugh :) [22:08:22] icinga-wm: ping [22:08:35] it'd be nice if there was some easy way to status-check that it's still there [22:08:57] there is, you can send "CUSTOM ACK" from web UI [22:09:23] does that [22:09:25] CUSTOM - dhclient process on etherpad1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:09:28] (or alternatively, have it just spam the channel with a generic message every ~15 minutes to make its absence obvious. 
Maybe a list of the current outstanding total crit/warn/unknown) [22:09:37] mutante: nice! :) [22:10:01] probably better that I direct such things towards AM for the future, though [22:10:29] :) That was "Send custom notification for checked service" from the drop down. [22:12:27] bblack: clearly needs monitoring except that is also an IRC client so watch-icinga-wm joins the channel [22:13:55] (ProbeHttpFailed) firing: (5) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [22:14:38] yea, every hour "You are listening to icinga-wm. and I am still alive" would also work. like FM radio stations have to say "you are listening to $station name" every hour by law [22:15:14] that jinxer-wm stuff looks related to "watchrat" change above [22:18:55] (ProbeHttpFailed) firing: (5) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [22:19:36] herron: watchrat replacing watchmouse cracks me up :) also ^ [22:20:45] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "site: add etherpad role to etherpad1003"" [puppet] - 10https://gerrit.wikimedia.org/r/761425 (owner: 10Dzahn) [22:21:19] rats and mices? 
you need a cat on-site [22:21:56] (03PS1) 10Dzahn: Revert "Revert "switch etherpad.discovery.wmnet to etherpad1003"" [dns] - 10https://gerrit.wikimedia.org/r/761747 [22:22:34] hauskatze: haha, I was about to say that would be you [22:22:39] meow [22:23:55] (ProbeHttpFailed) firing: (5) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [22:24:20] !log etherpad - one more short downtime for maintenance - downtimed in alertmanager and icinga [22:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:21] mutante: ha thanks for the ping I reverted the problem check and it should clear soon [22:26:34] herron: :) cool [22:26:40] its a false positive fwiw [22:26:44] ack [22:26:56] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [22:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:14] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1013.eqiad.wmnet with OS buster [22:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:22] 10SRE, 10ops-eqiad, 10Traffic-Icebox: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1013.eqiad.wmnet with OS buster completed: - lvs1013 (**PASS**) - Downtimed on Icinga -... [22:27:33] RECOVERY - Check systemd state on etherpad1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:50] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "switch etherpad.discovery.wmnet to etherpad1003"" [dns] - 10https://gerrit.wikimedia.org/r/761747 (owner: 10Dzahn) [22:27:55] etherpad seems to not be working? [22:28:05] bblack: it's me! 
1 min [22:28:09] ok :) [22:28:18] I am trying to switch it to new version, attempt 2 [22:28:26] sounds "easy" :) [22:28:36] lol, I have an entire story for later [22:28:59] !log bblack@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [22:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [22:31:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [22:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:37] bblack: try now [22:32:54] 10ops-eqiad, 10DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T300820 (10BBlack) 05Open→03Declined Yeah declining for now. If it's the cable, I imagine we'll see this again after it moves to lvs1019 in T301142 and we can try a fresh task! [22:33:51] mutante: seems to work! [22:33:55] (ProbeHttpFailed) resolved: (4) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [22:34:34] bblack: happy! 
because this was bullseye->buster, etherpad ->1.8.16, IPv4 -> IPv6, envoy :p [22:34:36] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [22:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:52] eh, the other way around, to bullseye of course [22:35:30] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:35:57] ^ that should recover [22:38:02] 10SRE, 10ops-eqiad, 10Traffic-Icebox: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10BBlack) lvs1013 -> lvs1017 is complete, including cleanup. Since the process is tricky to get right and is a corner case for so much of our automation, I've documented it loosely in etherp... [22:39:39] !log etherpad - succesfully switched to etherpad1003 (bullseye) and etherpad 1.8.16 - on second attempt after making it listen on IPv6 to work behind envoy (T300568) - https://gerrit.wikimedia.org/r/c/operations/puppet/+/761727/ [22:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:44] T300568: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 [22:44:53] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1003 is OK: HTTP OK: HTTP/1.1 200 OK - 6446 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [22:47:01] RECOVERY - etherpad_lite_process_running on etherpad1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/nodejs /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [22:47:14] (03Abandoned) 10Dzahn: Revert "etherpad: fix process monitoring after version upgrade, node->nodejs" [puppet] - 10https://gerrit.wikimedia.org/r/761422 (owner: 10Dzahn) [22:47:42] 
(03PS2) 10Dzahn: site: move etherpad1002 back to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/761669 (https://phabricator.wikimedia.org/T300568) [22:48:32] (03CR) 10Dzahn: [C: 03+2] site: move etherpad1002 back to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/761669 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [22:52:30] (03PS1) 10Dzahn: etherpad: make listening on IPv6 the default now [puppet] - 10https://gerrit.wikimedia.org/r/761737 (https://phabricator.wikimedia.org/T300568) [22:59:53] RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:41] (03PS1) 10Jdlrobson: Make Vector 2022 the default skin for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761739 (https://phabricator.wikimedia.org/T298519) [23:01:45] (03CR) 10jerkins-bot: [V: 04-1] Make Vector 2022 the default skin for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761739 (https://phabricator.wikimedia.org/T298519) (owner: 10Jdlrobson) [23:03:22] (03PS2) 10Jdlrobson: Make Vector 2022 the default skin for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761739 (https://phabricator.wikimedia.org/T298519) [23:03:57] (03CR) 10Clare Ming: [C: 03+1] Make Vector 2022 the default skin for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761739 (https://phabricator.wikimedia.org/T298519) (owner: 10Jdlrobson) [23:09:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [23:10:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [23:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:05] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298554)', diff saved to https://phabricator.wikimedia.org/P20590 and previous config saved to /var/cache/conftool/dbconfig/20220210-231004-ladsgroup.json [23:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:09] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [23:10:28] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (10Dzahn) Done. This is in use now in production and etherpad1002 does not have the etherpad role anymore. [23:10:50] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33730/etherpad1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/761737 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [23:10:57] (03PS2) 10Dzahn: etherpad: make listening on IPv6 the default now [puppet] - 10https://gerrit.wikimedia.org/r/761737 (https://phabricator.wikimedia.org/T300568) [23:12:29] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33730/etherpad1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/761737 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [23:14:56] (03PS3) 10Dzahn: DHCP: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761661 [23:14:58] (03PS3) 10Dzahn: site: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761662 (https://phabricator.wikimedia.org/T300568) [23:18:25] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 
(T298554)', diff saved to https://phabricator.wikimedia.org/P20591 and previous config saved to /var/cache/conftool/dbconfig/20220210-232911-ladsgroup.json [23:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:16] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [23:37:08] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (10Dzahn) What went wrong here at first: When we switched from etherpad1002 to etherpad1003, etherpad itself worked (curl ht... [23:38:08] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (10Dzahn) [23:38:44] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (10Dzahn) 05In progress→03Resolved a:03Dzahn [23:38:51] jouncebot: now [23:38:51] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [23:39:07] (03CR) 10Dzahn: "https://etherpad.org/doc/v1.8.16/ -> "IP" "IP which etherpad should bind at. 
Change to :: for IPv6"" [puppet] - 10https://gerrit.wikimedia.org/r/761727 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [23:44:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P20592 and previous config saved to /var/cache/conftool/dbconfig/20220210-234416-ladsgroup.json [23:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P20593 and previous config saved to /var/cache/conftool/dbconfig/20220210-235920-ladsgroup.json [23:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log