[00:00:04] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220204T0000).
[00:00:05] cjming: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:16] o/
[00:00:17] o/
[00:04:17] (CR) Clare Ming: [C: +2] Update icons, wordmark for test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/759560 (https://phabricator.wikimedia.org/T299512) (owner: Clare Ming)
[00:05:20] (Merged) jenkins-bot: Update icons, wordmark for test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/759560 (https://phabricator.wikimedia.org/T299512) (owner: Clare Ming)
[00:09:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:02] !log cjming@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:759560|Update icons, wordmark for test wikis (T299512)]] (duration: 00m 53s)
[00:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:06] T299512: Update logos icons and wordmarks for test wikis - https://phabricator.wikimedia.org/T299512
[00:10:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:07] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759560|Update icons, wordmark for test wikis 
(T299512)]] (duration: 00m 49s)
[00:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:38] !log end of UTC late backport & config window
[00:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:34] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (BBlack) I can fill in the scenario/story part a bit! For background: * Technically, LVS and pybal are separate things running on the same server. L...
[00:23:17] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) is CRITICAL: Test Get i18n strings for the Page Content Service returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English
[00:23:17] a responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:35:14] brennen: I snuck https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Thanks/+/759319 into the window, if it's still open.
[00:38:06] (There might have been a minor miscommunication about who should be adding it to the list.)
[00:42:07] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:36] Kemayo: i'm only half a beer in, so i can probably sling that out.
[00:43:04] A sentence that always precedes something going well! 
[00:43:11] :D
[00:43:25] i'll go ahead and +2 and let you know when it's testable.
[00:43:57] (CR) Brennen Bearnes: [C: +2] Correct attribute for flow thanks [extensions/Thanks] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/759319 (https://phabricator.wikimedia.org/T300831) (owner: Kosta Harlan)
[00:47:53] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:16] !log reopening utc late backport window for [[gerrit:759319|Correct attribute for flow thanks (T300831)]]
[00:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:21] T300831: Cannot thank a Flow post on mediawiki.org: "Uncaught TypeError: elem is undefined" - https://phabricator.wikimedia.org/T300831
[00:52:52] 5-6 minutes left on ci.
[00:53:00] It's a slow one.
[00:56:13] (CR) Cwhite: [C: +1] "Great!" [alerts] - https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[00:59:25] (Merged) jenkins-bot: Correct attribute for flow thanks [extensions/Thanks] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/759319 (https://phabricator.wikimedia.org/T300831) (owner: Kosta Harlan)
[01:01:13] (CR) Cwhite: [C: +1] "nit inline, but LGTM" [puppet] - https://gerrit.wikimedia.org/r/759517 (owner: Filippo Giunchedi)
[01:02:00] Kemayo: testable on mwdebug1002
[01:02:13] (assuming testable.)
[01:02:38] brennen: Okay, it tests out.
[01:03:02] cool, syncing. 
[01:04:09] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.20/extensions/Thanks/modules/ext.thanks.flowthank.js: Backport: [[gerrit:759319|Correct attribute for flow thanks (T300831)]] (duration: 00m 49s)
[01:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:14] T300831: Cannot thank a Flow post on mediawiki.org: "Uncaught TypeError: elem is undefined" - https://phabricator.wikimedia.org/T300831
[01:04:37] and done.
[01:04:51] !log for-real end of utc late backport & config window
[01:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:00] brennen: Thanks!
[01:05:07] yep! have a good one. :)
[01:05:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:06:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:53] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobil
[01:13:53] 8service%29
[01:29:03] PROBLEM - Disk space on 
elastic2035 is CRITICAL: DISK CRITICAL - free space: / 900 MB (3% inode=94%): /tmp 900 MB (3% inode=94%): /var/tmp 900 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[01:40:05] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[02:29:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:35:25] ACKNOWLEDGEMENT - Disk space on elastic2035 is CRITICAL: DISK CRITICAL - free space: / 837 MB (3% inode=94%): /tmp 837 MB (3% inode=94%): /var/tmp 837 MB (3% inode=94%): Ryan Kemper https://phabricator.wikimedia.org/T298853 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[02:41:13] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic2035.codfw.wmnet
[02:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:18] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts elastic2035.codfw.wmnet
[02:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:50] (PS1) Ryan Kemper: elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759636
[02:46:29] (PS2) Ryan Kemper: elastic: decom elastic2035 [puppet] - 
https://gerrit.wikimedia.org/r/759636 (https://phabricator.wikimedia.org/T298853)
[02:46:38] (CR) Ryan Kemper: [V: +2 C: +2] elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759636 (https://phabricator.wikimedia.org/T298853) (owner: Ryan Kemper)
[02:48:56] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic2035.codfw.wmnet
[02:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:41] ops-codfw, Discovery-Search, decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T300946 (RKemper)
[02:53:53] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:55:29] (PS1) Ryan Kemper: elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759637
[03:33:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[03:35:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[04:56:25] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:59:59] !log uploaded pygments 2.11.2 to apt.wm.o (T298399)
[06:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] T298399: New upstream release for Pygments - https://phabricator.wikimedia.org/T298399
[06:02:39] (CR) Andrew Bogott: docker_entry.sh: override debian mirror (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759552 (owner: Arturo Borrero Gonzalez)
[06:31:27] (PS1) Marostegui: Revert "db2134: Disable notifications" [puppet] - 
https://gerrit.wikimedia.org/r/759320
[06:32:08] (CR) Marostegui: [C: +2] Revert "db2134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/759320 (owner: Marostegui)
[06:35:18] (CR) Giuseppe Lavagetto: [C: -2] docker_entry.sh: override debian mirror (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759552 (owner: Arturo Borrero Gonzalez)
[06:36:39] (PS1) Marostegui: Revert "db2148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/759321
[06:37:22] (CR) Marostegui: [C: +2] Revert "db2148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/759321 (owner: Marostegui)
[07:00:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 schema change', diff saved to https://phabricator.wikimedia.org/P20163 and previous config saved to /var/cache/conftool/dbconfig/20220204-070003-marostegui.json
[07:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:48] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:33] !log clean up wmf_auto_restart_prometheus-mysqld-exporter@matomo on matomo1002 (not used anymore, listed as failed)
[07:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:35] !log cleanup wmf_auto_restart_prometheus-mysqld-exporter@analytics-meta on an-test-coord1001 and unmasked wmf_auto_restart_prometheus-mysqld-exporter (now used)
[07:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:48] (PS1) Ladsgroup: auto_schema: Stop replication before running schema changes [software] - https://gerrit.wikimedia.org/r/759645 (https://phabricator.wikimedia.org/T300702)
[07:17:53] SRE, Wikidata, Wikidata Query UI, wdwb-tech, Patch-For-Review: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (elukey) 
Found the issue: ` diff --git a/maint.html b/maint.html index 703e17a..e63c70e 100644 --- a/maint.html +++ b/maint.html @@ -1 +1 @@ -
!log `git checkout main.html` on miscweb1002:/srv/org/wikidata/query to avoid puppet corrective actions (and the host being listed in alarms)
[07:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:45] SRE, Wikidata, Wikidata Query UI, wdwb-tech, Patch-For-Review: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (elukey) Open→Resolved a:elukey
[07:23:55] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:59:17] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220204T0800)
[08:00:28] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: JHathaway)
[08:01:39] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 337, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:04:47] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: elastic1052, elastic1063, elastic1057, elastic1065, elastic1066, elastic1084, elastic1086, elastic1060, elastic1056, elastic1058, cloudmetrics1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:07:21] (CR) Muehlenhoff: [C: +1] "Looks good. Feel free to ping me before merging, we can first apply/test it on the secondary host." 
[puppet] - https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: Jbond)
[08:07:40] (CR) Filippo Giunchedi: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: Herron)
[08:07:58] (PS1) Elukey: admin: add new ssh key for elukey [puppet] - https://gerrit.wikimedia.org/r/759677
[08:09:48] (CR) Elukey: [C: +2] admin: add new ssh key for elukey [puppet] - https://gerrit.wikimedia.org/r/759677 (owner: Elukey)
[08:16:13] SRE, Wikidata, Wikidata Query UI, wdwb-tech, Patch-For-Review: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (Addshore) Thanks for the fix @elukey
[08:18:13] (CR) Marostegui: [C: +1] auto_schema: Stop replication before running schema changes [software] - https://gerrit.wikimedia.org/r/759645 (https://phabricator.wikimedia.org/T300702) (owner: Ladsgroup)
[08:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist group from s4 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P20164 and previous config saved to /var/cache/conftool/dbconfig/20220204-082010-marostegui.json
[08:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:16] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[08:21:56] (PS8) Ayounsi: Port labs-in4/6 to Capirca [homer/public] - https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461)
[08:23:33] (Abandoned) Ayounsi: Add some comments to the rules for backup traffic to eqiad ceph [homer/public] - https://gerrit.wikimedia.org/r/759590 (owner: Andrew Bogott)
[08:28:25] SRE-tools, Infrastructure-Foundations, Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (fgiunchedi) >>! In T293209#7675209, @Volans wrote: >>>! In T293209#7675048, @fgiunchedi wrote: >>>>! 
In T293209#7670485, @Volans wrote: >>> - To suppor...
[08:29:31] (PS9) Ayounsi: Port labs-in4/6 to Capirca [homer/public] - https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461)
[08:32:30] (PS1) Elukey: profile::docker::engine: add param to ignore docker storage settings [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744)
[08:33:50] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33577/console" [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:41:22] (PS2) Elukey: profile::docker::engine: add param to ignore docker storage settings [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744)
[08:42:02] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33578/console" [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:42:04] (CR) jerkins-bot: [V: -1] profile::docker::engine: add param to ignore docker storage settings [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:44:37] (CR) Ladsgroup: [C: +2] auto_schema: Stop replication before running schema changes [software] - https://gerrit.wikimedia.org/r/759645 (https://phabricator.wikimedia.org/T300702) (owner: Ladsgroup)
[08:45:08] (Merged) jenkins-bot: auto_schema: Stop replication before running schema changes [software] - https://gerrit.wikimedia.org/r/759645 (https://phabricator.wikimedia.org/T300702) (owner: Ladsgroup)
[08:51:25] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (akosiaris) >>! 
In T300877#7677259, @BBlack wrote: > I can fill in the scenario/story part a bit! For background: > > * Without static routes, if pyb...
[08:55:48] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (Volans) For the human-generated part that seems easy to prevent automating the process via a cookbook that can have all the checks and fail safes need...
[08:58:58] (PS6) JMeybohm: Add kubernetes-staging to conftool-data [puppet] - https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740)
[08:59:14] (CR) 4nn1l2: "Considering that your homewiki is involved" [mediawiki-config] - https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913) (owner: 4nn1l2)
[09:00:29] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:00:47] (CR) JMeybohm: [C: +2] Add kubernetes-staging to conftool-data [puppet] - https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm)
[09:01:18] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (akosiaris) >>! In T300877#7677666, @Volans wrote: > For the human-generated part that seems easy to prevent automating the process via a cookbook that... 
[09:05:22] (PS3) 4nn1l2: Remove redundant patrolmarks flag from patroller usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913)
[09:07:35] (CR) Ayounsi: "Full diff is in https://phabricator.wikimedia.org/P20112" [homer/public] - https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi)
[09:12:57] SRE, ops-codfw, Discovery-Search (Current work), Patch-For-Review: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (Aklapper)
[09:13:38] (CR) Cparle: [C: +1] Stop capturing media change tags [mediawiki-config] - https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362) (owner: Matthias Mullie)
[09:16:09] SRE, SRE-tools, Infrastructure-Foundations, homer, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi) In progress→Resolved We're now using Capirca to manage most of our router ACLs
[09:18:42] (PS3) Elukey: profile::docker::engine: add param to ignore docker storage settings [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744)
[09:19:50] (PS6) JMeybohm: Add LVS service k8s-ingress-staging [puppet] - https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740)
[09:20:26] (CR) Elukey: "Just realized that this change assumes the usage of /var/lib/docker. If this is not the case, and we want to have a dedicated dir/mountpoi" [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[09:22:48] SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (akosiaris) Re-reading my reply, I realized I may appear pro having those static routes (I am actually not) whereas my intent was to just provide a dat... 
[09:25:16] (PS1) Muehlenhoff: puppetboard: Also grant access to cn=idptest-users within the profile [puppet] - https://gerrit.wikimedia.org/r/759680
[09:26:23] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:32:06] (CR) Volans: [C: -1] "Looks mostly fine, I think there are a couple of queries that don't use the safe parameter and some question inline." [software/wmfdb] - https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: Kormat)
[09:38:48] (PS4) Jelto: gitlab_runner: execute gitlab-runner as non-root [puppet] - https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481)
[09:46:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[09:51:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[09:53:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1008.eqiad.wmnet
[09:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1008.eqiad.wmnet
[09:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1008.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[10:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) 
for new host ganeti1008.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[10:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:35] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff)
[10:19:51] SRE, Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (Vgutierrez) so... right now HTTP/1.0 requests from PHP 7.3 are technically working but there is some obvious issue as those requests are really slow....
[10:27:11] (CR) Arturo Borrero Gonzalez: docker_entry.sh: override debian mirror (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759552 (owner: Arturo Borrero Gonzalez)
[10:28:15] (CR) Ayounsi: P:installserver::proxy: Add domain whitelist to proxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[10:30:50] (CR) Arturo Borrero Gonzalez: docker_entry.sh: override debian mirror (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759552 (owner: Arturo Borrero Gonzalez)
[10:30:57] (Abandoned) Arturo Borrero Gonzalez: docker_entry.sh: override debian mirror [debs/mcrouter] - https://gerrit.wikimedia.org/r/759552 (owner: Arturo Borrero Gonzalez)
[10:32:40] (CR) Arturo Borrero Gonzalez: mcrouter: introduce updates for Debian Bullseye (2 comments) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) (owner: Arturo Borrero Gonzalez)
[10:39:51] (PS3) Arturo Borrero Gonzalez: mcrouter: add .gitreview file [debs/mcrouter] - https://gerrit.wikimedia.org/r/759551
[10:39:53] (PS7) Arturo Borrero Gonzalez: mcrouter: introduce updates for Debian Bullseye [debs/mcrouter] - https://gerrit.wikimedia.org/r/759522 
(https://phabricator.wikimedia.org/T300578)
[10:39:55] (PS4) Arturo Borrero Gonzalez: d/changelog: generate entry for 2022.01.31.00 [debs/mcrouter] - https://gerrit.wikimedia.org/r/759553
[10:39:57] (PS4) Arturo Borrero Gonzalez: gitignore: ignore additional debian artifacts [debs/mcrouter] - https://gerrit.wikimedia.org/r/759554
[10:39:59] (PS1) Arturo Borrero Gonzalez: mcrouter: speed up build time by using parallelism [debs/mcrouter] - https://gerrit.wikimedia.org/r/759681
[10:41:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20165 and previous config saved to /var/cache/conftool/dbconfig/20220204-104102-root.json
[10:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:48] (CR) Marostegui: [C: +1] Add change_ar_timestamp_T298554.py [software/schema-changes] - https://gerrit.wikimedia.org/r/759578 (https://phabricator.wikimedia.org/T298554) (owner: Ladsgroup)
[10:47:35] (PS1) Marostegui: add_tl_target_id_T300775.py: Remove stop/start slave [software/schema-changes] - https://gerrit.wikimedia.org/r/759683 (https://phabricator.wikimedia.org/T300775)
[10:49:03] (CR) Arturo Borrero Gonzalez: mcrouter: add .gitreview file (1 comment) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759551 (owner: Arturo Borrero Gonzalez)
[10:51:01] (CR) Arturo Borrero Gonzalez: mcrouter: introduce updates for Debian Bullseye (3 comments) [debs/mcrouter] - https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) (owner: Arturo Borrero Gonzalez)
[10:53:31] jynus: sorry had forgot to reset the topic yesterday night, done now
[10:53:59] not your fault, I was hoping for the patch to be op being merged :-(
[10:54:49] (CR) Jelto: [V: +1] "thanks for the detailed review. I added a new patch set." 
[puppet] - https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20170 and previous config saved to /var/cache/conftool/dbconfig/20220204-105606-root.json
[10:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:07] SRE, Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (Vgutierrez) the described issue has been reported to upstream on https://github.com/envoyproxy/envoy/issues/19821
[11:01:53] Puppet, Infrastructure-Foundations, good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (Aklapper)
[11:02:46] SRE, Traffic, Upstream: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (Aklapper)
[11:04:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove all special groups from s1 codfw T263127', diff saved to https://phabricator.wikimedia.org/P20171 and previous config saved to /var/cache/conftool/dbconfig/20220204-110427-marostegui.json
[11:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:33] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[11:07:28] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[11:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20172 and previous config saved to /var/cache/conftool/dbconfig/20220204-111110-root.json
[11:11:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:39] (CR) Ladsgroup: [C: +1] add_tl_target_id_T300775.py: Remove stop/start slave [software/schema-changes] - https://gerrit.wikimedia.org/r/759683 (https://phabricator.wikimedia.org/T300775) (owner: Marostegui)
[11:12:54] (CR) Ladsgroup: [V: +2 C: +2] Add change_ar_timestamp_T298554.py [software/schema-changes] - https://gerrit.wikimedia.org/r/759578 (https://phabricator.wikimedia.org/T298554) (owner: Ladsgroup)
[11:13:45] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:57] (CR) Marostegui: [V: +2 C: +2] add_tl_target_id_T300775.py: Remove stop/start slave [software/schema-changes] - https://gerrit.wikimedia.org/r/759683 (https://phabricator.wikimedia.org/T300775) (owner: Marostegui)
[11:14:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1020.eqiad.wmnet with OS buster
[11:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:03] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1020.eqiad.wmnet with OS buster
[11:15:33] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:31] (PS2) Kormat: wmfdb/cli_admin/db_compare: Add db-compare utility. [software/wmfdb] - https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236)
[11:16:44] (CR) Kormat: wmfdb/cli_admin/db_compare: Add db-compare utility. 
(0310 comments) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [11:17:47] (03CR) 10Kormat: wmfdb/cli_admin/db_compare: Add db-compare utility. (032 comments) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [11:22:26] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:24:16] PROBLEM - MariaDB Replica Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1181.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:24:28] PROBLEM - MariaDB Replica Lag: x1 on db2096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1192.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:24:40] (03PS1) 10Marostegui: change_ir_value_T300382.py: Fix typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759689 (https://phabricator.wikimedia.org/T298554) [11:24:41] ^ checking [11:25:28] RECOVERY - MariaDB Replica Lag: x1 on db2096 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:26:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20173 and previous config saved to /var/cache/conftool/dbconfig/20220204-112613-root.json [11:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:20] RECOVERY - MariaDB Replica Lag: x1 on db2101 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:27:38] RECOVERY - k8s API server requests latencies 
on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:27:51] (03PS1) 10Marostegui: db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759690 (https://phabricator.wikimedia.org/T300965) [11:29:11] (03CR) 10Marostegui: [C: 03+2] db2096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759690 (https://phabricator.wikimedia.org/T300965) (owner: 10Marostegui) [11:31:15] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (30163) = 92.4% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:32:05] 10ops-codfw, 10DBA, 10Patch-For-Review: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Marostegui) This host is out of warranty, so not sure if we might have spare DIMMS or if this is an issue with the memory slot. Needs coordination with #dc-ops [11:33:15] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (30163) = 92.4% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:33:20] ACKNOWLEDGEMENT - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (30163) = 92.4% Marostegui tendril is being shutdown https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:35:39] why did it fire twice without a recover? 
[11:36:18] weird [11:36:28] (03Abandoned) 10Marostegui: change_ir_value_T300382.py: Fix typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759689 (https://phabricator.wikimedia.org/T298554) (owner: 10Marostegui) [11:37:44] (03PS1) 10Marostegui: change_ar_timestamp_T298554.py: Fix typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759691 (https://phabricator.wikimedia.org/T298554) [11:37:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [11:41:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20174 and previous config saved to /var/cache/conftool/dbconfig/20220204-114117-root.json [11:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [11:46:25] (03CR) 10Muehlenhoff: [C: 03+2] puppetboard: Also grant access to cn=idptest-users within the profile [puppet] - 10https://gerrit.wikimedia.org/r/759680 (owner: 10Muehlenhoff) [11:46:46] (03CR) 10Marostegui: add_tl_target_id_T300775.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759379 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [11:49:49] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:05:40] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+2] base: standard_packages: don't install hp-health on Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [12:06:23] (03CR) 10Arturo Borrero Gonzalez: "thanks John and Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [12:08:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1020.eqiad.wmnet with OS buster [12:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:13] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1020.eqiad.wmnet with OS buster completed: - ganeti1020 (**PASS**)... [12:12:39] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:14:35] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:16:49] (03PS15) 10D3r1ck01: Define a contact form for Chapter/Thorg application status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748120 (https://phabricator.wikimedia.org/T298024) [12:28:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:31:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2043-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [12:33:18] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:38:34] (03PS3) 10Jbond: C:mw_rc_irc::ircserver: Refresh ircd services on config changes [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) [12:39:36] (03CR) 10Jbond: [C: 03+2] C:mw_rc_irc::ircserver: Refresh ircd services on config changes [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [12:39:55] (03CR) 10Jbond: [C: 03+2] C:mw_rc_irc::ircserver: Refresh ircd services on config changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [12:43:26] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:56:02] PROBLEM - etcd request latencies on kubestagemaster2001 
is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:58:28] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:59:04] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:03:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mcrouter: speed up build time by using parallelism [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759681 (owner: 10Arturo Borrero Gonzalez) [13:04:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mcrouter: introduce updates for Debian Bullseye [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) (owner: 10Arturo Borrero Gonzalez) [13:07:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "update the commit message, then LGTM 😊" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 (owner: 10Arturo Borrero Gonzalez) [13:10:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add tls port for cloud vps rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/755478 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [13:11:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wheel of misfortune: check if pid exists first [puppet] - 10https://gerrit.wikimedia.org/r/731924 (owner: 10Majavah) [13:15:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] google_api_proxy: fix proxies and ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/756124 (owner: 10Majavah) [13:17:12] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[13:17:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "hey @Valentin, what about this patch?" [puppet] - 10https://gerrit.wikimedia.org/r/759439 (https://phabricator.wikimedia.org/T292619) (owner: 10Majavah) [13:18:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: fix up check_flavor_properties [puppet] - 10https://gerrit.wikimedia.org/r/759443 (owner: 10Majavah) [13:20:49] (03PS1) 10Joal: Bump AQS druid datasource [puppet] - 10https://gerrit.wikimedia.org/r/759702 [13:22:12] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] mcrouter: speed up build time by using parallelism [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759681 (owner: 10Arturo Borrero Gonzalez) [13:23:55] (03CR) 10Ladsgroup: [C: 03+2] change_ar_timestamp_T298554.py: Fix typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759691 (https://phabricator.wikimedia.org/T298554) (owner: 10Marostegui) [13:24:00] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:02] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mcrouter: add .gitreview file [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759551 (owner: 10Arturo Borrero Gonzalez) [13:26:00] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] mcrouter: add .gitreview file [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759551 (owner: 10Arturo Borrero Gonzalez) [13:26:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [13:26:20] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] mcrouter: introduce updates for Debian Bullseye [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759522 (https://phabricator.wikimedia.org/T300578) (owner: 10Arturo Borrero 
Gonzalez) [13:27:07] (03PS5) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.41.0-2 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 [13:27:18] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] d/changelog: generate entry for 0.41.0-2 [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759553 (owner: 10Arturo Borrero Gonzalez) [13:27:29] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] gitignore: ignore additional debian artifacts [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759554 (owner: 10Arturo Borrero Gonzalez) [13:28:08] (03PS5) 10Arturo Borrero Gonzalez: gitignore: ignore additional debian artifacts [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759554 [13:28:57] (03CR) 10Jbond: [C: 04-1] ferm: replace systemd unit to ensure success on boot (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758548 (owner: 10JHathaway) [13:29:43] (03CR) 10Marostegui: [V: 03+2] change_ar_timestamp_T298554.py: Fix typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/759691 (https://phabricator.wikimedia.org/T298554) (owner: 10Marostegui) [13:36:18] (03CR) 10Vgutierrez: [C: 03+1] "the watchdog has been working fine for us, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759439 (https://phabricator.wikimedia.org/T292619) (owner: 10Majavah) [13:38:41] (03PS1) 10Arturo Borrero Gonzalez: mcrouter: bump to a newer upstream version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759703 [13:40:07] (03PS2) 10Arturo Borrero Gonzalez: mcrouter: bump to a newer upstream version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759703 [13:48:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:acme_chief: set watchdog_sec default on cloud [puppet] - 10https://gerrit.wikimedia.org/r/759439 (https://phabricator.wikimedia.org/T292619) (owner: 10Majavah) [13:51:03] (03PS1) 10Stang: Update $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 [13:51:05] (03CR) 10Welcome, new contributor!: "Thank you for 
making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (owner: 10Stang) [14:06:04] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [14:15:42] (03CR) 10Btullis: [C: 03+2] Bump AQS druid datasource [puppet] - 10https://gerrit.wikimedia.org/r/759702 (owner: 10Joal) [14:18:30] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:18:39] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [14:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:50] (03PS1) 10Cathal Mooney: Add new function to return device 'underlay' network links. [software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) [14:25:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [14:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:37] (03CR) 10jerkins-bot: [V: 04-1] Add new function to return device 'underlay' network links. [software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:26:25] (03Abandoned) 10Michael DiPietro: mcrouter for bullseye. This does some unexpected things, in particular build.sh doesn't seem to work, errors out with a seg fault. README updated with instructions on how to make it work. 
[debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759539 (https://phabricator.wikimedia.org/T300578) (owner: 10Michael DiPietro) [14:28:58] (03PS1) 10Cathal Mooney: Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [14:29:30] (03CR) 10jerkins-bot: [V: 04-1] Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:32:03] (03PS2) 10Cathal Mooney: Add new function to return device 'underlay' network links. [software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) [14:34:48] (03CR) 10jerkins-bot: [V: 04-1] Add new function to return device 'underlay' network links. [software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:40:12] (03PS3) 10Cathal Mooney: Add new function to return device 'underlay' network links. 
[software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) [14:42:11] (03CR) 10Ayounsi: "That probably should live in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/w" [software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:44:38] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:49:16] (03PS1) 10Elukey: admin: set only one key for elukey [puppet] - 10https://gerrit.wikimedia.org/r/759710 [14:50:07] (03CR) 10Majavah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (owner: 10Stang) [14:50:11] (03CR) 10AntiCompositeNumber: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (owner: 10Stang) [14:50:24] (03PS4) 10Cathal Mooney: Add new function to return device 'underlay' network links. 
[software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) [14:50:33] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) [14:53:39] (03CR) 10Elukey: [C: 03+2] admin: set only one key for elukey [puppet] - 10https://gerrit.wikimedia.org/r/759710 (owner: 10Elukey) [15:06:10] (03PS2) 10Filippo Giunchedi: prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 [15:07:08] (03CR) 10jerkins-bot: [V: 04-1] prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 (owner: 10Filippo Giunchedi) [15:09:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33581/console" [puppet] - 10https://gerrit.wikimedia.org/r/759517 (owner: 10Filippo Giunchedi) [15:09:49] (03PS3) 10Filippo Giunchedi: prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 [15:10:18] (03PS2) 10Cathal Mooney: Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [15:10:52] (03CR) 10jerkins-bot: [V: 04-1] Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:12:27] (03PS2) 10Stang: Update $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) [15:12:31] (03CR) 10Filippo Giunchedi: prometheus: relabel 'instance' in job=prometheus with hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759517 (owner: 10Filippo Giunchedi) [15:24:36] 
(03PS3) 10Cathal Mooney: Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [15:25:12] (03CR) 10jerkins-bot: [V: 04-1] Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:25:48] (03PS5) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) [15:26:36] (03CR) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: 10JHathaway) [15:26:54] (03PS4) 10Cathal Mooney: Base config additions and updated tempaltes to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [15:27:17] topranks: s/tempaltes/templates? <3 [15:28:44] haha [15:28:55] you're that added layer of human CI I so badly need :) [15:29:35] anything to stop looking at that google docs browser tab [15:29:52] (03CR) 10AntiCompositeNumber: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [15:30:22] (03PS5) 10Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [15:31:01] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) More in detail, I would reduce the choices to a match between python-kubernetes, which we already use in imagecatalog, and kubectl. I started taking a look at how...
[15:32:38] (03PS1) 10Elukey: install_server: add partman recipe kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) [15:33:56] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/759718 [15:34:06] (03CR) 10JHathaway: [C: 03+2] mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: 10JHathaway) [15:37:30] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/759719 [15:37:35] (03PS2) 10Elukey: install_server: add partman recipe kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) [15:38:01] (03CR) 10Muehlenhoff: "General direction looks fine, two questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:41:54] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on Engl [15:41:54] pedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:41:56] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/759722 [15:44:31] (03CR) 10Muehlenhoff: admin: Fully deprecate sc-admins group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) [15:55:01] (03PS1) 10Majavah: beta: 
WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759725 (https://phabricator.wikimedia.org/T289068) [15:55:40] (03PS2) 10JMeybohm: Add ingress support to miscweb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) [15:55:42] (03PS2) 10JMeybohm: miscweb: Remove repeating settings and enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/757936 (https://phabricator.wikimedia.org/T290966) [15:55:44] (03PS1) 10JMeybohm: Enable nodePort 30021 for ingressgateway status [deployment-charts] - 10https://gerrit.wikimedia.org/r/759726 (https://phabricator.wikimedia.org/T290966) [15:55:46] (03PS1) 10JMeybohm: Add ingress.staging switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/759727 (https://phabricator.wikimedia.org/T290966) [15:57:33] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Create Generalised blocking strategy - https://phabricator.wikimedia.org/T270618 (10MoritzMuehlenhoff) > I think we could also take the decision to no bother with this additional complexity and take the stance that if so... [15:58:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (10MoritzMuehlenhoff) Ack, I'll have a closer look over the course of February [15:59:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (10MoritzMuehlenhoff) [16:00:11] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Papaul) This system is using DDR4 32GB DIMM i am not %100 sure that i have those on site but will checked on Monday. 
[16:01:00] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:14] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:46] (03CR) 10Jforrester: [C: 04-2] "This shouldn't be deployed without sign-off from the Security team re. the votewiki inclusion. In general, this should be five (or four) d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:05:45] !log unmask prometheus-mysqld-exporter.service and clean up the old @analytics + wmf_auto_restart units (service+timer) not used anymore on an-coord100[12] [16:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:09] (03CR) 10Elukey: "Had a chat with Filippo, that suggested to use raid1-2devs.cfg + standard.cfg + kubernetes-node-overlay.cfg (that overrides only the speci" [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:10:03] (03PS3) 10Elukey: install_server: add partman recipe kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) [16:10:31] (03CR) 10Stang: "Hi Jdforrester, thanks for let me know and I will fill separate patches. And I wonder how to ask for the approval from the security team?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:10:37] (03PS2) 10Herron: watchrat: check URLs from watchmouse not already covered by icinga [puppet] - 10https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) [16:11:57] (03CR) 10Jforrester: [C: 04-2] Update $wgCrossSiteAJAXdomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:14:23] (03CR) 10Stang: "To make it clear, I need to open four(as add votewiki is not allowed) tasks with #security, is it right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:17:46] (03CR) 10Jforrester: [C: 04-2] Update $wgCrossSiteAJAXdomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:19:11] 10SRE, 10Traffic, 10Upstream: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Vgutierrez) this no longers seems to be related to HTTP/1.0 as the following code also triggers the issue: `lang=php (03CR) 10Herron: watchrat: check URLs from watchmouse not already covered by icinga (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [16:22:20] (03PS4) 10Elukey: install_server: add partman recipe kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) [16:23:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:25:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:25:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:25:38] (03PS1) 10Ebernhardson: Reduce CirrusSearchJVMGCOldPoolFlatlined false positives [alerts] - 10https://gerrit.wikimedia.org/r/759733 [16:27:34] (03CR) 10jerkins-bot: [V: 04-1] Reduce CirrusSearchJVMGCOldPoolFlatlined false positives [alerts] - 10https://gerrit.wikimedia.org/r/759733 (owner: 10Ebernhardson) [16:27:43] (03CR) 10Zabe: [C: 03+1] beta: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759725 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [16:32:02] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:35:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:35:18] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10aborrero) [16:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:45] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10aborrero) Got another one, same package: 
` Err:198 http://mirrors.wikimedia.org/debian bullseye/main amd64 libboost-locale1.74.0 amd64 1.74.0-9... [16:38:46] (03PS1) 10Stang: Update $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) [16:39:09] 10SRE, 10ops-codfw, 10Discovery-Search, 10decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T300946 (10Papaul) [16:39:20] 10SRE, 10ops-codfw, 10Discovery-Search, 10decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T300946 (10Papaul) p:05Medium→03Low [16:39:59] (03Abandoned) 10Stang: Update $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759704 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:40:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [16:41:04] 10SRE, 10ops-codfw, 10Lift-Wing: ml-serve2001 logged a corrected memory error - https://phabricator.wikimedia.org/T299427 (10Papaul) @klausman can we close this now? [16:43:46] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10aborrero) Another, different package this time: ` Err:197 http://mirrors.wikimedia.org/debian bullseye/main amd64 libboost-iostreams1.74-dev amd64 1.74.0-9... [16:45:10] 10SRE, 10ops-codfw, 10Lift-Wing: ml-serve2001 logged a corrected memory error - https://phabricator.wikimedia.org/T299427 (10klausman) 05Open→03Resolved Yes, I think so. Since the reboot, everything has been quiet: `root@ml-serve2001:/sys/devices/system/edac# grep . mc/mc*/*count mc/mc0/ce_count:0 mc/m... 
[16:48:57] !log update add new ferm package ferm_2.5.1-1+wmf11u2 [16:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:48] (03CR) 10Herron: [C: 03+2] watchrat: check URLs from watchmouse not already covered by icinga [puppet] - 10https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [16:53:53] (03PS1) 10JMeybohm: Add label node-role.kubernetes.io/master to masters [puppet] - 10https://gerrit.wikimedia.org/r/759741 (https://phabricator.wikimedia.org/T290967) [16:54:14] (03PS2) 10Jforrester: wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:54:31] (03CR) 10Jforrester: [C: 03+1] "As I said, this really should be four different patches, but this is fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [16:54:50] (03CR) 10Andrew Bogott: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/759480 (https://phabricator.wikimedia.org/T300438) (owner: 10Arturo Borrero Gonzalez) [16:56:10] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [16:58:15] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33582/console" [puppet] - 10https://gerrit.wikimedia.org/r/759741 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [17:00:26] !log add mcrouter 2022.01.31.00-1 to bullseye-wikimedia (T300578) [17:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:31] T300578: build/package mcrouter for Debian Bullseye - https://phabricator.wikimedia.org/T300578 [17:01:57] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10bwang) @jcrespo Hi, is there anything else I need to do here? 
Who is the "service owner" [17:02:41] (03CR) 10Herron: [C: 03+2] watchrat: add http probe alerting with warning severity [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [17:02:43] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] mcrouter: bump to a newer upstream version [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/759703 (owner: 10Arturo Borrero Gonzalez) [17:05:11] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10Ladsgroup) The service owner for this is @thcipriani [17:06:36] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10jcrespo) a:05Ladsgroup→03thcipriani @bwang Yes, as said on T300664#7669287 / marked on the tag column, this is blocked on @thcipriani to approve the req... [17:08:34] (03PS2) 10Ebernhardson: Reduce CirrusSearchJVMGCOldPoolFlatlined false positives [alerts] - 10https://gerrit.wikimedia.org/r/759733 [17:14:09] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jhathaway) a:03jhathaway [17:15:36] 10SRE, 10Wikidata, 10Wikidata Query UI, 10wdwb-tech, 10Patch-For-Review: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) Thanks @elukey I don't know where it came from but suspect as well somebody wanted to test if you can edit that manually or puppet will overwrite it. 
[17:15:42] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:15:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:15:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:15:56] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) Thanks for creating this task Andrew, Just wanted to copy paste the following from the parent task in-case there ar... [17:21:53] (03PS3) 10Ebernhardson: Reduce CirrusSearchJVMGCOldPoolFlatlined false positives [alerts] - 10https://gerrit.wikimedia.org/r/759733 [17:23:28] 10SRE, 10SRE-Access-Requests, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10thcipriani) >>! In T300664#7683665, @jcrespo wrote: > @bwang Yes, as said on T300664#7669287 / marked on the tag column, this is blocked on @thcipriani to a... [17:23:45] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10jcrespo) a:05jcrespo→03None @AndyRussG I've checked the steps, setup and management. It is very similar to Google Search Tools. Regarding the technical setup and user management... 
[17:23:54] (03PS1) 10Dzahn: admin: make Tyler Cipriani the approver for the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) [17:26:02] (03CR) 10Jcrespo: [C: 03+1] "Ok to me, I guess in some special cases there could be other approvers, but this makes sense to me, as long as the affected person is ok w" [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) (owner: 10Dzahn) [17:26:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [17:27:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10jcrespo) a:05thcipriani→03Ladsgroup [17:27:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10jcrespo) [17:28:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:32:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:36:39] (03CR) 10Thcipriani: [C: 03+1] "This patchset seems to align with what's currently happening: I tend to approve these requests." [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) (owner: 10Dzahn) [17:41:46] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10Legoktm) I ran into this as well (from California) when using a dockerfile with a lot of packages (this is from #LibUp's dockerfile). ` FROM docker-registry.wikimedia.org/bulls... [17:43:04] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10Majavah) I can reproduce quite reliably with the following one-liner: ` docker run --rm -it docker-registry.wikimedia.org/bullseye bash -c "apt-get update && apt-get install -y... [17:43:25] (03CR) 10Dzahn: admin: make Tyler Cipriani the approver for the deployment group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) (owner: 10Dzahn) [17:44:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Volans) [17:44:33] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10Majavah) >>! In T300985#7683781, @Majavah wrote: > I can reproduce quite reliably with the following one-liner: > ` > docker run --rm -it docker-registry.wikimedia.org/bullseye... 
[17:47:17] (03CR) 10Jcrespo: [C: 03+1] admin: make Tyler Cipriani the approver for the deployment group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) (owner: 10Dzahn) [17:55:18] (03PS1) 10JMeybohm: Allow to configure a different port for ProxyFetch monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/759749 (https://phabricator.wikimedia.org/T290966) [18:00:39] (03CR) 10Volans: [C: 04-1] "As mentioned by Arzhel this WMF-specific feature should be moved to the wmf-netbox homer plugin." [software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [18:04:55] (03PS7) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [18:06:29] (03CR) 10JMeybohm: [V: 03+1] "Added those manually on main/wikikube clusters. Just FYI" [puppet] - 10https://gerrit.wikimedia.org/r/759741 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [18:08:09] (03CR) 10Ahmon Dancy: ci: Qemu image and snapshot creation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [18:09:37] (03PS1) 10Cwhite: pcc test [puppet] - 10https://gerrit.wikimedia.org/r/759751 [18:12:35] (03PS2) 10Cwhite: pcc test [puppet] - 10https://gerrit.wikimedia.org/r/759751 [18:13:05] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) >>! In T298723#7683702, @jcrespo wrote: > @AndyRussG I've checked the steps, setup and management. It is very similar to Google Search Tools. Regarding the technical setu... 
[18:17:50] (03PS3) 10Cwhite: pcc test [puppet] - 10https://gerrit.wikimedia.org/r/759751 [18:18:34] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [18:31:32] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10jcrespo) >>! In T298723#7683849, @AndyRussG wrote: > Ah ok thanks for taking care with this... Can you describe any more details about these steps? (For Google, it seems that XML... [18:38:08] (03PS1) 10Cwhite: logstash: use java home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) [18:41:11] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/33586/" [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [18:45:45] (03PS4) 10Cwhite: opensearch: use java_home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759751 [18:46:17] (03PS5) 10Cwhite: opensearch: use java_home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) [18:51:32] (03PS6) 10Cwhite: opensearch: use java_home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) [18:52:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices2002-dev.wikimedia.org with OS bullseye [18:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:54] (03PS7) 10Cwhite: opensearch: use java_home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) [18:55:02] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/33589/" [puppet] - 10https://gerrit.wikimedia.org/r/759751 
(https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [18:56:11] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10colewhite) 05Open→03In progress a:03colewhite [19:33:22] (03PS1) 10Ottomata: Include mergedeep in airflow env for use by airflow-dags repo [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759772 [19:33:24] (03PS1) 10Ottomata: Update pip-requirements for 2.1.4-py3.7-2 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759773 [19:33:48] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Include mergedeep in airflow env for use by airflow-dags repo [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759772 (owner: 10Ottomata) [19:35:07] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:42:37] (03PS1) 10Majavah: kubeadm: node-upgrade: ignore empty lines in host list [puppet] - 10https://gerrit.wikimedia.org/r/759776 [19:44:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices2002-dev.wikimedia.org with OS bullseye [19:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:10] legoktm: I think the enwiki category issue fixed itself [19:45:16] left a comment on task [19:45:36] (unless I didn't read the thread right which may very well be the case) [19:49:58] (03PS2) 10Ottomata: Update pip-requirements for 2.1.4-py3.7-2 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759773 [19:54:45] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 67 probes of 649 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:58:06] (03PS2) 
10Aklapper: mariadb: Add script to generate watchlist_count table on labs [puppet] - 10https://gerrit.wikimedia.org/r/375349 (https://phabricator.wikimedia.org/T59617) (owner: 10Jcrespo) [19:58:40] 10Puppet, 10SRE, 10Infrastructure-Foundations: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10Peachey88) [19:59:49] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 53 probes of 649 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:00:08] (03CR) 10Cwhite: [C: 03+1] prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 (owner: 10Filippo Giunchedi) [20:03:29] (03PS1) 10Cwhite: beta-logs: disable gc logging on opensearch instances [puppet] - 10https://gerrit.wikimedia.org/r/759781 [20:11:05] (03CR) 10Cwhite: [C: 03+2] beta-logs: disable gc logging on opensearch instances [puppet] - 10https://gerrit.wikimedia.org/r/759781 (owner: 10Cwhite) [20:14:10] (03PS3) 10Ottomata: Update pip-requirements for 2.1.4-py3.7-2 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759773 [20:16:32] (03PS4) 10Ottomata: Update pip-requirements for 2.1.4-py3.7-2 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759773 [20:25:02] hauskatze: right, I'm just a bit concerned that it shouldn't have happened in the first place [20:25:29] I also think it's like near impossible to figure out why it happened [20:29:15] short of searching through bin logs [20:34:26] legoktm: Well I wouldn't know. I guess it might be some caching or db lag? [20:35:03] or jobqueue [20:35:19] I think it's very unlikely to be db lag, probably caching or some missing invalidation step [20:35:43] can it be fixed in the script? 
[20:35:53] (03CR) 10LMata: "so exciting \o/" [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [20:36:15] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:36:47] yeah probably, I was going to poke at it later today hopefully [20:36:59] the good news is if something is wrong we have a month to fix it [20:37:19] and I'm pretty sure things are more correct now than whatever is wrong :p [20:37:43] (03PS1) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/759783 (https://phabricator.wikimedia.org/T299168) [20:39:05] if for some reason more than a month is needed, we can add a ensure: absent so it does not run anymore for the moment [20:39:12] no need to delete the whole thing [20:58:35] yep [21:05:47] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:27] (03PS2) 10Ryan Kemper: elastic: decom elastic2035 [puppet] - 10https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T294805) [21:22:31] PROBLEM - puppet last run on elastic1055 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:23:15] PROBLEM - puppet last run on elastic1064 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:23:25] PROBLEM - puppet last run on elastic1065 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:24:28] 10SRE, 10Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10Dzahn) There is nothing specifically pending from my side. 
We have recently chatted about Phabricator dashboards, I can see Icinga is done. We have done pwstore and had other onboarding chats. [21:26:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [21:28:53] RECOVERY - puppet last run on elastic1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:29:35] RECOVERY - puppet last run on elastic1064 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:29:45] RECOVERY - puppet last run on elastic1065 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:44:56] 10SRE, 10Wikimedia-Mailing-lists: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (10Zabe) [21:45:16] 10SRE, 10Wikimedia-Mailing-lists: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (10Zabe) [21:51:36] (03CR) 10Ryan Kemper: [C: 03+2] Reduce CirrusSearchJVMGCOldPoolFlatlined false positives [alerts] - 10https://gerrit.wikimedia.org/r/759733 (owner: 10Ebernhardson) [21:53:35] (03Merged) 10jenkins-bot: Reduce CirrusSearchJVMGCOldPoolFlatlined false positives [alerts] - 10https://gerrit.wikimedia.org/r/759733 (owner: 10Ebernhardson) [21:55:24] 10SRE: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (10Legoktm) This is a general #SRE task so I'm removing the mailing lists tag. 
[22:04:52] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:40] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:07:54] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:11:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: (2) Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:11:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:11:31] (CirrusSearchJVMGCOldPoolFlatlined) resolved: (2) Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:12:32] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [22:22:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 229 probes of 649 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:23:16] (03PS1) 10Majavah: proxy: Don't crash on SGE for tools without Kubernetes credentials [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759826 [22:23:36] (03PS2) 10Majavah: proxy: Don't crash on SGE for tools without Kubernetes credentials [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759826 [22:24:21] (03PS3) 10Majavah: 
proxy: Don't crash on SGE for tools without Kubernetes credentials [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759826 (https://phabricator.wikimedia.org/T301015) [22:26:33] (03CR) 10BryanDavis: [C: 03+1] "Untested, but this looks nicer than the "try: ... except Exception: ..." mess that I was poking at in my dev tree. :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759826 (https://phabricator.wikimedia.org/T301015) (owner: 10Majavah) [22:27:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 52 probes of 649 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:27:32] (03CR) 10Majavah: [C: 03+2] proxy: Don't crash on SGE for tools without Kubernetes credentials [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759826 (https://phabricator.wikimedia.org/T301015) (owner: 10Majavah) [22:28:26] (03Merged) 10jenkins-bot: proxy: Don't crash on SGE for tools without Kubernetes credentials [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759826 (https://phabricator.wikimedia.org/T301015) (owner: 10Majavah) [22:30:30] 10SRE, 10SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (10Zabe) [22:35:09] (03PS1) 10Majavah: d/changelog: Prepare for 0.80 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759829 [22:36:55] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [22:36:57] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [22:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] (03CR) 10Majavah: [C: 
03+2] d/changelog: Prepare for 0.80 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759829 (owner: 10Majavah) [22:39:23] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.80 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/759829 (owner: 10Majavah) [23:02:48] !log bking@deployment-puppetmaster04 local commit to public/private repo, see T299797 for more details [23:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:53] T299797: Deploy new elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797 [23:27:41] (03PS2) 10Ladsgroup: admin: make Tyler Cipriani the approver for the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) (owner: 10Dzahn) [23:27:47] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: make Tyler Cipriani the approver for the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/759746 (https://phabricator.wikimedia.org/T300664) (owner: 10Dzahn) [23:38:35] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, 10User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10Ladsgroup) a:05jcrespo→03Ladsgroup Hi, This is pending approval by @SWakiyama [23:42:19] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jhathaway) I can reliably produce the issue, with the following script running on sretest1002: ` #!/bin/bash set -o errexit set -o nounset if [[ $EUID -ne 0 ]]; then printf... 
[23:43:23] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [23:43:25] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [23:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:30] inflatador: fyi the majority of changes to beta cluster are logged in -releng rather than here. [23:44:34] (03CR) 10Ladsgroup: [C: 03+1] "It looks good generally but my only concern is that the old column is not nullable (while it has a default) and stopping writes to it might break " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759725 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [23:46:16] urbanecm thanks for letting me know, will hit it there next time! [23:47:00] np and thanks.