[00:34:44] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:35:24] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:51] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.13 [core] (wmf/1.37.0-wmf.13) - https://gerrit.wikimedia.org/r/703262
[02:06:55] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.13 [core] (wmf/1.37.0-wmf.13) - https://gerrit.wikimedia.org/r/703262 (owner: TrainBranchBot)
[02:28:51] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.13 [core] (wmf/1.37.0-wmf.13) - https://gerrit.wikimedia.org/r/703262 (owner: TrainBranchBot)
[03:27:56] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:31:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 68 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:37:10] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 52 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:28:42] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:42:24] (PS1) Marostegui: mariadb: Move db1124 to m2.
[puppet] - https://gerrit.wikimedia.org/r/703269 (https://phabricator.wikimedia.org/T286042)
[05:21:18] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:25:53] (PS2) ArielGlenn: dumps: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/702933 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[05:29:16] (CR) ArielGlenn: [C: +2] dumps: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/702933 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[06:23:37] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (ArielGlenn) What is the expected length of service interupption for any of these days? I'm looking on the impact on the dumpsdata/snapshot hosts...
[06:31:50] !log Upgrade db1160 kernel
[06:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:25] !log Upgrade db1138 kernel
[06:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:21] !issync
[06:37:21] Syncing #wikimedia-operations (requested by legoktm)
[06:37:23] Set /cs flags #wikimedia-operations litharge +o
[06:45:03] (CR) Muehlenhoff: [C: +2] Add separate role for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/703213 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[06:50:36] !log Upgrade db1122 kernel
[06:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:19] !log installing PHP 7.3 securiy updates on buster
[06:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:06] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 184 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:04] Deploy window No deploys all week! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210706T0700)
[07:06:53] !log Upgrade db1104 kernel
[07:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:00] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:21:54] ops-eqiad, DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (Marostegui)
[07:22:07] ops-eqiad, DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (Marostegui) p:Triage→High a:Kormat
[07:25:21] (PS1) Muehlenhoff: Add library hint for libuv1 [puppet] - https://gerrit.wikimedia.org/r/703348
[07:27:33] (CR) Muehlenhoff: [C: +2] Add library hint for libuv1 [puppet] - https://gerrit.wikimedia.org/r/703348 (owner: Muehlenhoff)
[07:30:07] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) **Usage** Basic usage can be obtained by just defining the Ingress properties in the charts: ` apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: lambdoid spec: rules: - ht...
[07:32:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 69 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:32:02] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) Collection of metrics via prometheus is supported natively, both from envoy and contour.
Using configuration options is also possible to make contour / envoy log in json format, see https://p...
[07:33:22] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) Overall, contour seems built the right way and looks like a promising ingress. I am just wary of using CRDs so much, and of the fact it needs yet another operator to work properly. I think th...
[07:43:48] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 51 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:44:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 63 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:53:25] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.129e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:53:54] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 93 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:59:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 52 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:43:43] !log installing libuv1 security updates on buster
[08:43:50] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:53] <_joe_> !log repooling wdqs1007 now that lag has caught up
[09:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:23] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) Open→Resolved p:Triage→High
[09:03:25] SRE, MW-on-K8s, serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (Joe)
[09:16:55] Puppet, Infrastructure-Foundations, User-jbond: Fix Puppet CA expired certs - https://phabricator.wikimedia.org/T286229 (jbond)
[09:17:05] (CR) Marostegui: [C: +2] mariadb: Move db1124 to m2. [puppet] - https://gerrit.wikimedia.org/r/703269 (https://phabricator.wikimedia.org/T286042) (owner: Marostegui)
[09:18:10] (PS1) Jbond: P:puppetmaster: update NRPE script [puppet] - https://gerrit.wikimedia.org/r/703359 (https://phabricator.wikimedia.org/T286229)
[09:19:31] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30112/console" [puppet] - https://gerrit.wikimedia.org/r/703359 (https://phabricator.wikimedia.org/T286229) (owner: Jbond)
[09:22:39] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:23:19] <_joe_> marostegui: put that database back!
[09:24:41] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:27:53] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:28:02] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:28:02] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:29:22] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:32:56] ^ me
[09:35:09] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:35:17] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:35:17] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:38:11] SRE, Wikimedia-Mailing-lists: Redirect https://lists.wikimedia.org/pipermail/foobar/ to https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ - https://phabricator.wikimedia.org/T285949 (Aklapper) Thanks a lot!
<3
[09:38:17] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:52:57] (PS1) Jbond: P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231)
[09:53:37] (PS1) Giuseppe Lavagetto: mwdebug: add more network egress rules [deployment-charts] - https://gerrit.wikimedia.org/r/703391
[09:53:39] (PS2) Jbond: P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231)
[09:54:21] (PS3) Jbond: P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231)
[09:55:52] (CR) jerkins-bot: [V: -1] P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231) (owner: Jbond)
[09:58:46] (PS4) Jbond: P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231)
[09:59:25] (CR) Jbond: P:configmaster: update disc state to match post dc switch over state (2 comments) [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231) (owner: Jbond)
[10:00:04] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30113/console" [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231) (owner: Jbond)
[10:00:07] (CR) jerkins-bot: [V: -1] P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231) (owner:
Jbond)
[10:02:13] (CR) Giuseppe Lavagetto: [C: +1] "LGTM minus the flake8 violations." [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231) (owner: Jbond)
[10:02:33] SRE, Infrastructure-Foundations: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:04:55] (PS5) Jbond: P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231)
[10:13:06] (CR) Jbond: [C: +2] P:configmaster: update disc state to match post dc switch over state [puppet] - https://gerrit.wikimedia.org/r/703390 (https://phabricator.wikimedia.org/T286231) (owner: Jbond)
[10:18:15] (PS1) Jbond: P:configmaster: update expected status for eventgate-external [puppet] - https://gerrit.wikimedia.org/r/703396 (https://phabricator.wikimedia.org/T286231)
[10:18:50] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/703359 (https://phabricator.wikimedia.org/T286229) (owner: Jbond)
[10:19:09] !log installing jackson-databind security updates on buster
[10:19:15] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: Fix Puppet CA expired certs - https://phabricator.wikimedia.org/T286229 (jbond) The ceritifcate was unused so it has been removed
[10:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:59] (CR) Jbond: [C: +2] P:configmaster: update expected status for eventgate-external [puppet] - https://gerrit.wikimedia.org/r/703396 (https://phabricator.wikimedia.org/T286231) (owner: Jbond)
[10:27:52] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney) I’ve no need to think anything other than what’s in the child tasks at this point.
Having put out feelers externally I’ve some anecdot...
[10:37:26] (CR) Jbond: [V: +1 C: +2] P:puppetmaster: update NRPE script [puppet] - https://gerrit.wikimedia.org/r/703359 (https://phabricator.wikimedia.org/T286229) (owner: Jbond)
[10:41:59] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[10:42:07] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[11:02:35] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 134 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:16:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2071', diff saved to https://phabricator.wikimedia.org/P16767 and previous config saved to /var/cache/conftool/dbconfig/20210706-111635-marostegui.json
[11:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2071 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16768 and previous config saved to /var/cache/conftool/dbconfig/20210706-111731-root.json
[11:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:44] SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (MoritzMuehlenhoff)
[11:32:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2071 (re)pooling @ 50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16769 and previous config saved to /var/cache/conftool/dbconfig/20210706-113235-root.json
[11:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:17] PROBLEM - Puppet CA expired certs on
puppetmaster1001 is CRITICAL: CRITICAL: 1869 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[11:35:12] <_joe_> uhm
[11:35:19] <_joe_> 1869?
[11:35:24] <_joe_> that seems a bit much
[11:47:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2071 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16770 and previous config saved to /var/cache/conftool/dbconfig/20210706-114739-root.json
[11:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:49] _joe_: looking i made a change to the check so probably broke it
[11:53:51] (PS1) Jbond: P:puppetmaster: fix expiry check [puppet] - https://gerrit.wikimedia.org/r/703404
[11:55:54] (CR) Jbond: [C: +2] P:puppetmaster: fix expiry check [puppet] - https://gerrit.wikimedia.org/r/703404 (owner: Jbond)
[11:56:33] omg
[11:56:43] sometimes the emoji substitution is just over the top
[11:56:59] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 143 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118', diff saved to https://phabricator.wikimedia.org/P16771 and previous config saved to /var/cache/conftool/dbconfig/20210706-115732-marostegui.json
[11:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:43] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[11:57:52] https://share.riseup.net/#aRHZo6TEnddrZm02XnZezA
[11:57:58] * jbond lunch
[11:58:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2072', diff saved to https://phabricator.wikimedia.org/P16772 and previous config
saved to /var/cache/conftool/dbconfig/20210706-115820-marostegui.json
[11:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:43] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:02:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2071 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16773 and previous config saved to /var/cache/conftool/dbconfig/20210706-120242-root.json
[12:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:15] PROBLEM - dump of s4 in codfw on alert1001 is CRITICAL: dump for s4 at codfw taken more than 8 days ago: Most recent backup 2021-06-28 12:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:07:29] PROBLEM - dump of s4 in eqiad on alert1001 is CRITICAL: dump for s4 at eqiad taken more than 8 days ago: Most recent backup 2021-06-28 12:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:08:51] PROBLEM - dump of es5 in eqiad on alert1001 is CRITICAL: dump for es5 at eqiad taken more than 8 days ago: Most recent backup 2021-06-28 12:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:09:49] (PS1) Aklapper: phabricator weekly changes email: Fix query for cookie-licked tasks [puppet] - https://gerrit.wikimedia.org/r/703427 (https://phabricator.wikimedia.org/T286181)
[12:10:27] PROBLEM - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw taken more than 8 days ago: Most recent backup 2021-06-28 12:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:10:35] PROBLEM - dump of es5 in codfw on alert1001 is CRITICAL: dump for es5 at codfw taken more than 8 days ago: Most recent
backup 2021-06-28 12:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:14:11] (CR) Aklapper: "Great... Thanks to running "git checkout -b T286181 origin/master" as usual, but having this repo using "origin/production" instead to con" [puppet] - https://gerrit.wikimedia.org/r/703427 (https://phabricator.wikimedia.org/T286181) (owner: Aklapper)
[12:14:26] (Abandoned) Aklapper: phabricator weekly changes email: Fix query for cookie-licked tasks [puppet] - https://gerrit.wikimedia.org/r/703427 (https://phabricator.wikimedia.org/T286181) (owner: Aklapper)
[12:14:50] marostegui: fyi ^^
[12:17:00] (PS1) Aklapper: phabricator weekly changes email: Fix query for cookie-licked tasks [puppet] - https://gerrit.wikimedia.org/r/703428 (https://phabricator.wikimedia.org/T286181)
[12:17:15] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-06-28 12:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:17:57] (CR) Aklapper: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/703428" [puppet] - https://gerrit.wikimedia.org/r/703427 (https://phabricator.wikimedia.org/T286181) (owner: Aklapper)
[12:22:36] jbond: the dumps issue?
[12:22:54] marostegui: yes
[12:23:20] jbond: thanks, they might recover soon automatically, sometimes it takes a bit longer than expected, I will monitor the issue
[12:23:30] ack
[12:26:59] (PS1) Jbond: initial commit [debs/cfssl] - https://gerrit.wikimedia.org/r/703431
[12:27:37] (CR) Jbond: [V: +2 C: +2] initial commit [debs/cfssl] - https://gerrit.wikimedia.org/r/703431 (owner: Jbond)
[12:34:12] SRE, Continuous-Integration-Infrastructure, Patch-For-Review: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (kostajh) I've posted a few (mostly untested) patches today, here's the summary (cc @Tgr, @hashar, @Jdforrester-WMF, @awight): *...
[12:36:34] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (ArielGlenn) >>! In T284592#7198669, @cmooney wrote: > I’ve no reason to think anything other than what’s in the child tasks at this point. Havi...
[12:52:39] SRE, ops-eqiad, User-fgiunchedi: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (jbond) I have disabled notifications for this host in icinga to stop it appearing in the results for "Ensure hosts are not performing a change on every puppet run"
[12:56:52] PROBLEM - SSH on logstash1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:07:02] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: add more network egress rules [deployment-charts] - https://gerrit.wikimedia.org/r/703391 (owner: Giuseppe Lavagetto)
[13:10:03] (Merged) jenkins-bot: mwdebug: add more network egress rules [deployment-charts] - https://gerrit.wikimedia.org/r/703391 (owner: Giuseppe Lavagetto)
[13:14:57] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2072 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16774 and previous config saved to /var/cache/conftool/dbconfig/20210706-131537-root.json
[13:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:23] RECOVERY - SSH on logstash1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:27:02] <_joe_> uhm what's up with git pull charts
[13:28:13] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:15] (CR) Jbond: "updated thanks" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[13:28:34] (PS3) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[13:29:15] PROBLEM
- SSH on logstash1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:29:41] (PS6) Ottomata: Add gobblin_job define and declare first gobblin job in hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/702430 (https://phabricator.wikimedia.org/T271232)
[13:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2072 (re)pooling @ 50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16775 and previous config saved to /var/cache/conftool/dbconfig/20210706-133041-root.json
[13:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:46] (CR) jerkins-bot: [V: -1] sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[13:31:53] (CR) Ottomata: [C: +2] Add gobblin_job define and declare first gobblin job in hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/702430 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:32:03] (PS1) Giuseppe Lavagetto: mwdebug: bump opcache size and n. of files [deployment-charts] - https://gerrit.wikimedia.org/r/703435 (https://phabricator.wikimedia.org/T280497)
[13:34:12] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: bump opcache size and n.
of files [deployment-charts] - https://gerrit.wikimedia.org/r/703435 (https://phabricator.wikimedia.org/T280497) (owner: Giuseppe Lavagetto)
[13:36:29] (PS1) Joal: Bump AQS druid snapshot to 2021-06 [puppet] - https://gerrit.wikimedia.org/r/703436
[13:36:41] ottomata: for when you have a minute --^
[13:37:11] (CR) Ottomata: [C: +2] Bump AQS druid snapshot to 2021-06 [puppet] - https://gerrit.wikimedia.org/r/703436 (owner: Joal)
[13:40:16] (PS1) Ottomata: gobblin_job - fix typo in path to default jobconfig file [puppet] - https://gerrit.wikimedia.org/r/703437 (https://phabricator.wikimedia.org/T271232)
[13:41:53] (CR) Ottomata: [C: +2] gobblin_job - fix typo in path to default jobconfig file [puppet] - https://gerrit.wikimedia.org/r/703437 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:43:17] RECOVERY - dump of es5 in codfw on alert1001 is OK: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2021-07-06 00:00:01 (1916 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[13:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2072 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16776 and previous config saved to /var/cache/conftool/dbconfig/20210706-134545-root.json
[13:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:11] (PS1) Ottomata: gobblin_job - set PYTHONPATH in environment [puppet] - https://gerrit.wikimedia.org/r/703439 (https://phabricator.wikimedia.org/T271232)
[13:49:05] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart
[13:49:05] !log otto@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99)
[13:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:18] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart
[13:49:23] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:45] (CR) Ottomata: [C: +2] gobblin_job - set PYTHONPATH in environment [puppet] - https://gerrit.wikimedia.org/r/703439 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:49:51] (PS2) Ottomata: gobblin_job - set PYTHONPATH in environment [puppet] - https://gerrit.wikimedia.org/r/703439 (https://phabricator.wikimedia.org/T271232)
[13:49:57] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2021-07-06 00:00:01 (1938 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[13:50:33] (CR) Ottomata: [V: +2 C: +2] gobblin_job - set PYTHONPATH in environment [puppet] - https://gerrit.wikimedia.org/r/703439 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:53:41] !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
[13:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:26] RECOVERY - SSH on logstash1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:57:45] PROBLEM - SSH on logstash1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2072 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16777 and previous config saved to /var/cache/conftool/dbconfig/20210706-140049-root.json
[14:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:39] (PS1) Jbond: O:apereo_cas: move WMF specific code to profile::idp [puppet] - https://gerrit.wikimedia.org/r/703442
[14:04:55] RECOVERY - SSH on logstash1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:09:02] PROBLEM - SSH on logstash1009 is
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:10:19] RECOVERY - SSH on logstash1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:14:08] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/703442 (owner: Jbond)
[14:14:57] PROBLEM - SSH on logstash1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:30:25] RECOVERY - SSH on logstash1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:30:59] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_5@production-logstash-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:41] (PS1) Jbond: initial commit [puppet-apereo_cas] - https://gerrit.wikimedia.org/r/703446
[14:33:02] (PS2) Jbond: initial commit [puppet-apereo_cas] - https://gerrit.wikimedia.org/r/703446
[14:33:16] (CR) Jbond: [V: +2 C: +2] initial commit [puppet-apereo_cas] - https://gerrit.wikimedia.org/r/703446 (owner: Jbond)
[14:34:31] (CR) Jbond: [C: +2] O:apereo_cas: move WMF specific code to profile::idp [puppet] - https://gerrit.wikimedia.org/r/703442 (owner: Jbond)
[14:37:15] (PS1) Jbond: P:idp: move tomecat specific config to profile [puppet] - https://gerrit.wikimedia.org/r/703447
[14:42:47] RECOVERY - dump of es5 in eqiad on alert1001 is OK: Last dump for es5 at eqiad (es1025.eqiad.wmnet) taken on 2021-07-06 00:00:01 (1916 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[14:44:33] RECOVERY - dump of es4 in codfw on alert1001 is OK: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2021-07-06 00:00:01 (1938 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[14:55:43] (CR)
10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30115/console" [puppet] - 10https://gerrit.wikimedia.org/r/703447 (owner: 10Jbond) [14:58:46] (03PS1) 10Giuseppe Lavagetto: wmdebug: fix networkpolicy ports definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/703449 [15:00:50] (03PS2) 10Giuseppe Lavagetto: mwdebug: fix networkpolicy ports definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/703449 [15:00:57] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 66 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:02:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: move tomecat specific config to profile [puppet] - 10https://gerrit.wikimedia.org/r/703447 (owner: 10Jbond) [15:12:45] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 58 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:22:32] (03PS1) 10Jbond: O:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [15:23:33] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 41 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:23:50] (03CR) 10jerkins-bot: [V: 04-1] O:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [15:39:58] RECOVERY - dump of s4 in codfw on alert1001 is OK: Last dump for s4 at codfw 
(db2099.codfw.wmnet:3314) taken on 2021-07-06 00:00:01 (382 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:43:11] RECOVERY - dump of s4 in eqiad on alert1001 is OK: Last dump for s4 at eqiad (db1145.eqiad.wmnet:3314) taken on 2021-07-06 00:00:02 (382 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:44:27] (03PS1) 10Ottomata: Remove camus webrequest job in analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/703457 (https://phabricator.wikimedia.org/T271232) [15:48:37] !log otto@deploy1002 Started deploy [analytics/refinery@a8e79f3] (hadoop-test): analytics test cluster deploy for webrequest_test gobblin job migration [15:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:02] !log otto@deploy1002 Finished deploy [analytics/refinery@a8e79f3] (hadoop-test): analytics test cluster deploy for webrequest_test gobblin job migration (duration: 05m 24s) [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:01] (03CR) 10Ottomata: [C: 03+2] Remove camus webrequest job in analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/703457 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [16:04:59] (03PS2) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [16:06:25] (03CR) 10jerkins-bot: [V: 04-1] C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [16:14:35] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 128 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:22:34] (03CR) 10Ladsgroup: [C: 03+1] "SGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/701658 (https://phabricator.wikimedia.org/T285361) (owner: 10Ladsgroup) [16:23:52] (03PS1) 10Ottomata: eventgate now uses prometheus directly instead of statsd bridge [deployment-charts] - 10https://gerrit.wikimedia.org/r/703463 (https://phabricator.wikimedia.org/T272714) [16:23:58] (03CR) 10Ladsgroup: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218) (owner: 10Legoktm) [16:24:47] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:33:55] Amir1: woot, rolling both out now [16:34:08] (03CR) 10Legoktm: [C: 03+2] mailman: Enable verp probes [puppet] - 10https://gerrit.wikimedia.org/r/701658 (https://phabricator.wikimedia.org/T285361) (owner: 10Ladsgroup) [16:34:21] legoktm: thanks for the work <3 [16:34:24] (03PS3) 10Legoktm: mailman3: Discard all mails with a X-Spam-Score >= 6 [puppet] - 10https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218) [16:35:40] (03CR) 10Legoktm: [C: 03+2] mailman3: Discard all mails with a X-Spam-Score >= 6 [puppet] - 10https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218) (owner: 10Legoktm) [16:39:53] (03PS3) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [16:41:16] (03CR) 10jerkins-bot: [V: 04-1] C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [16:42:26] !log joal@deploy1002 Started deploy [analytics/refinery@419d1f0]: Analytics deploy for Gobblin replacing Camus [analytics/refinery@419d1f0] [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:13] (03PS4) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 
(https://phabricator.wikimedia.org/T205618) [16:44:31] (03CR) 10jerkins-bot: [V: 04-1] C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [16:46:41] (03PS5) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [16:48:00] (03CR) 10jerkins-bot: [V: 04-1] C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [16:51:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30122/console" [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [16:52:27] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) Output from the above patches (where possible anyway) can be seen here: https://gerrit.wikimedia.org/r/c/mediawiki/core... [17:03:09] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:03:21] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:03:22] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={delete,get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:04:08] erm, ^ is that an issue _joe_? 
[17:04:19] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:04:47] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 54 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:05:15] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:07:11] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:08:09] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:08:53] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:12:39] <_joe_> legoktm: it just means that creating new objects in k8s is slow, which in a normal situation shouldn't be an issue [17:12:56] ok [17:13:09] what would have caused it? 
[17:14:29] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 124 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:19:25] !log joal@deploy1002 Finished deploy [analytics/refinery@419d1f0]: Analytics deploy for Gobblin replacing Camus [analytics/refinery@419d1f0] (duration: 36m 59s) [17:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:47] !log joal@deploy1002 Started deploy [analytics/refinery@419d1f0] (thin): Analytics deploy for Gobblin replacing Camus - THIN [analytics/refinery@419d1f0] [17:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:54] !log joal@deploy1002 Finished deploy [analytics/refinery@419d1f0] (thin): Analytics deploy for Gobblin replacing Camus - THIN [analytics/refinery@419d1f0] (duration: 00m 07s) [17:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:24] !log joal@deploy1002 Started deploy [analytics/refinery@419d1f0] (hadoop-test): Analytics deploy for Gobblin replacing Camus - HADOOP-TEST [analytics/refinery@419d1f0] [17:20:25] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:55] !log joal@deploy1002 Finished deploy [analytics/refinery@419d1f0] (hadoop-test): Analytics deploy for Gobblin replacing Camus - HADOOP-TEST [analytics/refinery@419d1f0] (duration: 05m 31s) [17:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:53] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Enable 
verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Legoktm) 05Open→03Resolved a:03Ladsgroup [17:39:50] (03PS1) 10Ladsgroup: dumps: Migrate kiwix update cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/703470 (https://phabricator.wikimedia.org/T273673) [17:45:45] (03PS1) 10Ottomata: Declare analytics webrequest and netflow gobblin jobs [puppet] - 10https://gerrit.wikimedia.org/r/703471 (https://phabricator.wikimedia.org/T271232) [17:47:13] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30123/console" [puppet] - 10https://gerrit.wikimedia.org/r/703471 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [17:47:53] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:48:01] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Declare analytics webrequest and netflow gobblin jobs [puppet] - 10https://gerrit.wikimedia.org/r/703471 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [17:48:28] (03PS1) 10Ottomata: Fix comment in job/gobblin.pp [puppet] - 10https://gerrit.wikimedia.org/r/703472 [17:48:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix comment in job/gobblin.pp [puppet] - 10https://gerrit.wikimedia.org/r/703472 (owner: 10Ottomata) [17:54:55] (03PS1) 10Ottomata: Ensure absent webrequest and netflow camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/703474 (https://phabricator.wikimedia.org/T271232) [17:56:52] (03CR) 10Ottomata: [C: 03+2] "Merging. Will only deploy to staging clusters for now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/703463 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [17:57:21] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 23654 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:00:18] (03PS1) 10Ladsgroup: Set $wgIncludejQueryMigrate to false in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703476 (https://phabricator.wikimedia.org/T280944) [18:03:44] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [18:03:44] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [18:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:24] (03PS1) 10Ottomata: eventgate-analytics/values-staging.yaml - set num_workers: 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/703477 (https://phabricator.wikimedia.org/T272714) [18:19:22] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics/values-staging.yaml - set num_workers: 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/703477 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [18:32:07] (03PS1) 10Ottomata: eventgate - Allow for setting num_workers: 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/703478 [18:32:46] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - Allow for setting num_workers: 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/703478 (owner: 10Ottomata) [18:34:38] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[18:34:38] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [18:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:48] I'm messing with mwdebug2001 to test T285919 [18:51:49] T285919: Allow links to dag.wikipedia.org from Wikidata - https://phabricator.wikimedia.org/T285919 [19:50:38] 10SRE, 10MediaWiki-Cache: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Ladsgroup) [19:50:47] _joe_: ^ have fun [19:54:24] 10SRE, 10MediaWiki-Cache: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Joe) Clearing acpu is as easy as doing a rolling restart of the cluster. But I think we should first fix the BagOfStuff implementation and/or all the... [19:54:31] <_joe_> Amir1: awesome [19:54:43] <_joe_> Amir1: what were we saying today about unit testing? 
[19:55:05] or lack thereof :D [20:04:28] (03PS1) 10Tks4Fish: zhwiktionary: Add namespaces: *118 - Reconstruction *119 - Reconstruction Talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703480 (https://phabricator.wikimedia.org/T286101) [20:04:30] (03PS1) 10Tks4Fish: zhwiktionary: Add aliases for namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703481 (https://phabricator.wikimedia.org/T286101) [20:04:33] (03PS1) 10Tks4Fish: zhwiktionary: Add templateeditor right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703482 (https://phabricator.wikimedia.org/T286101) [20:09:52] 10SRE, 10MediaWiki-Cache, 10Patch-For-Review: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Krinkle) I'd say the exptime parameter as timestamp is obscure and something I've not seen even once being used in the past ten... [20:10:19] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Patch-For-Review: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Krinkle) [20:45:20] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Patch-For-Review: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Ladsgroup) While this is clearly the way to do but I'm slightly worried by deploying this we will caus... 
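Note on T286260 above: the ticket title describes entries expiring in 2073 rather than in an hour when exptime is passed as a unix timestamp. The arithmetic can be sketched as follows, assuming the cache backend treats the value as a relative TTL in seconds and adds it to the current time. The function names and the ten-year cutoff are hypothetical illustrations, not the actual MediaWiki code or the patch under review.

```python
from datetime import datetime, timezone

# Hypothetical cutoff: values larger than this are assumed to be absolute
# unix timestamps rather than relative TTLs in seconds.
TEN_YEARS = 10 * 365 * 24 * 3600


def naive_expiry(now: int, exptime: int) -> int:
    """Treat exptime as a relative TTL unconditionally (the assumed bug)."""
    return now + exptime


def normalized_ttl(now: int, exptime: int) -> int:
    """Convert an absolute timestamp to a relative TTL before storing."""
    if exptime > TEN_YEARS:
        return max(exptime - now, 0)
    return exptime


# Time of the ticket's first appearance in the log (2021-07-06 19:50:38 UTC).
now = int(datetime(2021, 7, 6, 19, 50, 38, tzinfo=timezone.utc).timestamp())
one_hour_from_now = now + 3600  # caller passes an absolute timestamp

buggy = naive_expiry(now, one_hour_from_now)
buggy_year = datetime.fromtimestamp(buggy, tz=timezone.utc).year
fixed_ttl = normalized_ttl(now, one_hour_from_now)

print(buggy_year)  # year the entry would actually expire
print(fixed_ttl)   # TTL in seconds after normalization
```

Adding a 2021 timestamp to the current 2021 time roughly doubles the seconds since 1970, which is where the year 2073 comes from. Normalizing before the store call leaves small relative TTLs unchanged.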
[21:06:50] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:14:56] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:15:54] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Idle - Telia, AS1299/IPv4: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:21:43] 10SRE, 10MediaWiki-Cache, 10Platform Engineering, 10Patch-For-Review: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (10Krinkle) Right, but changing ApcuBag first will do the same thing as well. Perhaps a safer first step... [21:22:36] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:23:30] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 319, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:25:57] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 423, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:29:09] (03PS1) 10Ottomata: Bump eventgate image version to get normalized prometheus metric labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/703487 (https://phabricator.wikimedia.org/T272714) [21:55:28] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:10:52] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 66, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:43:57] (03PS1) 10Legoktm: Re-enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) [22:45:04] (03CR) 10jerkins-bot: [V: 04-1] Re-enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [22:46:57] (03PS2) 10Legoktm: Re-enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) [22:47:28] (03PS1) 10H.krishna123: [WIP] web_app: Creating skeleton code for frontend, and static files [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) [22:50:51] (03CR) 10Ladsgroup: [C: 04-1] Re-enable Score using Shellbox on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [22:51:54] (03PS1) 10Legoktm: services_proxy: Add envoyproxy for shellbox [puppet] - 10https://gerrit.wikimedia.org/r/703491 (https://phabricator.wikimedia.org/T281423) [22:53:26] (03CR) 10Legoktm: Re-enable Score using Shellbox on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [22:53:41] (03PS3) 10Legoktm: Re-enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) [22:56:43] (03PS2) 10Legoktm: services_proxy: Add envoyproxy for shellbox [puppet] - 10https://gerrit.wikimedia.org/r/703491 (https://phabricator.wikimedia.org/T281423) [22:57:34] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30125/console" [puppet] - 10https://gerrit.wikimedia.org/r/703491 
(https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [22:59:26] (03PS3) 10Legoktm: services_proxy: Add envoyproxy for shellbox [puppet] - 10https://gerrit.wikimedia.org/r/703491 (https://phabricator.wikimedia.org/T281423) [23:12:17] I'm hacking on mwdebug2001 [23:17:13] https://test.wikipedia.org/wiki/Score has real scores again [23:17:26] (03PS4) 10Legoktm: Re-enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) [23:21:52] (03CR) 10Legoktm: Re-enable Score using Shellbox on testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [23:22:29] (03CR) 10DannyS712: Re-enable Score using Shellbox on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [23:34:32] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:34:48] (03PS5) 10Legoktm: Re-enable Score using Shellbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) [23:34:50] (03PS1) 10Legoktm: Document $wgShellboxSecretKey in private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703495 [23:34:52] (03PS1) 10Legoktm: Add Shellbox to {Production,Labs}Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703496 (https://phabricator.wikimedia.org/T281423) [23:35:27] (03CR) 10Legoktm: Re-enable Score using Shellbox on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [23:45:53] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (10Platonides) 05Open→03Resolved I have actually removed 
those two print() statements (some debugging, it seems), so it doesn't produce any output. List is now... [23:48:20] (03PS1) 10Legoktm: lists: Redirect /mailman/options/ too [puppet] - 10https://gerrit.wikimedia.org/r/703497 (https://phabricator.wikimedia.org/T286267) [23:50:05] (03CR) 10Legoktm: [C: 03+2] lists: Redirect /mailman/options/ too [puppet] - 10https://gerrit.wikimedia.org/r/703497 (https://phabricator.wikimedia.org/T286267) (owner: 10Legoktm) [23:52:04] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Redirect old /mailman/options/ urls - https://phabricator.wikimedia.org/T286267 (10Legoktm) 05Open→03Resolved a:03Legoktm ` km@cashew ~> curl -I "https://lists.wikimedia.org/mailman/options/daily-image-l" HTTP/1.1 301 Moved Permanently Date: T...