[00:03:20] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:25:22] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:59:54] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[01:24:38] (CR) Legoktm: "> Patch Set 2:" (1 comment) [software/mailman-templates] - https://gerrit.wikimedia.org/r/685533 (owner: Legoktm)
[01:24:42] (PS3) Legoktm: Add qqq [software/mailman-templates] - https://gerrit.wikimedia.org/r/685533
[01:24:44] (PS1) Legoktm: Don't translate blank templates [software/mailman-templates] - https://gerrit.wikimedia.org/r/699981
[01:24:46] (PS1) Legoktm: Use one variable syntax, remove tabs [software/mailman-templates] - https://gerrit.wikimedia.org/r/699982
[01:26:08] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:43:44] (CR) Legoktm: [V: +2 C: +2] Don't translate blank templates [software/mailman-templates] - https://gerrit.wikimedia.org/r/699981 (owner: Legoktm)
[01:44:11] (CR) Legoktm: [V: +2 C: +2] Use one variable syntax, remove tabs [software/mailman-templates] - https://gerrit.wikimedia.org/r/699982 (owner: Legoktm)
[01:44:52] SRE, Wikimedia-Mailing-lists, translatewiki.net, Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (Legoktm) Pushed some new
commits, I think I've addressed all the feedback so far.
[03:26:04] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[04:59:22] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[05:04:27] (PS1) Marostegui: dbproxy1019: Depool clouddb1014. [puppet] - https://gerrit.wikimedia.org/r/699987
[05:05:36] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1014. [puppet] - https://gerrit.wikimedia.org/r/699987 (owner: Marostegui)
[05:06:56] !log Upgrade clouddb1014
[05:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:48] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1014." [puppet] - https://gerrit.wikimedia.org/r/699864
[05:09:34] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1014."
[puppet] - https://gerrit.wikimedia.org/r/699864 (owner: Marostegui)
[06:43:44] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:52:11] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[06:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:48] (PS1) QChris: Add .gitreview [debs/istioctl] - https://gerrit.wikimedia.org/r/700011
[06:53:50] (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/istioctl] - https://gerrit.wikimedia.org/r/700011 (owner: QChris)
[06:55:50] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210616T0700)
[07:07:27] qchris: o/ thanks a lot
[07:07:43] yw :)
[07:09:29] (PS1) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:09:39] (PS5) Ema: varnish: add timing data to varnishmtail [puppet] - https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576)
[07:09:41] (PS2) Ema: varnish: add prometheus histogram varnish_processing_seconds [puppet] - https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576)
[07:10:21] (PS2) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:12:58] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:14:33] (PS3) Elukey: Add initial debianization for istioctl 1.6.14
[debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:27:26] !log cleanup old /var/log/airflow/scheduler logs to reclaim space on an-airflow1001
[07:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:49] (CR) Muehlenhoff: "Looks good, a few nits inline." (5 comments) [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:35:13] SRE, Traffic, netops, Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (cmooney) Open→Resolved
[07:45:00] (PS4) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:45:33] (CR) Elukey: "Thanks a lot for the review! Tried to fix all the comments, hope it is better now :)" (5 comments) [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:48:50] (CR) Muehlenhoff: [C: +1] "Ship it :-)" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:59:51] (CR) JMeybohm: [C: -1] "I think you'll want to append the istio version (or at least major and minor) to the debian package name as well as the filename of the bi" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:03:30] (PS1) Jcrespo: Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980)
[08:03:48] (CR) jerkins-bot: [V: -1] Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865
(https://phabricator.wikimedia.org/T284980) (owner: Jcrespo)
[08:06:25] (PS2) Jcrespo: Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980)
[08:09:42] (PS3) Jcrespo: Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980)
[08:09:58] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:10] (CR) Jcrespo: [C: +2] Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980) (owner: Jcrespo)
[08:19:15] (CR) Filippo Giunchedi: "LGTM overall, just an observation inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576) (owner: Ema)
[08:24:19] !log running "update flaggedrevs set fr_quality = 0 where fr_quality != 0;" on all wikis where flagged revs is enabled (T279761)
[08:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:23] T279761: When reviewing pending changes, raw message ID "⧼revreview-hist-quality⧽" shown instead of human readable string - https://phabricator.wikimedia.org/T279761
[08:28:08] (PS1) Jbond: gitlab: add ability to only listen on port 80 [puppet] - https://gerrit.wikimedia.org/r/700020
[08:29:31] (CR) Jbond: [C: +2] gitlab: add ability to only listen on port 80 [puppet] - https://gerrit.wikimedia.org/r/700020 (owner: Jbond)
[08:29:40] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:31:12] jobrunner issues
[08:31:53] the maxlag is going up but that's sorta expected
[08:31:54] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=6&orgId=1
[08:32:02] it'll finish soon
[08:32:31] currently at ukwiki
[08:32:57] done
[08:33:24] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:38:04] (PS1) Jbond: gitlab: add listen_port and listent_https to template [puppet] - https://gerrit.wikimedia.org/r/700023
[08:41:16] (CR) Giuseppe Lavagetto: [C: +1] mwdebug: include nutcracker and mcrouter pools in values [deployment-charts] - https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: Effie Mouzeli)
[08:41:26] (CR) Jbond: [C: +2] gitlab: add listen_port and listent_https to template [puppet] - https://gerrit.wikimedia.org/r/700023 (owner: Jbond)
[08:49:57] (CR) Elukey: "> Patch Set 4: Code-Review-1" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:53:22] (PS4) Kormat: mariadb: Automatically manage pt-heartbeat.
[puppet] - https://gerrit.wikimedia.org/r/699213
[08:59:26] (PS4) Kormat: mariadb: Promote db1157 as s3 primary [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648)
[09:01:58] (CR) JMeybohm: [C: +1] "> Patch Set 4:" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:03:26] (CR) Elukey: "> Patch Set 4: Code-Review+1" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:04:05] !log uploaded wmfmariadbpy 0.7.1 to apt.wm.o
[09:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:20] !log Deploying wmfmariadbpy 0.7.1 T284819
[09:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:23] T284819: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819
[09:07:21] (CR) JMeybohm: [C: +1] "> Patch Set 4:" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:10:42] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:37] SRE: debdeploy does not support bullseye - https://phabricator.wikimedia.org/T285034 (Kormat)
[09:21:02] (PS2) Kormat: wmnet: Update s3-master to db1157 [dns] - https://gerrit.wikimedia.org/r/698982 (https://phabricator.wikimedia.org/T284648)
[09:23:02] jouncebot: next
[09:23:02] In 21 hour(s) and 36 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210617T0700)
[09:23:30] (PS1) Kormat: Revert "db-eqiad.php: Set pc1010 as pc3 primary."
[mediawiki-config] - https://gerrit.wikimedia.org/r/700026
[09:24:16] (PS1) Muehlenhoff: Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034)
[09:25:52] (PS2) Muehlenhoff: Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034)
[09:26:15] (CR) Kormat: [C: +1] Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034) (owner: Muehlenhoff)
[09:27:35] (CR) Muehlenhoff: [C: +2] Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034) (owner: Muehlenhoff)
[09:28:32] (CR) Hnowlan: [C: +1] maps: fix SQL modules paths in import script [puppet] - https://gerrit.wikimedia.org/r/695240 (owner: MSantos)
[09:28:39] (CR) Hnowlan: [C: +2] maps: fix SQL modules paths in import script [puppet] - https://gerrit.wikimedia.org/r/695240 (owner: MSantos)
[09:28:45] (PS1) Kormat: pc1010: Move back to pc1, now that the maintenance is done. [puppet] - https://gerrit.wikimedia.org/r/700047 (https://phabricator.wikimedia.org/T282761)
[09:29:51] SRE, Patch-For-Review: debdeploy does not support bullseye - https://phabricator.wikimedia.org/T285034 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Fixed. I ran puppet on cumin hosts, should work now.
[09:31:53] * apergos peeks in
[09:31:54] same issue?
[09:31:58] i just got multipaged
[09:31:58] got paged
[09:32:03] expired downtime?
[09:32:08] possibly
[09:32:12] it's splunk that didn't resolve the incidents?
[09:32:23] mmhh yeah I suspect that's right volans
[09:32:24] ah could be
[09:32:26] let me see if there is any impact
[09:32:31] as there is nothing on irc
[09:33:00] I acked all of them
[09:33:03] I mean I just can't believe Telia has a second outage in as many days etc.
[09:33:16] yup no actual issue I can see
[09:33:53] sobanski: we need to resolve them though
[09:34:06] or they will page again tomorrow AFAIK, godog correct me if I'm wrong
[09:34:11] heh
[09:34:24] and then we will all be here again saying "a third outage? wtf" :-D
[09:34:37] volans: sure thing, as soon as we're confident there's no actual problem :)
[09:35:42] volans: that's right yeah, +1 to resolve IMHO
[09:36:03] sobanski: I'm pretty sure it's not but let's wait for XioNoX and topranks to confirm
[09:36:15] I've not found any issue thus far.
[09:36:40] yeah no user-facing problem, so I'd say it's safe to resolve
[09:36:46] Telia link looks ok
[09:37:33] I see nada as well
[09:37:49] (CR) Kormat: [C: +2] Revert "db-eqiad.php: Set pc1010 as pc3 primary." [mediawiki-config] - https://gerrit.wikimedia.org/r/700026 (owner: Kormat)
[09:38:24] yeah I'm happy enough that eqsin has not been affected network wise
[09:38:48] (Merged) jenkins-bot: Revert "db-eqiad.php: Set pc1010 as pc3 primary." [mediawiki-config] - https://gerrit.wikimedia.org/r/700026 (owner: Kormat)
[09:40:37] !log kormat@deploy1002 Synchronized wmf-config/db-eqiad.php: Repool pc1009 as pc3 primary T282761 (duration: 00m 59s)
[09:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:42] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[09:44:10] (CR) Kormat: [C: +2] pc1010: Move back to pc1, now that the maintenance is done.
[puppet] - https://gerrit.wikimedia.org/r/700047 (https://phabricator.wikimedia.org/T282761) (owner: Kormat)
[09:46:20] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:47:20] !log truncating all pc* tables on pc1010 T282761
[09:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:24] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[09:50:07] !log disabling puppet on maps1* to reparent maps1007 from new master maps1009
[09:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[09:51:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[09:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1007.eqiad.wmnet
[09:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:42] (CR) Hnowlan: [V: +1 C: +2] maps: make maps1007 a buster replica of the new imposm cluster [puppet] - https://gerrit.wikimedia.org/r/699782 (https://phabricator.wikimedia.org/T269582) (owner: Hnowlan)
[09:58:24] hnowlan: FYI that downtime will expire during the weekend (in case it might page)
[09:58:40] volans: ack, I'll be undoing it or extending it before then
[09:58:49] np, thx
[10:00:11] Closing the Splunk thread, I'll go and resolve them all unless anyone thinks otherwise
[10:10:41] sobanski: +1
[10:11:40] We probably need
to follow up on why they weren't auto-resolved, or is that a known scenario?
[10:13:41] sobanski: sometimes it happens yeah, though we've focused more on alertmanager than improving icinga since it doesn't happen very often
[10:13:58] Got it
[10:15:01] godog: would it be easy to create a simple check in icinga that alerts here on IRC (no page) if there is any open incident on splunk opened more than say 6h ago?
[10:19:41] Puppet, Packaging, User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (jbond) p:Triage→Low
[10:21:28] Puppet, Packaging, User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (jbond) trying `gem2deb facter` we get the following issue ` dpkg-checkbuilddeps: error: Unmet build dependencies: ruby-hocon (>= 1.3) ruby-thor (<< 2.0) ruby-thor (>= 1.0.1) ` As such we need t...
[10:23:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16535 and previous config saved to /var/cache/conftool/dbconfig/20210616-102349-marostegui.json
[10:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:31] SRE, Okapi [Wikimedia Enterprise], Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (Eugene.chernov) Thank you for the change. Now I can see wikimedia.com NS records are over to AWS.
https://www.nslookup.io/dns-records/wikimedia.com
[10:32:59] volans: probably not too complex
[10:34:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1007.eqiad.wmnet with reason: REIMAGE
[10:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16536 and previous config saved to /var/cache/conftool/dbconfig/20210616-103604-root.json
[10:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1007.eqiad.wmnet with reason: REIMAGE
[10:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:04] (PS2) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix data structure name [puppet] - https://gerrit.wikimedia.org/r/698209
[10:50:06] (PS1) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050
[10:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16537 and previous config saved to /var/cache/conftool/dbconfig/20210616-105108-root.json
[10:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:44] (PS2) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port.
[puppet] - https://gerrit.wikimedia.org/r/700050
[10:52:47] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:13] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:23] (CR) jerkins-bot: [V: -1] mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050 (owner: Giuseppe Lavagetto)
[10:56:08] (PS3) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050
[10:57:10] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29895/console" [puppet] - https://gerrit.wikimedia.org/r/700050 (owner: Giuseppe Lavagetto)
[10:59:20] (CR) Jelto: [C: +2] bacula/gitlab: add a backup::set for gitlab and use it [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[11:03:52] (PS1) Jbond: cfssl::cert: renew certificates when there are 11 days left [puppet] - https://gerrit.wikimedia.org/r/700051
[11:05:18] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29896/console" [puppet] - https://gerrit.wikimedia.org/r/700051 (owner: Jbond)
[11:05:37] (CR) Jbond: [V: +1 C: +2] cfssl::cert: renew certificates when there are 11 days left [puppet] - https://gerrit.wikimedia.org/r/700051 (owner: Jbond)
[11:06:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after schema change',
diff saved to https://phabricator.wikimedia.org/P16538 and previous config saved to /var/cache/conftool/dbconfig/20210616-110612-root.json
[11:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:29] (PS1) Jbond: P:tendril::webserver: Ensure apache is automatically refreshed when certs change [puppet] - https://gerrit.wikimedia.org/r/700052
[11:15:38] (PS1) Jbond: cfssl::cert: update to the correct amount of seconds [puppet] - https://gerrit.wikimedia.org/r/700053
[11:15:59] (CR) Jbond: [C: +2] P:tendril::webserver: Ensure apache is automatically refreshed when certs change [puppet] - https://gerrit.wikimedia.org/r/700052 (owner: Jbond)
[11:16:14] (CR) Vgutierrez: [C: +1] cfssl::cert: update to the correct amount of seconds [puppet] - https://gerrit.wikimedia.org/r/700053 (owner: Jbond)
[11:16:25] (CR) Jbond: [C: +2] cfssl::cert: update to the correct amount of seconds [puppet] - https://gerrit.wikimedia.org/r/700053 (owner: Jbond)
[11:20:51] !log running `nodetool cleanup` on maps1005
[11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16539 and previous config saved to /var/cache/conftool/dbconfig/20210616-112115-root.json
[11:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:27] (CR) Effie Mouzeli: [C: +1] mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050 (owner: Giuseppe Lavagetto)
[11:40:15] sleepless in Australia?
[12:00:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131', diff saved to https://phabricator.wikimedia.org/P16540 and previous config saved to /var/cache/conftool/dbconfig/20210616-120015-marostegui.json
[12:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:26] (CR) Jcrespo: [C: -2] "This will create duplicate backup storage as it is now, we won't have enough storage for this:" [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[12:00:58] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[12:00:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[12:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:41] (CR) Jcrespo: [C: -2] "Also, the default backup policy (weekly full backups) for exports will not be a good match."
[puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[12:08:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16541 and previous config saved to /var/cache/conftool/dbconfig/20210616-120818-root.json
[12:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:58] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16543 and previous config saved to /var/cache/conftool/dbconfig/20210616-122322-root.json
[12:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:47] !log deploying heartbeat service puppet change
[12:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:59] (CR) Kormat: [C: +2] mariadb: Automatically manage pt-heartbeat.
[puppet] - https://gerrit.wikimedia.org/r/699213 (owner: Kormat)
[12:36:54] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:37:04] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:37:20] PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16544 and previous config saved to /var/cache/conftool/dbconfig/20210616-123826-root.json
[12:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:30] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:33] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:43] *siiiigh*, is it the telia link again isn't it ?
[12:38:58] * volans here
[12:38:59] XioNoX topranks ^ what do you think ?
[12:39:01] yo
[12:39:05] let's depool
[12:39:20] RECOVERY - Host ncredir5001 is UP: PING WARNING - Packet loss = 90%, RTA = 225.30 ms
[12:39:22] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 226.46 ms
[12:39:24] ah?
[12:39:26] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 230.65 ms
[12:39:31] I was about to kill the telia link
[12:39:33] (PS1) Volans: Revert "Revert "Depool eqsin"" [dns] - https://gerrit.wikimedia.org/r/700027
[12:39:40] I've created ^^^
[12:39:46] RECOVERY - Host ganeti5001 is UP: PING OK - Packet loss = 0%, RTA = 225.51 ms
[12:40:00] (CR) Filippo Giunchedi: [C: +1] Revert "Revert "Depool eqsin"" [dns] - https://gerrit.wikimedia.org/r/700027 (owner: Volans)
[12:40:01] volans: yeah +1
[12:40:01] XioNoX: go ahead
[12:40:14] but you told yesterday better to be depooled without the link
[12:40:15] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 224.99 ms
[12:40:16] is that correct?
[12:41:32] * volans waiting for XioNoX given the recovery
[12:41:38] mtr seems ok
[12:41:46] this one I'm draining it, not killing it
[12:41:57] https://www.irccloud.com/pastebin/OrHHe8sH/
[12:43:28] alright, telia link is up but traffic is routed over the tunnel link
[12:44:42] XioNoX: do you think we should depool anyway? how much redundancy is left?
[12:45:06] it's fine to keep it pooled for now
[12:45:22] sending most of asia to AMS or SFO ofc is not great but better than failures ;)
[12:45:35] telia is not fully down
[12:45:51] if telia goes hard down we'll get a router interfaces down alert and can depool then
[12:45:53] so redundancy is not great, not terrible :)
[12:46:44] I was in the middle of an email to Telia, I guess I'll add that :)
[12:47:30] (I resolved the incident)
[12:49:28] (PS1) Zabe: Rename Portal and Portal talk namespaces on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868)
[12:49:36] email sent
[12:49:51] thx
[12:50:05] I'll leave the patch there in case is needed
[12:50:06] smokeping didn't pickup the connectivity issue: https://smokeping.wikimedia.org/?target=eqsin
[12:52:47] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[12:53:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16545 and previous config saved to /var/cache/conftool/dbconfig/20210616-125329-root.json
[12:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:39] (CR) Jelto: "> Patch Set 2: Code-Review-2" [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[12:57:47] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org
[12:59:10] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB
template1 (host:localhost) 208 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:02:47] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[13:22:54] (CR) Elukey: Add support for knative serving (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[13:23:01] (PS6) Elukey: Add support for knative serving [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194)
[13:26:43] PROBLEM - Check systemd state on db2151 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:48] (CR) Jcrespo: [C: -2] "> Patch Set 2: -Code-Review" [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[13:35:03] PROBLEM - Check systemd state on db1176 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:12] kormat: ^
[13:41:19] you still testing?
[13:41:43] i am not
[13:41:56] ah those are the new hosts I think
[13:41:59] what is db1176?
[13:42:02] jynus: ^ does that need pt-heartbeat?
[13:42:07] mmmm
[13:42:11] those are mediabackups misc kormat
[13:42:11] what are those?
[13:42:13] i'll ack, you folks figure it out :)
[13:42:25] jynus: I believe the mediabackup misc hosts :)
[13:42:27] I think those are standalone
[13:42:38] so I guess no pt?
[13:42:43] ACKNOWLEDGEMENT - Check systemd state on db1176 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service Kormat unsure https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:08] jynus: profile::mariadb::mysql_role: 'master' should we remove that?
[13:43:12] mmmm [13:43:16] let me check [13:43:20] sure, no rush [13:43:37] * marostegui goes back to his all hands session [13:44:19] maybe it just needs some grant fixes? [13:46:10] failed: Unknown database 'heartbeat' [13:46:26] so that is an easy fix [13:50:12] interesting, it exists on codfw, so what is failing there? [13:51:12] "Data too long for column 'shard' at row 1" [13:51:14] lol [13:51:55] kormat, marostegui either we make the column larger or the value ('mediabackupstemp') smaller [13:52:11] hahaha [13:52:17] for now I think I will alter just this db [13:52:20] It is probably easier to make the value smaller [13:52:24] it is isolated enough it won't be an issue [13:52:43] jynus: +1 for the quick fix anyways [13:52:45] it is not mediawiki and will never serve mediawiki traffic without a full reimage [13:59:27] RECOVERY - Check systemd state on db2151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:55] (03PS2) 10Zabe: Rename Portal and Portal talk namespaces on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868) [14:06:40] (03CR) 10Elukey: "This is now ready for review, works on minikube." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [14:10:37] RECOVERY - Check systemd state on db1176 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:07] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:20:53] (03PS1) 10Hnowlan: postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 [14:45:31] (03CR) 10JMeybohm: [C: 04-1] Add support for knative serving (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [14:45:44] (03CR) 10Volans: "I agree with the idea, there is no need to have puppet manage that. Added the other people involved in postgres installations (netbox and " [puppet] - 10https://gerrit.wikimedia.org/r/700071 (owner: 10Hnowlan) [14:56:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::yaml_defs: fix data structure name [puppet] - 10https://gerrit.wikimedia.org/r/698209 (owner: 10Giuseppe Lavagetto) [14:58:06] (03PS1) 10Volans: icinga: rename some IcingaHosts methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 [14:58:45] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - 10https://gerrit.wikimedia.org/r/700050 (owner: 10Giuseppe Lavagetto) [14:58:57] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. 
[puppet] - 10https://gerrit.wikimedia.org/r/700050 [15:04:03] (03PS1) 10Arlolra: Switch to using parsoid-async for direct VirtualRestClient connects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700077 (https://phabricator.wikimedia.org/T244609) [15:05:24] (03CR) 10jerkins-bot: [V: 04-1] Switch to using parsoid-async for direct VirtualRestClient connects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700077 (https://phabricator.wikimedia.org/T244609) (owner: 10Arlolra) [15:08:02] (03CR) 10Arlolra: Bump envoy timeout for parsoid-php (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/699425 (https://phabricator.wikimedia.org/T244609) (owner: 10Arlolra) [15:09:18] (03CR) 10David Caro: "Note to self: this breaks the wmcs branch:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 (owner: 10Volans) [15:13:03] (03CR) 10Volans: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 (owner: 10Volans) [15:14:18] 10Puppet, 10Packaging, 10User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (10jbond) ruby-hocon 1.3 builds fine using the commands from the [[ https://wiki.debian.org/Teams/Ruby/Packaging#Updating_a_package_to_a_newer_version | ruby packaging team ]] i.e ` $ mr --trust-all... 
[15:17:02] (03CR) 10Volans: "This is a proposal to clean a bit the new IcingaHosts API before starting to migrate all the cookbooks to it from the Icinga API in Spicer" [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 (owner: 10Volans) [15:22:03] !log testing upcoming Scap release on beta [15:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:33] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1754872 MB (22% inode=79%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [15:40:56] 10Puppet, 10Packaging, 10User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (10jbond) Also following the instructions `https://wiki.debian.org/git-pbuilder#Installing_Extra_Packages` to add the updated ruby-hocon` package i was able to build ruby-facter using the ruby-meta r... [15:49:21] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:41] (03PS1) 10Majavah: metricsinfra: Monitor toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/700082 [16:04:34] (03PS1) 10Jelto: copy latest backup to dedicated folder [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/700084 (https://phabricator.wikimedia.org/T274463) [16:12:28] (03CR) 10Jelto: "the review in https://gerrit.wikimedia.org/r/c/operations/puppet/+/697850 says that we need to store the latest backup of GitLab for bacul" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/700084 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:18:50] !log Resetting metric on Telia CCT IC-331929, cr1-codfw and cr3-eqsin. 
[16:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:27] ^^ this seems stable other than the blip we had earlier, will try to bring back into service and monitor status. [16:22:09] done [16:27:31] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: drop support for 2.1 in metrics. Fix collector version [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: 10Hnowlan) [16:37:11] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:39] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:17] (03PS1) 10Hnowlan: maps: make maps2007 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) [17:11:10] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29897/console" [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [17:15:27] (03CR) 10Jgiannelos: [C: 03+1] postgres: fix sync bugs in resync_replica script [puppet] - 10https://gerrit.wikimedia.org/r/699430 (owner: 10Hnowlan) [17:41:04] !log Reverted Scap release on beta [17:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:18] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10sgrabarczuk) > We are tentatively planning a datacenter switchover [...] A date for switching back hasn't been set yet. We're waiting for the... 
[18:35:13] 10SRE, 10Technical-blog-posts, 10Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (10srodlund) Fixed! [18:41:30] (03PS4) 10Bstorm: Replace os.execv with subprocess.check_call [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [18:43:58] (03CR) 10Bstorm: [C: 03+2] "Ok, now I'll merge this and put up a release patch before we try the more "invasive" patch of setting up new routing for grid engine. That" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [18:44:37] (03Merged) 10jenkins-bot: Replace os.execv with subprocess.check_call [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [18:59:37] (03PS1) 10Bstorm: d/changelog: Prepare for 0.75 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 [19:00:49] (03CR) 10Bstorm: "This has fair few changes in it, but we won't know if there are buried problems until release, so...." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [19:14:35] (03CR) 10Majavah: "Do we want to try to get https://phabricator.wikimedia.org/T278748#7153319 in this release as well?" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [19:18:05] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:21:26] (03CR) 10Legoktm: "Yay!" 
(031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [19:23:38] (03CR) 10Bstorm: "> Patch Set 1:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [19:24:22] (03CR) 10Bstorm: d/changelog: Prepare for 0.75 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [19:37:17] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Legoktm) I added the timeline to https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_June_2021_switch, which is: * Services: Mo... [19:39:11] (03CR) 10Legoktm: Bump envoy timeout for parsoid-php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699425 (https://phabricator.wikimedia.org/T244609) (owner: 10Arlolra) [19:41:38] (03CR) 10Legoktm: "So just to clarify:" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [19:56:18] (03CR) 10Bstorm: Use common k8s labels (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm) [20:18:49] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:08:00] (03PS3) 10H.krishna123: api_db: Add working skeleton code for api_db, add dockerfile [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) [21:11:24] (03CR) 10H.krishna123: "Shall we merge this into master branch? 
(is that the procedure?)" [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [21:32:13] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' . [21:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:08] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' . [21:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:38] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) Shellbox is now running in eqiad and codfw: ` legoktm@deploy1002:/srv/deployment-charts/helmfile.d/services/shellbox$ curl https://kubernetes1001.eqiad.... [21:38:58] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) [21:41:27] (03PS12) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) [21:51:15] (03CR) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [22:10:27] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 204 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:14:13] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 28 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:16:03] 10SRE, 10DBA, 10Datacenter-Switchover: Check "Days in advance preparation" for databases before DC switchover - https://phabricator.wikimedia.org/T285069 (10Legoktm) [22:17:55] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:21:37] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 26 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:21:43] (03PS2) 10Legoktm: Add shellbox to LVS [puppet] - 10https://gerrit.wikimedia.org/r/693959 (https://phabricator.wikimedia.org/T281423) [22:21:45] (03PS2) 10Legoktm: service: Switch shellbox to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/693960 (https://phabricator.wikimedia.org/T281423) [22:21:47] (03PS2) 10Legoktm: service: Switch shellbox to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/693961 (https://phabricator.wikimedia.org/T281423) [22:21:49] (03PS2) 10Legoktm: service: Switch shellbox to production [puppet] - 10https://gerrit.wikimedia.org/r/693962 (https://phabricator.wikimedia.org/T281423) [22:34:20] (03CR) 10Bstorm: "This script (based on others I've run) seems to do idempotent-ish things after testing on one namespace and testing across all namespaces " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm) [22:41:17] 10SRE, 10Wikimedia-Mailing-lists: Please close the wmfkids@ mailing list - https://phabricator.wikimedia.org/T284683 (10Legoktm) 05Open→03Resolved 
a:03Legoktm List closed. We don't tend to delete archives unless there's a good (usually legal) reason to do so. [22:55:09] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:08:59] 10SRE, 10Wikimedia-Mailing-lists: Add link to list archives in default footer - https://phabricator.wikimedia.org/T284256 (10Legoktm) The Wikimedia default footer actually looks like: ` _______________________________________________ ${display_name} mailing list -- ${listname} List information: https://${doma... [23:12:59] RECOVERY - Maps - OSM synchronization lag - codfw on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.736e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [23:55:59] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook