[00:02:47] PROBLEM - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.121 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [00:02:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T355609)', diff saved to https://phabricator.wikimedia.org/P56163 and previous config saved to /var/cache/conftool/dbconfig/20240203-000252-marostegui.json [00:02:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1249.eqiad.wmnet with reason: Maintenance [00:03:01] PROBLEM - cassandra-a SSL 10.192.48.121:7000 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [00:03:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [00:03:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1249.eqiad.wmnet with reason: Maintenance [00:03:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T355609)', diff saved to https://phabricator.wikimedia.org/P56164 and previous config saved to /var/cache/conftool/dbconfig/20240203-000314-marostegui.json [00:28:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T355609)', diff saved to https://phabricator.wikimedia.org/P56165 and previous config saved to /var/cache/conftool/dbconfig/20240203-002817-marostegui.json [00:28:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [00:35:38] (PS4) Zabe: foreachwikiindblist: Return early when no arg is passed [puppet] - https://gerrit.wikimedia.org/r/992263 [00:35:45] (CR) Zabe: foreachwikiindblist: Return early when no arg is passed (2 comments) [puppet] - https://gerrit.wikimedia.org/r/992263 (owner: Zabe) [00:39:04] (PS1) TrainBranchBot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/995347 [00:39:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/995347 (owner: TrainBranchBot) [00:40:20] (CR) Ahmon Dancy: [C: +1] "This change looks reasonable/useful to me." [puppet] - https://gerrit.wikimedia.org/r/992263 (owner: Zabe) [00:43:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P56166 and previous config saved to /var/cache/conftool/dbconfig/20240203-004324-marostegui.json [00:58:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P56167 and previous config saved to /var/cache/conftool/dbconfig/20240203-005830-marostegui.json [01:01:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/995347 (owner: TrainBranchBot) [01:13:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T355609)', diff saved to https://phabricator.wikimedia.org/P56168 and previous config saved to /var/cache/conftool/dbconfig/20240203-011337-marostegui.json [01:13:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:13:51] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [01:13:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:34:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:29] (CR) Andrew Bogott: "check 
experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [01:55:56] (PS4) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [01:58:35] (PS5) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [01:58:55] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [02:10:54] (PS6) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [02:11:09] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [02:34:16] (PS7) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [02:35:01] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [02:39:00] (PS8) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [02:39:31] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:41] (CR) 
Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [02:41:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:44:30] (PS9) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [02:44:41] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [02:46:35] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:47:37] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:51:22] (PS10) Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) [02:51:28] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/995369 (https://phabricator.wikimedia.org/T350995) (owner: Andrew Bogott) [02:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:07:33] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:03] RECOVERY - OSPF 
status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:09:31] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:11:57] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1010 is CRITICAL: CRITICAL - dgawiki_content_first[0](2024-01-30T20:52:14.401Z), bmwikiquote_general_1692325479[0](2024-01-30T20:52:14.402Z), ilowiki_content_1682003872[0](2024-01-30T20:52:14.403Z), nlwikinews_content_1682132908[0](2024-01-30T20:52:14.398Z), bjnwikiquote_content_first[0](2024-01-30T20:52:14.400Z), napwikisource_content_1682124288[0](2024-01-30T20:52:14.4 [04:11:57] wiki_content_1682324242[0](2024-01-30T20:52:14.402Z), bpywiki_general_1692375521[0](2024-01-30T20:52:14.399Z), angwikiquote_content_1692144583[0](2024-01-30T20:52:14.402Z), fiwikivoyage_general_1693144275[0](2024-01-30T20:52:14.399Z), chowiki_content_1692531553[0](2024-01-30T20:52:14.401Z), btmwiktionary_general_1692403885[0](2024-01-30T20:52:14.400Z), hewikinews_general_1681947456[0](2024-01-30T20:52:14.403Z), kowikiquote_general_1682063 [04:11:57] 024-01-30T20:52:14.401Z), ltwikiquote_general_1682083320[0](2024-01-30T20:52:14.400Z), cywikiquote_general_1692581318[0](2024-01-30T20:52:14.402Z), 
hiwikimedia_content_1681961237[0](2024-01-30T20:52:14.402Z), e https://wikitech.wikimedia.org/wiki/Search%23Administration [04:19:19] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1009 is CRITICAL: CRITICAL - quwikibooks_content_1682184136[0](2024-01-30T20:52:14.403Z), napwikisource_content_1682124288[0](2024-01-30T20:52:14.401Z), newiktionary_content_1682126793[0](2024-01-30T20:52:14.403Z), knwikiquote_general_1682050558[0](2024-01-30T20:52:14.401Z), angwikiquote_content_1692144583[0](2024-01-30T20:52:14.402Z), bewikisource_general_1692300925[0]( [04:19:19] 30T20:52:14.401Z), dgawiki_general_first[0](2024-01-30T20:52:14.400Z), bewikibooks_general_1692297309[0](2024-01-30T20:52:14.401Z), ttwiktionary_content_1682372674[0](2024-01-30T20:52:14.401Z), kgwiki_general_1682046219[0](2024-01-30T20:52:14.402Z), fywiktionary_content_1693308635[0](2024-01-30T20:52:14.400Z), labtestwiki_content_1682073064[0](2024-01-30T20:52:14.401Z), cswikiversity_content_1692566020[0](2024-01-30T20:52:14.399Z), angwik [04:19:19] general_1692144613[0](2024-01-30T20:52:14.399Z), arwikinews_general_1692183507[0](2024-01-30T20:52:14.402Z), zuwiktionary_content_1682468427[0](2024-01-30T20:52:14.402Z), hiwikimedia_general_1681961248[0](2024- https://wikitech.wikimedia.org/wiki/Search%23Administration [04:32:11] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1007 is CRITICAL: CRITICAL - blkwiki_general_1692325160[0](2024-01-30T20:52:14.400Z), swwiktionary_content_1682336016[0](2024-01-30T20:52:14.399Z), fiwikiquote_general_1693141971[0](2024-01-30T20:52:14.403Z), tswiki_general_1682369644[0](2024-01-30T20:52:14.398Z), bxrwiki_content_1692405982[0](2024-01-30T20:52:14.403Z), sahwikisource_content_1682223239[0](2024-01-30T20:5 [04:32:11] Z), zh_min_nanwikiquote_content_1682432826[0](2024-01-30T20:52:14.402Z), gawikibooks_content_1693313037[0](2024-01-30T20:52:14.399Z), map_bmswiki_content_1682087026[0](2024-01-30T20:52:14.399Z), 
ttwikibooks_content_1682372612[0](2024-01-30T20:52:14.399Z), fiwikibooks_general_1693141213[0](2024-01-30T20:52:14.399Z), ruewiki_content_1682184981[0](2024-01-30T20:52:14.403Z), rmwiki_content_1682184205[0](2024-01-30T20:52:14.402Z), arwikibooks_ [04:32:11] 1692181666[0](2024-01-30T20:52:14.402Z), tawikibooks_content_1682339007[0](2024-01-30T20:52:14.403Z), lijwikisource_general_1682077655[0](2024-01-30T20:52:14.403Z), bawikibooks_general_1692285898[0](2024-01-30T https://wikitech.wikimedia.org/wiki/Search%23Administration [04:46:31] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1008 is CRITICAL: CRITICAL - fywiktionary_content_1693308635[0](2024-01-30T20:52:14.400Z), sswiki_content_1682324242[0](2024-01-30T20:52:14.402Z), kawiktionary_general_1682045987[0](2024-01-30T20:52:14.402Z), tawikinews_content_1682339054[0](2024-01-30T20:52:14.403Z), hakwiki_content_1681938141[0](2024-01-30T20:52:14.402Z), ilowiki_content_1682003872[0](2024-01-30T20:52: [04:46:31] , abwiktionary_content_1692132157[0](2024-01-30T20:52:14.401Z), cywikiquote_content_1692581299[0](2024-01-30T20:52:14.401Z), fiwikivoyage_general_1693144275[0](2024-01-30T20:52:14.399Z), angwikisource_general_1692144613[0](2024-01-30T20:52:14.399Z), iawikibooks_general_1681994016[0](2024-01-30T20:52:14.403Z), mniwiki_general_1682113849[0](2024-01-30T20:52:14.399Z), kowikiquote_general_1682063572[0](2024-01-30T20:52:14.401Z), bxrwiki_conte [04:46:31] 05982[0](2024-01-30T20:52:14.403Z), btmwiktionary_general_1692403885[0](2024-01-30T20:52:14.400Z), pswiktionary_content_1682172419[0](2024-01-30T20:52:14.402Z), hiwikimedia_general_1681961248[0](2024-01-30T20:5 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:12:21] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1002 is CRITICAL: CRITICAL - mhrwiki_general_1682106101[0](2024-01-30T20:52:14.403Z), zhwikiversity_content_1682458435[0](2024-01-30T20:52:14.400Z), 
ttwiktionary_content_1682372674[0](2024-01-30T20:52:14.401Z), ruewiki_content_1682184981[0](2024-01-30T20:52:14.403Z), map_bmswiki_content_1682087026[0](2024-01-30T20:52:14.399Z), ltwikiquote_content_1682083282[0](2024-01-30 [07:12:21] 4.403Z), kaawiki_content_1682043722[0](2024-01-30T20:52:14.403Z), cywikiquote_general_1692581318[0](2024-01-30T20:52:14.402Z), kuwiki_general_1682069766[0](2024-01-30T20:52:14.400Z), hiwikimedia_general_1681961248[0](2024-01-30T20:52:14.399Z), hiwikimedia_content_1681961237[0](2024-01-30T20:52:14.402Z), sswiki_content_1682324242[0](2024-01-30T20:52:14.402Z), tawikinews_content_1682339054[0](2024-01-30T20:52:14.403Z), mywikibooks_general_1 [07:12:21] 9[0](2024-01-30T20:52:14.402Z), olowiki_general_1682147477[0](2024-01-30T20:52:14.399Z), kswikiquote_content_1682069213[0](2024-01-30T20:52:14.402Z), nnwiktionary_content_1682139542[0](2024-01-30T20:52:14.399Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:39] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1001 is CRITICAL: CRITICAL - aswikisource_content_1692252094[0](2024-01-30T20:52:14.400Z), bpywiki_general_1692375521[0](2024-01-30T20:52:14.399Z), kabwiki_content_1682043807[0](2024-01-30T20:52:14.400Z), kowikiquote_general_1682063572[0](2024-01-30T20:52:14.401Z), iewikibooks_content_1682003352[0](2024-01-30T20:52:14.401Z), pawikibooks_general_1682150992[0](2024-01-30T2 [07:18:39] 400Z), guwwikinews_content_first[0](2024-01-30T20:52:14.400Z), tlywiki_general_first[0](2024-01-30T20:52:14.402Z), swwiktionary_content_1682336016[0](2024-01-30T20:52:14.399Z), iawikibooks_general_1681994016[0](2024-01-30T20:52:14.403Z), angwikiquote_content_1692144583[0](2024-01-30T20:52:14.402Z), mhwiki_general_1682106175[0](2024-01-30T20:52:14.403Z), fiwikivoyage_general_1693144275[0](2024-01-30T20:52:14.399Z), rmwiktionary_content_168 [07:18:39] 0](2024-01-30T20:52:14.402Z), bawikibooks_general_1692285898[0](2024-01-30T20:52:14.403Z), 
fawiktionary_content_1693123829[0](2024-01-30T20:52:14.402Z), madwiki_content_1682086579[0](2024-01-30T20:52:14.400Z), https://wikitech.wikimedia.org/wiki/Search%23Administration [07:21:55] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1003 is CRITICAL: CRITICAL - extwiki_content_1693076991[0](2024-01-30T20:52:14.400Z), iawikibooks_general_1681994016[0](2024-01-30T20:52:14.403Z), eewiki_content_1692733368[0](2024-01-30T20:52:14.403Z), fiwikibooks_general_1693141213[0](2024-01-30T20:52:14.399Z), dkwikimedia_general_1692731509[0](2024-01-30T20:52:14.398Z), chowiki_content_1692531553[0](2024-01-30T20:52:1 [07:21:55] bpywiki_general_1692375521[0](2024-01-30T20:52:14.399Z), cowiktionary_general_1692536869[0](2024-01-30T20:52:14.400Z), arwikibooks_general_1692181666[0](2024-01-30T20:52:14.402Z), kswikiquote_content_1682069213[0](2024-01-30T20:52:14.402Z), aswiki_general_1692250469[0](2024-01-30T20:52:14.403Z), kgwiki_general_1682046219[0](2024-01-30T20:52:14.402Z), tewikiquote_content_1682345381[0](2024-01-30T20:52:14.400Z), hewikinews_general_16819474 [07:21:55] 24-01-30T20:52:14.403Z), pswiktionary_content_1682172419[0](2024-01-30T20:52:14.402Z), extwiki_general_1693077226[0](2024-01-30T20:52:14.403Z), krwikiquote_general_1682068959[0](2024-01-30T20:52:14.399Z), zuwik https://wikitech.wikimedia.org/wiki/Search%23Administration [07:25:35] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1006 is CRITICAL: CRITICAL - tswiki_general_1682369644[0](2024-01-30T20:52:14.398Z), bawikibooks_general_1692285898[0](2024-01-30T20:52:14.403Z), mhrwiki_general_1682106101[0](2024-01-30T20:52:14.403Z), hewikinews_general_1681947456[0](2024-01-30T20:52:14.403Z), arwikiquote_content_1692184167[0](2024-01-30T20:52:14.399Z), fowiktionary_content_1693152223[0](2024-01-30T20: [07:25:35] 0Z), chowiki_content_1692531553[0](2024-01-30T20:52:14.401Z), angwikisource_general_1692144613[0](2024-01-30T20:52:14.399Z), 
nowiktionary_general_1682145425[0](2024-01-30T20:52:14.402Z), cywiktionary_content_1692583389[0](2024-01-30T20:52:14.401Z), afwiktionary_general_1692140122[0](2024-01-30T20:52:14.400Z), hiwikimedia_content_1681961237[0](2024-01-30T20:52:14.402Z), azwikibooks_general_1692273436[0](2024-01-30T20:52:14.403Z), quwiktion [07:25:35] ent_1682184179[0](2024-01-30T20:52:14.401Z), dgawiki_general_first[0](2024-01-30T20:52:14.400Z), gnwiki_general_1681935910[0](2024-01-30T20:52:14.402Z), thwikiquote_content_1682353715[0](2024-01-30T20:52:14.401 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:48:31] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1004 is CRITICAL: CRITICAL - thwikiquote_content_1682353715[0](2024-01-30T20:52:14.401Z), mniwiki_content_1682113784[0](2024-01-30T20:52:14.399Z), sahwikisource_content_1682223239[0](2024-01-30T20:52:14.402Z), elwikiversity_content_1692753938[0](2024-01-30T20:52:14.399Z), trwikivoyage_content_1684520752[0](2024-01-30T20:52:14.401Z), ocwiktionary_content_1682147094[0](202 [07:48:31] 20:52:14.401Z), mniwiki_general_1682113849[0](2024-01-30T20:52:14.399Z), iowiki_content_1682009405[0](2024-01-30T20:52:14.401Z), extwiki_content_1693076991[0](2024-01-30T20:52:14.400Z), kbdwiki_general_1682046061[0](2024-01-30T20:52:14.403Z), napwikisource_content_1682124288[0](2024-01-30T20:52:14.401Z), ttwikiquote_content_1682372652[0](2024-01-30T20:52:14.402Z), tawikinews_content_1682339054[0](2024-01-30T20:52:14.403Z), rmwiktionary_co [07:48:31] 82184443[0](2024-01-30T20:52:14.402Z), suwiki_general_1682325319[0](2024-01-30T20:52:14.399Z), testcommonswiki_general_1686951842[0](2024-01-30T20:52:14.400Z), lijwikisource_general_1682077655[0](2024-01-30T20: https://wikitech.wikimedia.org/wiki/Search%23Administration [07:52:53] PROBLEM - ElasticSearch unassigned shard check - 9400 on cloudelastic1005 is CRITICAL: CRITICAL - niawiki_general_1682127438[0](2024-01-30T20:52:14.403Z), 
amwiktionary_content_1692144240[0](2024-01-30T20:52:14.399Z), azwikibooks_general_1692273436[0](2024-01-30T20:52:14.403Z), fawiktionary_content_1693123829[0](2024-01-30T20:52:14.402Z), bawikibooks_general_1692285898[0](2024-01-30T20:52:14.403Z), kowikiquote_general_1682063572[0](2024-01 [07:52:53] 2:14.401Z), cywikiquote_general_1692581318[0](2024-01-30T20:52:14.402Z), rmwiki_content_1682184205[0](2024-01-30T20:52:14.402Z), fiwikivoyage_general_1693144275[0](2024-01-30T20:52:14.399Z), angwikibooks_content_1692144545[0](2024-01-30T20:52:14.400Z), dkwikimedia_general_1692731509[0](2024-01-30T20:52:14.398Z), mnwiktionary_content_1682114598[0](2024-01-30T20:52:14.403Z), fywiktionary_content_1693308635[0](2024-01-30T20:52:14.400Z), guwi [07:52:53] _content_1681937816[0](2024-01-30T20:52:14.402Z), eewiki_content_1692733368[0](2024-01-30T20:52:14.403Z), shnwikivoyage_general_1682229250[0](2024-01-30T20:52:14.402Z), avwiki_general_1692258005[0](2024-01-30T2 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:00:39] ^ taking a look [08:09:24] !log [cloudelastic] current state: `{"cluster_name":"cloudelastic-omega-eqiad","status":"yellow","number_of_nodes":10,"number_of_data_nodes":10,"active_primary_shards":798,"active_shards":1438,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":160,"delayed_unassigned_shards":0,"active_shards_percent_as_number":89.98748435544431}` [08:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:03] !log [cloudelastic] Seeing `replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]`; that likely explains the many unassigned shards of cloudelastic.wikimedia.org:9400 ... 
feels like a previous cookbook run didn't back out successfully leaving replica allocation disabled [08:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:58] !log [cloduelastic] Re-enabled replica allocation on `cloudelastic-omega-eqiad` => `curl -H 'Content-Type: application/json' -XPUT https://cloudelastic.wikimedia.org:9443/_cluster/settings -d '{"transient":{"cluster.routing.allocation":{"enable": "all"}}}'` [08:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:13] meh, wrote `cloduelastic` instead :P [08:19:22] !log [cloudelastic] Replica shards have re-initialized; cluster is back to green. Will probably see a wall of `ElasticSearch unassigned shard check - 9400` resolve messages soon, fingers crossed [08:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:23:19] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1001 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:19] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1005 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:19] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1004 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:19] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1002 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:19] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1003 is OK: OK - All good 
https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:19] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1007 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:20] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1009 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:20] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1006 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:21] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1008 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:23:21] RECOVERY - ElasticSearch unassigned shard check - 9400 on cloudelastic1010 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [08:25:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:50:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:05:57] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [09:07:21] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 1.582 second response time https://wikitech.wikimedia.org/wiki/Docker [10:21:07] (MediaWikiEditFailures) firing: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:26:06] (MediaWikiEditFailures) resolved: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:44:53] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (TheDJ) >>! In T355914#9510509, @Redrose64 wrote: >>>! In T355914#9501705, @Joe wrote: >> Given the chosen size is both non-stand... [10:55:57] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:56:41] PROBLEM - cassandra-b SSL 10.192.48.122:7000 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:16:15] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:16:19] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 
0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:30:33] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2017.codfw.wmnet with reason: Decommissioning — T352469 [13:30:47] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2017.codfw.wmnet with reason: Decommissioning — T352469 [13:30:48] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469 [13:40:55] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:42:17] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Swift [14:39:31] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:31] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:05:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (Volans) There are pending DNS changes in Netbox not committed to the auto-generated DNS repository related to those hosts since yesterday: ` Fri 22... 
[16:55:23] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:56:47] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [17:03:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:18:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:28:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:33:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:01:15] (MediaWikiLatencyExceeded) firing: 
Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:06:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:31:15] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.4321265794755864s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:40:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:42:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:44:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:06:15] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.4851127085001004s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:22:15] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.503494906478542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:27:15] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.053708267554382s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:27:45] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.485202046976389s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:32:45] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.053708267554382s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:33:45] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.7558366376170373s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:53:45] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.160105819276109s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:54:00] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.51669949281251s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded