Follow all of the notes below before any failover attempt.
(All Releases) Latest Release
- Confirm that the latest version of Eyeglass has been installed. Each release adds enhancements based on customer failovers to prevent or document anything that will block or impact failover. Not upgrading to the latest release affects your entitlement to support for a planned failover event. This excludes real DR failovers.
- If you are planning a failover and want an Eyeglass DR readiness assessment, we require 7 days' advance notice and submitted support logs.
- If your Eyeglass installation is within N-2 releases, where N is the currently published GA release, you may stay on that release without affecting support only if the following steps are completed:
- Open a case with support and upload support logs. Then follow instructions below.
- State in the case that the planned failover will use a release that is within N-2.
- Review the Failover Features of the N release here, and update the case to confirm the review.
- Support will request confirmation in the case that the "failover section" of the N, N-1 and N-2 release notes (example here) has been reviewed, and that you confirm and accept the risk in your environment.
- NOTE: Failure to complete #1-3 will affect support entitlement when using an N-2 release for a planned failover.
- Target DR Releases if already running one of these releases
- Check with support for target release or consult software matrix above.
- NOTE: If you are not running one of these releases, an upgrade to the latest GA release is required.
- NOTE: If you are using one of these releases, all release notes apply and are assumed to have been read and accepted prior to any planned failover.
(Release 1.6.3 >) Snapshot schedule expiration offset has a OneFS API bug that adds extra time when the snapshot schedule is created.
- This results in an expiration on the DR cluster that can be greater than the value entered on the source cluster. For example, an expiry of 20 days becomes 22 days on the target cluster. Different units of offset all result in a value greater than the one entered. After failover, the DR (target cluster) value is synced back to the source (Prod cluster), which loses the original expiry offset and extends the expiry time by a further offset from the API error. This has been raised with EMC as an SR to resolve.
- Workaround: Before failover, ensure a cluster report has been generated (cluster reports icon) or that an existing emailed cluster report exists. Post failover, re-enter the original values on the DR snapshot schedules, using the cluster report values from the source cluster as a reference.
- Another option, if the above workaround does not meet your needs to preserve the expiry of snapshot settings, is to disable the Snapshot Sync jobs in the jobs window.
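The comparison step in the workaround above can be sketched as follows. This is only an illustration: it assumes you have transcribed the expiry values by hand from the pre-failover cluster report and from the DR cluster into two dictionaries. The schedule names, the use of days as the unit, and the function name are hypothetical and are not part of Eyeglass or OneFS.

```python
from typing import Dict, List, Tuple


def find_expiry_drift(
    source_report: Dict[str, int],
    target_current: Dict[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (schedule, original_days, current_days) for each drifted schedule.

    source_report:  expiry values taken from the pre-failover cluster report.
    target_current: expiry values currently shown on the DR (target) cluster.
    Both mappings are hypothetical hand-built inputs, not an Eyeglass export.
    """
    drifted = []
    for name, original in sorted(source_report.items()):
        current = target_current.get(name)
        # Schedules missing on the target are skipped; only mismatches are reported.
        if current is not None and current != original:
            drifted.append((name, original, current))
    return drifted


# Hypothetical values: one schedule has picked up the extra offset.
report_values = {"daily-home": 20, "weekly-projects": 90}
dr_values = {"daily-home": 22, "weekly-projects": 90}
for name, original, current in find_expiry_drift(report_values, dr_values):
    print(f"{name}: re-enter {original} days (currently {current})")
```

Each reported schedule is one whose original value needs to be re-entered on the DR cluster after failover.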
(All Releases) SyncIQ file filters not supported
- File pattern filters are NOT synced on failover, and these pattern filters can result in unprotected data during failover and failback. Failover and failback workflows that use file filters require customer testing for their own use case. All file filter scenarios are untested, and no support is provided for custom workflows or for file-filter-related failover and failback issues that are not present under the normal failover and failback workflow.
(Release < 1.8.0) Technical Advisory #10
- For the case where Isilon clusters have been added to Eyeglass using FQDN, an uncontrolled failover where the source cluster is not reachable does not start and gives the error "Error performing zone failover: Cannot find associated source network element for zone". This issue will be addressed in a 1.8.1 patch. Eyeglass installations that use FQDN to add clusters must upgrade to this patch once it is available. Workaround: Please refer to Technical Advisory #10.
(All Releases) OneFS Failover and Failback without waiting for quota scan job to complete
- In OneFS 8, a quota scan job is started as soon as a quota is created (this cannot be disabled on OneFS 8). Resync Prep on failover or failback will fail when a quota scan job is active on a path on the target cluster. Do not add or edit quotas before or during failover. If you have quotas with snapshot overhead enabled, deleting a snapshot may trigger a quota scan. Also, after an Eyeglass failover, quotas are created by Eyeglass and a quota scan will start. If failback is attempted right away (typically a testing-only scenario) without waiting for the quota scan to complete, the Resync Prep step is blocked from running by the domain lock held by the quota scan.
- Workaround: Wait for the quota scan job to complete before attempting failover or failback. Use the cluster running jobs UI to verify whether a quota scan is running before attempting to failover.
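The "wait for the quota scan to complete" check can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions: `fetch_running_jobs` is a hypothetical callable that you supply (for example, one that returns what the cluster running jobs UI shows), and the `"type"` key and the `"QuotaScan"` job type name are assumptions about how those records are labelled, not a confirmed OneFS API contract.

```python
import time
from typing import Callable, Iterable, Mapping


def quota_scan_active(running_jobs: Iterable[Mapping[str, str]],
                      job_type: str = "QuotaScan") -> bool:
    """True if any running job record has the (assumed) quota scan type."""
    return any(job.get("type", "").lower() == job_type.lower()
               for job in running_jobs)


def wait_until_quota_scan_clear(
    fetch_running_jobs: Callable[[], Iterable[Mapping[str, str]]],
    poll_seconds: float = 60,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no quota scan is running; return False if it never clears.

    fetch_running_jobs is a hypothetical hook returning the cluster's
    currently running jobs as dictionaries.
    """
    for _ in range(max_polls):
        if not quota_scan_active(fetch_running_jobs()):
            return True  # safe to proceed with failover or failback
        sleep(poll_seconds)
    return False  # still running after max_polls checks; do not proceed
```

The sleep function is injectable so that the loop can be exercised without real delays. A `False` result means the quota scan was still active after the last check, and failover or failback should not be started.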
(All Releases) SPNs not updated during failover for OneFS8 non-default groupnet AD provider (T3848)
- For the case where OneFS 8 is configured with multiple groupnets and different AD providers between groupnets, the SPN update during failover does not succeed for non-default groupnet AD providers. SPNs are not deleted for the source cluster and are not created for the target cluster, yet the failover log indicates success. This is due to a OneFS 8 defect with multiple AD providers and isi commands, scheduled for a patch release in OneFS 126.96.36.199. NOTE: SPN delete / create for the AD provider defined in groupnet0 is successful. Workaround: Manually delete and create the SPNs for the SmartConnect Zones that were moved, using the AD ADSI Edit interface.
(Release 1.8.3) Failure to run Resync Prep step during DFS Failover Deletes Shares on Target Cluster (T4145)
- If, during a DFS failover, the Resync Prep step does not run because of an error before or in the Resync Prep step itself, the post-failover Configuration Replication finds that the Eyeglass Job is still active on the failover source cluster, and the replication of the renamed igls-dfs-<share> results in deletion of the <share> on the target cluster.
- Workaround: Prior to failover disable the Configuration Replication task. This does not affect the Configuration Replication step executed during failover.
- To disable the Eyeglass Configuration Replication task, execute the below command from the Eyeglass appliance command line:
- igls admin schedules set --id Replication --enabled false
- Post successful failover, re-enable the Eyeglass Configuration Replication task.
- To enable the Eyeglass Configuration Replication task, execute the below command from the Eyeglass appliance command line:
- igls admin schedules set --id Replication --enabled true
- Fixed in > 1.9.0: if any step fails, the Jobs in Eyeglass are left in the user disabled state and will not run until manually enabled again. This ensures that SyncIQ policy issues can be recovered to the correct state first, after which the user enables the policies in Eyeglass.
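The disable / failover / re-enable sequence in the workaround above can be wrapped so that the two commands are never run out of order. The sketch below uses only the two `igls` commands quoted in this section. The wrapper itself (`replication_task_disabled`, `run_on_appliance`) is hypothetical and assumes the script runs on the Eyeglass appliance command line. It re-enables the task only when the wrapped block finishes without an exception, matching "post successful failover".

```python
import shlex
import subprocess
from contextlib import contextmanager

# The two commands quoted in the workaround above.
DISABLE_CMD = "igls admin schedules set --id Replication --enabled false"
ENABLE_CMD = "igls admin schedules set --id Replication --enabled true"


def run_on_appliance(command: str) -> None:
    """Run a command locally; assumes this script executes on the appliance."""
    subprocess.run(shlex.split(command), check=True)


@contextmanager
def replication_task_disabled(run=run_on_appliance):
    """Disable Configuration Replication, then re-enable it on success only."""
    run(DISABLE_CMD)
    # If the wrapped block raises, execution never reaches the line after
    # yield, so the task stays disabled for manual investigation.
    yield
    run(ENABLE_CMD)


# Dry run: print the commands instead of executing them.
with replication_task_disabled(run=print):
    print("... perform the failover in DR Assistant ...")
```

Passing `run=print` gives a dry run that shows the command order without touching the appliance.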
(Releases < 1.9.0) DR Assistant returns "Error Retrieving Zones Undefined" if many access zones exist
- This error can occur when attempting a failover where many access zones and many policies per access zone are configured. A database query times out before returning all the data needed to validate the failover. This is addressed with an optimized DB query in the 1.9 release. The impact is an inability to start a failover.
- Workaround: Increase the timeout used by the browser so that all the data needed to start a failover can be returned from the database.
- For a temporary fix, follow the steps below and update the support case with the result:
- SSH to the eyeglass appliance as admin user
- type password (default: 3y3gl4ss)
- sudo su - (default password: 3y3gl4ss)
- vi /srv/www/htdocs/eyeglass/js/eyeglass_globals.js
- Change the ajax_get_timeout_seconds value to 600.
- Please refer to the screenshot for details:
- :wq! (save the changes)
- Log in to the Eyeglass web page, open DR Assistant, and check whether the error is still present or has been resolved. You may need to clear the browser cache to ensure that the new JavaScript, which includes the new timeout, is loaded into the browser.
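For reference, the edit made in vi above amounts to replacing one number in the text of eyeglass_globals.js. The sketch below shows that substitution on a string. The exact form of the assignment in the file is not shown in this note, so the pattern (name, then `=` or `:`, then digits) and the sample line are assumptions. Back up the file before applying anything like this to it.

```python
import re


def set_ajax_timeout(js_source: str, seconds: int = 600) -> str:
    """Return js_source with ajax_get_timeout_seconds set to `seconds`.

    Assumes the setting looks like `ajax_get_timeout_seconds = <digits>` or
    `ajax_get_timeout_seconds: <digits>`; the real file may differ.
    """
    pattern = re.compile(r"(ajax_get_timeout_seconds\s*[:=]\s*)\d+")
    updated, count = pattern.subn(
        lambda match: f"{match.group(1)}{seconds}", js_source
    )
    if count == 0:
        # Refuse to guess if the setting is not where we expect it.
        raise ValueError("ajax_get_timeout_seconds not found")
    return updated


# Hypothetical sample line, not copied from the real file.
sample = "var ajax_get_timeout_seconds = 60;"
print(set_ajax_timeout(sample))
```

The function raises an error instead of silently returning unchanged text, so a mismatch between the assumed pattern and the real file is noticed immediately.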
(Releases >= 1.9.0) DFS Failover Enhancement to handle partial or complete Share Rename failures
- DFS mode uses parallel threads to rename shares for all policies involved in the failover.
- If share renaming fails for all the shares from a cluster, the failover status is error. Failover is stopped and users are not redirected to the target cluster. Make writeable and Resync Prep do not run, and data is still active on the source cluster.
- If share renaming fails for only some shares from the source cluster, the failover status is warning AND failover will continue to run make writeable and Resync Prep.
- Summary: If only some shares fail to rename, it is best to continue the failover. If all shares fail to rename, the failover is aborted and stopped.
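The all / some / none rule described above can be written down as a small classification function. This is an illustrative sketch of the rule as stated in this note, not Eyeglass source code. The names `RenameOutcome` and `classify_share_rename` and the share names are hypothetical.

```python
from enum import Enum
from typing import Dict


class RenameOutcome(Enum):
    SUCCESS = "success"   # every share renamed
    WARNING = "warning"   # some renames failed; failover continues
    ERROR = "error"       # all renames failed; failover stops


def classify_share_rename(results: Dict[str, bool]) -> RenameOutcome:
    """Map per-share rename results (True = renamed) to a failover status."""
    failed = [share for share, renamed in results.items() if not renamed]
    if not failed:
        return RenameOutcome.SUCCESS
    if len(failed) == len(results):
        return RenameOutcome.ERROR
    return RenameOutcome.WARNING


# Hypothetical share names: one of two renames failed, so failover continues.
print(classify_share_rename({"projects": True, "home": False}))
```

An empty result set is treated as success here because nothing failed; that choice is an assumption of the sketch.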
(All Releases) Access Zone Failover Networking Roll back on failures
- This feature has been available for some time, and you should understand how it works.
- During the make writeable step, Eyeglass sends an API call to the target cluster to start making the data writeable.
- At this point in the failover, SmartConnect networking and SPN failover have been completed, and dual delegation means that new mount requests and SPN authentication are handled by the target cluster.
- If the make writeable step succeeds on at least ONE policy of N (of all policies involved in the Access Zone), the failover logic will continue. This means you are partially failed over for some of the data in the access zone. It also means all networking and SPNs are failed over. The next step is to resolve the failed make writeable step on the remaining policies to get the file system writeable. This often requires an EMC SR to resolve the root cause of the failure on the target cluster.
- If NONE of the policies pass the make writeable step, an AUTOMATIC rollback reverts SmartConnect networking and SPNs to the source cluster.
- The failover log shows whether a networking rollback was initiated. If you find this in the failover log, your failover is aborted and all data remains writeable on the source cluster.
- Example Log entry 2017-06-08 22:49:00::260 INFO Starting Step: "Networking Updates Rollback"
- To validate source cluster data access, do the following:
- nslookup to the smartconnect name(s) involved in the failover (use failover log for full list). IP returned should be from the source cluster
- Test share and NFS mount access to the source cluster and verify you can mount and write
- This will validate SPN authentication for shares as well.
- Determine the root cause, which may require an EMC SR to resolve, before rescheduling the failover.
- Summary: This logic determines the best option automatically. If some data fails over successfully, it is better to resolve only the failed policies than to abort the entire failover. If no data succeeds at the make writeable step, it is best to revert and abort the failover. Eyeglass handles this decision automatically.
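The decision rule and the log check described above can be sketched as follows. The marker string is taken from the example log entry in this section. Everything else (function names, policy names, the way results are represented) is hypothetical and only illustrates the rule; it is not Eyeglass code.

```python
from typing import Dict, Iterable

# Text of the step name as it appears in the example log entry above.
ROLLBACK_MARKER = 'Starting Step: "Networking Updates Rollback"'


def networking_rolled_back(log_lines: Iterable[str]) -> bool:
    """True if the failover log shows that networking rollback was started."""
    return any(ROLLBACK_MARKER in line for line in log_lines)


def make_writeable_decision(policy_results: Dict[str, bool]) -> str:
    """Apply the rule above: one success of N continues, zero rolls back."""
    if not policy_results:
        raise ValueError("no policies were involved in the failover")
    return "continue" if any(policy_results.values()) else "rollback"


log = ['2017-06-08 22:49:00::260 INFO Starting Step: "Networking Updates Rollback"']
if networking_rolled_back(log):
    print("Failover aborted: validate data access on the source cluster")
```

Scanning a saved failover log this way is a quick first check before running the nslookup and mount tests listed above.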
(All Releases) Isilon OneFS 188.8.131.52 API load sharing with Smartconnect issue
- An issue was found with this OneFS release when clusters are added to Eyeglass using an FQDN SmartConnect name, which load shares the API calls from Eyeglass across nodes. The DFS mode share rename step uses parallel API calls, and the load sharing results in an HTTP 409 AEC error from the cluster, so the share rename step fails.
- The share is renamed correctly, but the cluster does not remove the old share, leaving both igls-dfs-sharename and sharename on the target cluster.
- The HTTP 409 error is sent incorrectly by the cluster, and Eyeglass treats this as a failed step even though the rename was successful.
- Summary: The workaround is to delete the cluster from the Eyeglass inventory window and re-add it using the subnet service IP (SSIP) to avoid this cluster bug. There is no known resolution for this issue on OneFS at this time. The impact of not switching to SSIP is a failed DFS failover when using an FQDN cluster add with 184.108.40.206.
(All Releases) Missing SPN Validations for Zone Readiness and Pool Readiness cause SPN create/delete to fail during failover
Zone Readiness and Pool Readiness SPN validations do not check for the conditions below.
IMPACT: These conditions will cause SPN delete/create to fail during a failover:
1) SPN has been created in AD with lower case host (example: host/SPN_name) instead of uppercase HOST (example: HOST/SPN_name)
2) SPN has been created in AD where SPN_name has different case than associated SmartConnect Zone name (example: for SmartConnectZone prod.example.com SPN is configured as HOST/Prod.Example.com)
Workaround: Modify any SPNs in AD that have the above issues so that all SPNs
1) use upper case HOST in the SPN definition (HOST/SPN_name)
2) have an SPN name that matches the case of the SmartConnect Zone name
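Because Zone Readiness and Pool Readiness do not check these two conditions, a manual pre-check of the SPN list can help. The sketch below tests a single SPN string against its SmartConnect Zone name for both conditions. It assumes you have already exported the SPN strings from AD by some other means; the function name is hypothetical.

```python
from typing import List


def spn_case_problems(spn: str, smartconnect_zone: str) -> List[str]:
    """Return the case problems (from the two conditions above) for one SPN."""
    service, separator, name = spn.partition("/")
    if not separator:
        return ["SPN is not in service/name form"]
    problems = []
    # Condition 1: 'host' written in anything other than upper case.
    if service.lower() == "host" and service != "HOST":
        problems.append("service class must be upper case HOST")
    # Condition 2: same name as the zone, but with different letter case.
    if name.lower() == smartconnect_zone.lower() and name != smartconnect_zone:
        problems.append("SPN name case differs from SmartConnect Zone name")
    return problems


# The example from condition 2 above.
print(spn_case_problems("HOST/Prod.Example.com", "prod.example.com"))
```

An empty list means the SPN passes both checks; any entry in the list points to an SPN that should be corrected in AD before failover.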
(All Releases) User Quota creation fails on failover for multiple disjointed AD Domain environment
In an Isilon environment that is configured to use multiple AD Domains and those Domains are not joined, user quota creation for the quotas related to the non-default AD Domain will fail with the error:
Requested persona was not of user or group type
Workaround: None available with Eyeglass.
(ALL RELEASES) Time to complete steps for Allow Writes and Preparation to Failback Unknown
The time needed to complete the failover steps that make data writeable and prepare for failback (Resync Prep) can be long in some environments, because of a large number of files and directories and other factors, and it is neither predictable nor deterministic.
(CURRENT RELEASE) Failover Known Issues
Failover related Known issues for the current release can be found here.
Copyright Superna LLC