Introduction to this Guide
Eyeglass offers single-button assisted failover by Access Zone, IP pool, Microsoft DFS-enabled SyncIQ policies, or individual SyncIQ policies. This document provides:
- An overview of each failover mode
- High level steps for each failover mode
- How to assess readiness for failover
- Planning and operational steps for each failover mode
For guidance on which failover mode is appropriate for your environment, consult the Eyeglass Start Here First document. For each failover option, it provides the information you need to decide which option suits your environment:
- When to use it?
- Why use it?
- What you need to know?
- Estimated knowledge to configure
What’s New with Eyeglass Failover
|1.6||New error handling for OneFS PAPI errors that occur during failover. If PAPI returns an error such as 503 Service Unavailable during the allow writes, run policy/mirror policy, or resync prep steps, Superna Eyeglass now retries the action 3 times, because such errors may be transient.||All Failover Modes|
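The retry behaviour in this entry can be illustrated with a minimal sketch. It is not Eyeglass code: `PapiError`, `call_with_retry`, and the choice to treat only status 503 as transient are assumptions made for illustration, as is the reading of "retry 3 times" as one initial attempt plus up to three retries.

```python
import time

# Assumption: only 503 is treated as transient here; the source gives it as an example.
TRANSIENT_STATUS = {503}
MAX_RETRIES = 3


class PapiError(Exception):
    """Hypothetical exception carrying the HTTP status returned by PAPI."""

    def __init__(self, status, message=""):
        super().__init__(f"PAPI error {status}: {message}")
        self.status = status


def call_with_retry(action, retries=MAX_RETRIES, delay_seconds=0.0):
    """Run action(); on a transient PAPI error, retry up to `retries` times."""
    attempt = 0
    while True:
        try:
            return action()
        except PapiError as err:
            # Non-transient errors, or exhausted retries, are passed to the caller.
            if err.status not in TRANSIENT_STATUS or attempt >= retries:
                raise
            attempt += 1
            time.sleep(delay_seconds)
```

A caller would wrap each failover step (allow writes, run policy, resync prep) in `call_with_retry`, so that a single transient 503 does not fail the step.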
|1.6||New timeout countdown added to each step being processed, so the remaining time is visible during a failover. For long-running steps, the log now includes a URL to the recovery guide and the HTTPS login URL for cluster management, allowing one-click access from a failover to the Isilon UI console to check on cluster operations.||All Failover Modes|
Key steps are now grouped:
- Make writeable: all policies are processed together (in series), making the filesystem writeable faster for all policies involved in the failover.
- Resync prep: now runs in batch for all policies, after the make writeable step for all policies.
|All Failover Modes|
|1.6.1||DR Assistant now presents failover release notes that must be read and acknowledged before a failover is allowed to continue.||All Failover Modes|
As of release 1.7, all failover modes restrict the number of parallel job requests sent to the Isilon cluster for the Run SyncIQ Policy data sync step, based on cluster version:
OneFS 7.2 - 5 parallel job requests (OneFS 7.x clusters have a limit of 5 concurrent policies). Eyeglass monitors the progress of each job and submits a new request as previously submitted requests complete.
OneFS 8 - parallel job request limit is based on the Eyeglass appliance configuration (default 10).
Based on extensive testing for safe failovers, make writeable and resync prep are serialized steps.
|All Failover Modes|
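The version-based limit described in this entry can be sketched as a bounded worker pool. This is a minimal illustration, not Eyeglass code: `parallel_job_limit`, `run_policies`, and the `run_policy` callable are hypothetical names, and treating every version other than 7.x like OneFS 8 is an assumption the source does not make.

```python
from concurrent.futures import ThreadPoolExecutor

ONEFS7_LIMIT = 5           # from the text: OneFS 7.x allows 5 concurrent policies
DEFAULT_ONEFS8_LIMIT = 10  # from the text: appliance default for OneFS 8


def parallel_job_limit(onefs_version, appliance_limit=DEFAULT_ONEFS8_LIMIT):
    """Return the number of SyncIQ job requests allowed in flight at once."""
    major = int(onefs_version.split(".")[0])
    if major == 7:
        return ONEFS7_LIMIT
    # Assumption: anything that is not 7.x uses the appliance-configured limit.
    return appliance_limit


def run_policies(policies, run_policy, onefs_version):
    """Run every policy, never exceeding the limit for the cluster version.

    The pool starts a new job as soon as an earlier one finishes, which mirrors
    "submit a new request as previously submitted requests complete".
    """
    limit = parallel_job_limit(onefs_version)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return dict(zip(policies, pool.map(run_policy, policies)))
```

With twelve policies on a OneFS 7.2 cluster, at most five `run_policy` calls are active at any moment, and the remaining seven start as slots free up.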
This release introduces a parallel failover mode, disabled by default.
High Speed Failover - Parallel Failover Flag:
- Allows the make writeable and resync prep steps to run in parallel with up to 10 threads, keeping up to 10 policies submitted for processing at all times.
- NOTE: The risk of a policy failure increases. With this flag set, a failed policy will NOT stop the failover in progress; Eyeglass continues to issue API calls until all SyncIQ policies in the failover job have been submitted. Recovery may be more complex if more than one policy fails to complete its step (allow writes or resync prep).
- Testing has shown that, for failovers with a large number of policies, running these steps in parallel can improve failover times by 3x to 4x.
Access Zone Failover Enhancement:
- New validation detects time skew between cluster nodes, and between Eyeglass and the clusters.
- A validation warning is raised if skew is detected.
- Time skew can cause failed steps: if the time on different nodes is not within an acceptable range, the step or running status of a policy may not be detected correctly during failover.
SyncIQ Job Reports appended to Eyeglass failover log:
- Policy run and resync prep reports are now appended to the end of the Eyeglass failover log, simplifying triage of failed steps and escalation to EMC support when cluster policies fail.
- All information and time stamps are now in a single file.
|All Failover Modes|
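The time skew validation described in this entry amounts to a pairwise clock comparison. The sketch below is illustrative only: `find_time_skew` is a hypothetical helper, the 60-second threshold is an assumption (the source says only "an acceptable range"), and the readings are assumed to have been sampled at the same instant.

```python
from itertools import combinations


def find_time_skew(clock_readings, max_skew_seconds=60.0):
    """Return (name_a, name_b, skew) for every pair of clocks that differ too much.

    clock_readings maps a name (cluster node or the Eyeglass appliance) to the
    epoch seconds reported by that system.
    """
    warnings = []
    # Sorting makes the order of the reported pairs deterministic.
    for (a, time_a), (b, time_b) in combinations(sorted(clock_readings.items()), 2):
        skew = abs(time_a - time_b)
        if skew > max_skew_seconds:
            warnings.append((a, b, skew))
    return warnings
```

An empty result means no pair exceeded the threshold; any returned tuple corresponds to a validation warning.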
- Open files validation removed from DR Assistant until the Isilon API supports per-Access Zone open files
- New Access Zone readiness validation verifies all IP pools have a SmartConnect zone defined
- DR Assistant SyncIQ reports from a failover are now separated from Eyeglass logs in the failover history, making debugging simpler.
- Restrict at source validation updated to show info only in the DR dashboard
- This simplifies validation of Access Zone readiness for failover. Restrict at source is a best practice; it shows green if implemented, or info if not implemented, on each policy.
- SPN Management Enhancements
- SPN failover enhancement for Access Zone failover now restricts the delete and add SPN API calls to a single cluster node in the target cluster.
- This change ensures that a single domain controller is used for the failover operations.
- Short SPNs are now synced to AD computer objects (not used for Kerberos) during config sync; any that are missing are inserted. NOTE: This is not related to SPN failover; it only maintains newly detected SmartConnect names and ensures they are synced to the AD computer object.
- Failover log real-time view in DR Assistant allows a live failover log to be monitored with auto-refresh, or with stop and pause options.
- Quota Failover Enhancement
- Unlinking a linked quota from its parent quota creates a quota that can be managed with a limit different from the parent quota's.
- Eyeglass now correctly fails over unlinked quotas: the unlinked quotas fail over as normal quotas, and the parent all-users quota is failed over next to ensure no conflict occurs on the target cluster.
- Shares with variable expansion in the path name now sync correctly between clusters
- Ransomware Defender Failover
New Failover Mode
- IP Pool failover, allowing hot-hot data within an Access Zone and more granular failover options. See the Access Zone guide for configuration requirements.
Failover Logic Major Enhancements
- Parallel Failover Jobs:
- This feature allows multiple failovers to execute in parallel. All failover types are supported.
- NOTE: The number of parallel threads is set to 10, shared across all failover jobs.
- LOGGING: The failover log is split into two halves. The first half covers failed-over data and client redirect, showing the failover of data and clients and the post-failover scripts. The second half covers post-failover steps, including failback steps and quota failover.
- Continue on failed step: Based on analysis of many failovers, the new logic continues to execute steps as outlined below. This ensures that all SyncIQ policies are attempted even if one SyncIQ policy encounters an error.
- Make Write Step on each SyncIQ policy - If any policy fails to run, all other policies are run and failover continues. The steps that are not yet run for the failed policy will be skipped.
- Run Resync Prep SyncIQ - If any policy fails to run, all other policies are run and failover continues.
- NOTE: Any policy that fails a step will have its following steps skipped.
- Cancel a running failover: This option appears in the running failover tab of DR Assistant and allows a running failover to be canceled. NOTE: No rollback occurs and the failover stops at whatever step was being executed. All steps to recover from this are manual. Use with caution.
- Cancel Failover option in the running failovers UI. NOTE: Use only if directed by support.
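The continue-on-failed-step rule above (other policies keep running, and a policy that fails a step has its following steps skipped) can be sketched as follows. This is an illustrative model, not Eyeglass code: `run_failover_steps` and the step callables are hypothetical, and steps are run serially here for clarity.

```python
def run_failover_steps(policies, steps):
    """Run each step for every policy, continuing past failures.

    steps is an ordered list of (step_name, callable). A policy that fails a
    step is marked failed and has all of its following steps skipped; the
    other policies are unaffected.
    """
    status = {policy: {} for policy in policies}
    failed = set()
    for step_name, step in steps:
        for policy in policies:
            if policy in failed:
                status[policy][step_name] = "skipped"
                continue
            try:
                step(policy)
                status[policy][step_name] = "ok"
            except Exception as err:  # any error fails only this policy
                failed.add(policy)
                status[policy][step_name] = f"failed: {err}"
    return status
```

If make writeable fails for one policy out of three, the returned status shows that policy as failed for make writeable and skipped for resync prep, while the other two policies complete both steps.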
New Failover Options in DR Assistant
- Data Integrity Failover
- Access Zone, DFS, and per-SyncIQ-policy failover now insert deny-everyone permissions on shares that will be failed over, as a pre-failover step. This closes open files and disconnects users from all shares involved in the failover, ensuring data integrity of the failed-over data set when SyncIQ is run by Superna Eyeglass® after users are disconnected.
- A post-failover step restores share permissions to their original security settings.
- Option to disable this feature on a per-failover basis in DR Assistant.
- Supports SMB shares in this release
- See New DR Assistant option below. Mouse-over help text is available on the failover options.
- Failover option added to skip Quota Failover: This new DR Assistant check box allows skipping the quota failover step when a failback is planned within a short period of time. It can also help avoid failed failovers caused by quota scan failing SyncIQ steps.
Skip quota failover step option in DR Assistant
- In some customer environments the quota scan job interferes with failover and failback performance. The requirement to wait until quota scan completes adds hours to a failover or interrupts a failover with a failed SyncIQ step.
- This feature allows skipping failover of quotas, leaving them on the source cluster.
- Eyeglass has a special quota sync command line tool that allows quotas to be synced AFTER a failover has been completed.
- Customers can now choose to skip quota failover in DR Assistant. Another feature detects if quotas already exist that will fail SyncIQ steps.
DR Assistant Block Failover on Warnings
Overview: This option validates failover jobs and prevents a failover from starting under conditions that would result in a failure. It applies to newly created quotas that have not yet been scanned by the quota scan job.
- Quota scans are triggered on OneFS 8 when quotas are created, or when quota scan jobs are scheduled to run to calculate quotas.
- This can interfere with the make writeable step and resync prep during failover.
- It is best practice to ensure no quotas are created before failover to avoid this conflict.
- Quota scan locks the file system, blocking SyncIQ from completing steps.
- DR Assistant has a new option (enabled by default) that detects whether any quotas matching the SyncIQ policies selected for failover exist on the target cluster at the time of failover, and aborts the failover:
- If any quotas have the ready for Quota scan attribute set (this flag indicates quota scan needs to run)
- Note: disabling or canceling a running quota scan job on the cluster does not avoid the conflict with SyncIQ. The attribute on the quota determines whether the SyncIQ step will fail.
- DR Assistant offers the ability to uncheck this detection function, at the user's risk of SyncIQ steps failing.
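The block-on-warnings check described above can be modelled as a simple pre-failover validation. The sketch is hypothetical: the `Quota` class, the `ready_for_scan` field name, and the rule that a quota "matches" a policy when its path lies under the policy path are all assumptions made for illustration, not the Eyeglass or OneFS data model.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    path: str
    ready_for_scan: bool  # hypothetical name for the "ready for quota scan" attribute


def _is_under(path, root):
    """True if path equals root or is nested below it."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def quotas_blocking_failover(quotas, policy_paths):
    """Return quotas that still need a scan and fall under a selected policy path."""
    return [
        quota
        for quota in quotas
        if quota.ready_for_scan
        and any(_is_under(quota.path, root) for root in policy_paths)
    ]


def validate_failover(quotas, policy_paths, block_on_warnings=True):
    """Abort (raise) when blocking quotas exist, unless the check is unchecked."""
    blocking = quotas_blocking_failover(quotas, policy_paths)
    if blocking and block_on_warnings:
        paths = ", ".join(quota.path for quota in blocking)
        raise RuntimeError(f"Failover blocked: quotas awaiting scan on {paths}")
    return blocking
```

Passing `block_on_warnings=False` corresponds to unchecking the detection option: the blocking quotas are still reported, but the failover is allowed to proceed at the user's risk.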
Failover log Enhancements
- Color-coded success and failure per step, to quickly identify any step that failed.
- Failover Summary: All key steps are summarized at the end of the failover. Example below:
- Overall Failover Job status: Completed, total elapsed time: 0 hours, 11 minutes, 40.50 seconds.
- Final SyncIQ Jobs status: Completed, elapsed time: 0 hours, 1 minutes, 34.02 seconds.
- Client Redirect status: Completed, elapsed time: 0 hours, 0 minutes, 26.17 seconds.
- Make Target Writeable status: Completed, elapsed time: 0 hours, 0 minutes, 40.75 seconds.
- Quota Jobs status: Completed, elapsed time: 0 hours, 0 minutes, 2.21 seconds.
- Preparation for Failback status: Completed, elapsed time: 0 hours, 0 minutes, 56.89 seconds.
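Summary lines in the form shown above are regular enough to parse when comparing failover runs. The following is a minimal sketch; the line format is inferred only from the example in this section, and `parse_summary_line` is a hypothetical helper rather than an Eyeglass tool.

```python
import re

# Pattern inferred from the example summary lines; other log formats are not covered.
SUMMARY_RE = re.compile(
    r"^(?P<step>.+?) status: (?P<status>\w+), (?:total )?elapsed time: "
    r"(?P<h>\d+) hours, (?P<m>\d+) minutes, (?P<s>\d+(?:\.\d+)?) seconds\.$"
)


def parse_summary_line(line):
    """Return (step, status, elapsed_seconds), or None if the line does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    seconds = int(match["h"]) * 3600 + int(match["m"]) * 60 + float(match["s"])
    return match["step"], match["status"], seconds
```

For the first example line, the function returns the step name `Overall Failover Job`, the status `Completed`, and an elapsed time of 700.5 seconds.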
Quota Failover Options
- Environments with large quota counts now have options to collect an inventory of quotas and pre-sync quotas before failover, and to skip quota failover.
- Admin Guide
- Data Integrity failover option will continue on errors to restore permissions after the failover completes and log any failures to the failover log.
Copyright Superna LLC