Updated on 7/10/2019
DR Design Guides
How to Validate and Troubleshoot a Successful Failover When Data Is Not Accessible on the Target Cluster


Debug Plan of attack for clients post failover:

NOTE: Follow the order below to find the root cause.

  1. READ ME FIRST: Have you unmounted the share on, or rebooted, the PC you are using to test SMB share access? Do not proceed until you have done this.
  2. Check DNS.
  3. Mount the share from the client (DFS or non-DFS).
  4. If you get an authentication error, fix it.
  5. If there is no authentication error, test write access.
  6. If there is no write access, remount correctly.
  7. Retest mounting the share and test write access again.
  8. Done.

Steps to Validate That SmartConnect and DNS Failed Over Successfully:


Test DNS response on the clusters: This test verifies that the SmartConnect names failed over successfully, and it can also verify that dual delegation is set up correctly in your DNS environment. Because the queries go directly to the clusters, the test rules out issues with your internal DNS and confirms that the Isilon SmartConnect zones failed over successfully.

  1. Quick test: From a Windows client command prompt, type “ping <SmartConnect name FQDN>”. This should return an IP address from the Target cluster IP pool. If ping responds with a correct IP from the TARGET cluster:
    1. Cancel the ping command with CTRL-C and ping the same SmartConnect name again to make sure a second IP from the same Target cluster IP pool is returned. This verifies that SmartConnect and Isilon DNS are functioning as expected.
  2. If BOTH ping tests are successful, CONTINUE TO the MOUNT STEPS section below.
  3. If the ping fails, or the name does not resolve to a correct IP address of the TARGET cluster, continue with the steps below to debug DNS.
    1. From any Windows client machine, type “nslookup” and press the Enter key.
    2. Source Cluster DNS Test:
    3. Type "server x.x.x.x" and press the Enter key, where x.x.x.x is the subnet service IP (SSIP) of the source cluster.
    4. Type the FQDN of the SmartConnect zone used in the failover and press the Enter key. Hint: Refer to the failover log from DR Assistant for the full list of SmartConnect names that were failed over.
    5. The expected response is a failed resolution, since failover disables the SOURCE cluster DNS response.
    6. Example of a failed nslookup on the cluster you failed away from: “** server can't find userdata.ad1.test: REFUSED”
    7. NOTE: If the lookup does NOT return a REFUSED response, the SmartConnect name did not fail over correctly. Consult the recovery guide Networking section to fix the SmartConnect names.
  4. Target Cluster DNS Test:
    1. Test the TARGET cluster SSIP (subnet service IP) with DNS.
    2. Type "server y.y.y.y" and press the Enter key, where y.y.y.y is the subnet service IP of the target cluster.
    3. Type the FQDN of the SmartConnect zone used in the failover and press the Enter key. Refer to the failover log for the list of SmartConnect names that were failed over.
    4. Expected response: SUCCESSFUL NAME RESOLUTION RETURNING AN IP OF THE TARGET CLUSTER. This means SmartConnect was failed over correctly to the target cluster.
    5. If the DNS test fails at this step, OR the name fails to resolve, OR the wrong IP address is returned, consult the recovery guide Networking section to fix the SmartConnect names.
  5. If all DNS tests against the cluster SSIPs in this section pass:
    1. Root cause: Your internal DNS is not configured correctly for dual delegation, since the SSIP on the cluster answers DNS queries correctly. Stop here and correct the delegation using the dual delegation guide and video.
    2. Double-check that 2 name server entries exist for the SmartConnect name you are failing over.
    3. End of debugging; the issue has been found.
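
The decision order of the DNS tests above can be sketched in Python. This is only an illustration, not part of Eyeglass: the function names, the result strings, and the 10.2.0.0/24 pool range are hypothetical, and the inputs are assumed to be results you collected by hand from the ping and nslookup tests.

```python
import ipaddress


def ips_in_pool(ips, pool_cidrs):
    """True if at least one IP was returned and all fall inside the given ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in pool_cidrs]
    return bool(ips) and all(
        any(ipaddress.ip_address(ip) in net for net in networks) for ip in ips
    )


def diagnose_dns(client_ips, source_refused, target_ssip_ips, target_pool_cidrs):
    """Map the ping and nslookup results to the next action in the guide.

    client_ips:       IPs returned to the client by the two ping tests
    source_refused:   True if the source SSIP answered REFUSED
    target_ssip_ips:  IPs returned when querying the target SSIP directly
    """
    if ips_in_pool(client_ips, target_pool_cidrs):
        return "continue to mount steps"
    if not source_refused:
        return "fix SmartConnect names: source SSIP still answers"
    if not ips_in_pool(target_ssip_ips, target_pool_cidrs):
        return "fix SmartConnect names: target SSIP answer missing or wrong"
    return "check dual delegation in internal DNS"


# Hypothetical case: client ping failed, but both SSIP tests behaved as expected.
print(diagnose_dns([], True, ["10.2.0.11"], ["10.2.0.0/24"]))
```

The checks run in the same order as the steps: the client-side result first, then the source SSIP, then the target SSIP, with dual delegation left as the remaining cause.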

Steps to test Mounting a Share with Access Zone failover on a PC with no previously connected Mount:


  1. NOTE: Use a Windows client that DOES NOT have a connection to any cluster to perform this test correctly. A PC rebooted AFTER the failover should be used for this step!
  2. Mount test from a Windows client: in File Explorer, enter \\<FQDN of SmartConnect Zone in Access Zone>\<share name>
    1. If this step is successful and the write access test to the share is successful, SKIP to the section below on testing a PC with an existing mount to the source cluster.
  3. If you receive a Windows login popup asking for a user ID and password, this indicates an AD SPN Kerberos failover issue. (Check the Eyeglass failover log in DR Assistant for failed SPN Delete or Create steps, and match the SmartConnect name to the failed SPN step in the log.)
    1. Typically an SPN issue shows up as a popup login dialog box in Windows requesting a user ID and password, because authentication to the target cluster of the failover failed.
    2. On the source cluster, use the isi command to verify the FQDN(s) are NOT listed.
      1. Example: “isi auth ads spn list <your AD domain provider here>”
    3. On the target cluster, use the isi command to verify the FQDN(s) ARE listed.
      1. Example: “isi auth ads spn list <your AD domain provider here>”
  4. If the source has the SPN FQDN listed OR the target does NOT have an SPN listed matching your FQDN(s), MANUAL SPN failover is required to allow Kerberos authentication to succeed; see the recovery guide. You will need access to AD with the ADSI Edit tool.
    1. Objective: Ensure the FQDN SPN is on the Target cluster and the Source cluster does not have the SPN FQDN listed. You need to edit the SPN property of the Isilon computer objects.
  5. AFTER correcting the SPN issue in AD, retest mounting the shares and verify that no authentication popup occurs.
    1. If successful, SKIP to the section below on testing a PC with an existing mount to the source cluster.
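
The SPN comparison in steps 3 and 4 can be sketched as a small Python helper. It assumes you have copied the SPN strings from the “isi auth ads spn list” output on each cluster into two lists by hand, and that each SPN has the usual service/host form; the function names and sample values are hypothetical.

```python
def spn_hosts(spns):
    """Extract lower-cased host names from 'service/host' SPN strings."""
    hosts = set()
    for spn in spns:
        _, _, host = spn.strip().partition("/")
        if host:
            # Drop an optional ':port' suffix before comparing.
            hosts.add(host.split(":")[0].lower())
    return hosts


def spn_failover_problems(fqdns, source_spns, target_spns):
    """Return (fqdn, problem) pairs that call for a manual SPN fix in AD."""
    source_hosts = spn_hosts(source_spns)
    target_hosts = spn_hosts(target_spns)
    problems = []
    for fqdn in fqdns:
        name = fqdn.lower()
        if name in source_hosts:
            problems.append((fqdn, "still on source"))
        if name not in target_hosts:
            problems.append((fqdn, "missing on target"))
    return problems


# Hypothetical values: the SPN was never moved off the source cluster.
print(spn_failover_problems(
    ["userdata.ad1.test"],
    ["HOST/userdata.ad1.test"],
    [],
))
```

An empty result means the SPNs match the objective in step 4; any entry in the list points at an SPN to correct with ADSI Edit.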

Steps to correctly test a PC with existing Mount to Source cluster post Access Zone Failover and remount to test write access:

  1. Unmount the share/export. Note: Access Zone failover requires all Windows and Linux clients to unmount before attempting to access data on the target cluster.
    1. On Windows: “net use x: /delete” (replace x with the drive letter), OR
    2. In Windows Explorer, right-click the drive letter and select the Disconnect menu option.
    3. Note: This is not the most reliable method. If other SMB sessions to the cluster exist, this command will not release the session. The #1 way to ensure this step is done correctly is to REBOOT THE CLIENT MACHINE. Proceed below if you do not want to reboot the client.
  2. Use File Explorer to mount the Access Zone SmartConnect name: \\<FQDN of SmartConnect name>\<share name>
    1. Test write access to the share.
    2. If this step fails with a read-only file system error, continue to the next step.
      1. From a command prompt, type “netstat -an | more” to list TCP sessions, and look for an entry that lists an IP address of the source cluster on port 445. This means that an SMB session to the source cluster still exists and the unmount did not release the TCP session.
    3. Next step: Reboot the PC to guarantee there are no sessions to the source cluster, then repeat the mount of \\<FQDN>\<share name>.
  3. After a successful remount of the SmartConnect name:
    1. Verify the TCP session to the target cluster.
    2. From a command prompt, type “netstat -an | more” to list TCP sessions, and look for an entry that lists an IP address of the target cluster on port 445. This means that the SMB session to the target cluster is connected.
    3. Test write access to the share.
  4. Debugging is complete.
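
Scanning the “netstat -an” output for port 445 by eye is error-prone on a busy client, so the check can be sketched in Python. This assumes the usual Windows column layout (Proto, Local Address, Foreign Address, State) and treats only ESTABLISHED entries as live sessions, which is an assumption of this sketch; the function names, sample output, and cluster IPs are hypothetical.

```python
def smb_peers(netstat_output):
    """Return remote IPs of ESTABLISHED TCP connections to port 445."""
    peers = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        host, _, port = fields[2].rpartition(":")
        if port == "445" and fields[3].upper() == "ESTABLISHED":
            peers.append(host.strip("[]"))  # IPv6 addresses are shown in brackets
    return peers


def stale_source_sessions(netstat_output, source_ips):
    """Remote IPs on port 445 that belong to the source cluster."""
    known = set(source_ips)
    return [ip for ip in smb_peers(netstat_output) if ip in known]


# Hypothetical netstat output pasted from a client after the unmount.
SAMPLE = """
  Proto  Local Address          Foreign Address        State
  TCP    192.168.1.50:49712     10.1.0.11:445          ESTABLISHED
  TCP    192.168.1.50:49800     10.2.0.11:445          ESTABLISHED
  TCP    0.0.0.0:445            0.0.0.0:0              LISTENING
"""
print(stale_source_sessions(SAMPLE, ["10.1.0.11", "10.1.0.12"]))
```

A non-empty result means the client still holds a session to the source cluster and should be rebooted before retesting.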

Steps to test Mounting a DFS protected Share with DFS failover mode:

  1. From a Windows client machine connected to Active Directory, mount a DFS folder, for example \\<domain name>\<dfs root name>\<DFS folder name>
  2. Verify file write access by creating a file.
    1. If successful, done.
    2. Repeat the above on a selection of the DFS folders that were failed over, to be sure all DFS folders have write access.
  3. If the write test fails OR the mount fails or returns an error:
    1. Check the Eyeglass failover log (DR Assistant, Failover History tab). Open the failover log, look for the policy name, and confirm that the share rename step completed successfully for the DFS mount you are testing.
    2. If the rename step failed in the failover log, log in to the target cluster, find the igls-dfs-<share name> share, and manually rename it to remove the prefix. If all rename operations were successful, continue to the next step below.
    3. Now log in to the source cluster, find the share name, and rename it to igls-dfs-<share name>.
    4. Repeat these steps if more than one share failed to rename, using the failover log to repair share names on both the source and destination clusters. NOTE: The target cluster should have NO prefix and the Source cluster MUST have the prefix.
  4. Repeat the mount and write test from step #2 to verify that renaming resolved the issue. If successful, done.
    1. If the mount test still fails:
      1. Verify DFS referrals are correctly configured in the Microsoft DFS Management snap-in.
      2. Check each item below to verify the configuration:
      3. Open the DFS Management snap-in and right-click the DFS folder you are validating.
      4. See example.
      5. If both DFS referrals exist, pointing at the source and target cluster SmartConnect names, and the share name matches the screenshot example, continue to the next step.
      6. Test each referral mount path.
        1. Example tested from a Windows client: \\dr.ad1.test\smb2 (the failover target cluster SmartConnect name is used in this test).
        2. If the share mounts and data is visible, verify you can write data.
        3. This test verifies that DNS and SmartConnect are configured correctly and that AD authentication to the SMB share is correctly configured.
        4. If this step fails, continue below.
    2. Follow the SmartConnect and DNS validation steps earlier in this document.
      1. If those steps find a DNS resolution issue, fix the issue and retest the direct share referral UNC mount to the DR target cluster, or mount the DFS folder.
      2. If SmartConnect validation is successful, follow the steps earlier in this document for mounting a share with Access Zone failover on a PC with no previous mount.
      3. If the above is successful, debugging is complete.
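
The share naming rule for DFS failover (target cluster with NO igls-dfs- prefix, source cluster WITH the prefix) can be sketched as a Python check. It assumes you have listed the share names on each cluster yourself and that names are compared exactly as typed; the function name and sample share names are hypothetical.

```python
PREFIX = "igls-dfs-"


def rename_actions(share, source_shares, target_shares):
    """List (cluster, old_name, new_name) renames needed for one failed-over share."""
    actions = []
    # Target cluster must expose the share WITHOUT the prefix.
    if share not in target_shares and PREFIX + share in target_shares:
        actions.append(("target", PREFIX + share, share))
    # Source cluster must carry the prefix.
    if share in source_shares:
        actions.append(("source", share, PREFIX + share))
    return actions


# Hypothetical state where the rename step failed on both clusters.
print(rename_actions("smb2", ["smb2"], ["igls-dfs-smb2"]))
```

An empty list means the share names already follow the rule, so the rename step is not the cause of the failed mount.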
Copyright Superna LLC