HOS 2.1 Ceph Installation with Network Customisation (8-of-8)

The Truth, the whole truth…

So here are the errors that I encountered during this installation.

Error – 1

When running the configuration processor, it will happily accept a short encryption key and doesn’t warn or error until the end of the run.

InstallC (11)

So I re-run the configuration processor with a stronger encryption password, ‘H3lionhelion!’.

And so we move on to the next configuration issue that needs to be debugged…

Error – 2

cd ~/helion/hos/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

InstallC (12)

Looking at the network_groups.yml file, I can see that I forgot to modify this section:

InstallC (13)

I added ‘hos2.allthingscloud.eu’ as the external name

InstallC (100)
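For context, this setting lives in the load-balancer definition of the EXTERNAL-API network group. A rough sketch of what that section looks like after the edit, with field names as used in the HOS 2.1 example input models (treat the surrounding values as illustrative):

  - name: EXTERNAL-API
    load-balancers:
      - provider: ip-cluster
        name: extlb
        # name registered for the public API endpoints / VIP
        external-name: hos2.allthingscloud.eu
        tls-components:
          - default
        roles:
          - public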

Error – 3

It’s also complaining about my nic_mapping profiles assigned in the servers.yml file.

InstallC (15)

This profile, HP-DL360-8PORT, was missing from the nic_mappings file, so I added it.

InstallC (16)
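For reference, a hedged sketch of what the missing entry in nic_mappings.yml looks like for an 8-port server; the logical names and bus addresses below are placeholders, so substitute the values reported by lspci on the actual hardware:

nic-mappings:
  - name: HP-DL360-8PORT
    physical-ports:
      - logical-name: hed1
        type: simple-port
        bus-address: "0000:03:00.0"
      - logical-name: hed2
        type: simple-port
        bus-address: "0000:03:00.1"
      # ...and so on, one simple-port stanza for each of the eight ports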

Now we need to recommit all these changes to the repository and start again.

cd ~/helion/hos/ansible
git add -A
git commit -m "Fixed initial configuration errors"
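After the commit, the usual HOS 2.1 lifecycle steps rebuild the scratch area before anything is re-deployed; assuming the same flow as earlier in this series, that means re-running the configuration processor (supplying the encryption key again when prompted) and ready-deployment:

ansible-playbook -i hosts/localhost config-processor-run.yml
ansible-playbook -i hosts/localhost ready-deployment.yml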

 

Error – 4

cd ~/scratch/ansible/next/hos/ansible
ansible-playbook -i hosts/verb_hosts wipe_disks.yml

I get the following error when trying to wipe the disks:

InstallC (27)

This is because I encrypted the sensitive content; to get past this I need to supply --ask-vault-pass as part of the command line:

ansible-playbook -i hosts/verb_hosts wipe_disks.yml --ask-vault-pass
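If you would rather not type the vault password on every run, ansible-playbook also accepts --vault-password-file. A quick sketch, assuming the key is kept in a throwaway file (a security trade-off, so lock the permissions down and clean it up afterwards):

echo 'your-encryption-key' > ~/.vault_pass.txt   # placeholder path and contents
chmod 600 ~/.vault_pass.txt
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --vault-password-file ~/.vault_pass.txt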

 

Error – 5

And just when I think I’ve almost finished I get another failure as follows:

InstallC (30)

There are lots of complaints in ~/.ansible/ansible.log about corrupt disk partitions, so I attempted to “manually” clear these using the following script:

# zero the first couple of sectors of each data disk (sdb-sdh) on a host
clear_host() {
  ssh "$1" << EOF
    echo onhost connected
    sudo /bin/dd if=/dev/zero of=/dev/sdb bs=512 count=2
    sudo /bin/dd if=/dev/zero of=/dev/sdc bs=512 count=2
    sudo /bin/dd if=/dev/zero of=/dev/sdd bs=512 count=2
    sudo /bin/dd if=/dev/zero of=/dev/sde bs=512 count=2
    sudo /bin/dd if=/dev/zero of=/dev/sdf bs=512 count=2
    sudo /bin/dd if=/dev/zero of=/dev/sdg bs=512 count=2
    sudo /bin/dd if=/dev/zero of=/dev/sdh bs=512 count=2
    sync
EOF
}

export -f clear_host

# run against the three Ceph nodes (172.16.60.15-17)
seq 15 17 | while read i; do
  clear_host 172.16.60.$i
done

 

This also made no difference. I get the same partition error.
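A quick way to see what state a disk is really in is to query it directly on one of the nodes, for example (illustrative, using the first Ceph node from the script above):

ssh 172.16.60.15 "lsblk /dev/sdb; sudo /sbin/sgdisk --print /dev/sdb"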

My next attempt at a fix is to log on to the RAID controller on each Ceph server and delete and re-create the non-OS drives.

InstallC (31) InstallC (32) InstallC (33) InstallC (34) InstallC (35)

Now I’ll delete all Arrays EXCEPT Array A – the OS drive

InstallC (36) InstallC (37)

Repeat this for Arrays B through G; you should end up with something like this:

InstallC (38)

Now rebuild all the RAID 0 drive arrays

InstallC (39)

Select the “Create Arrays with RAID 0” option

InstallC (40)

Select OK

InstallC (41) InstallC (42) InstallC (43)

Repeat this process on the other two Ceph nodes, and then we can relaunch the deployment:

ansible-playbook -i hosts/verb_hosts site.yml --ask-vault-pass --limit @/home/graham/site.retry

 

Once again we have the exact same failure – time to look for known bugs…

Yes, this is a known bug: the wipe_disks functionality does not always work correctly.

It’s necessary to log on to each node and run the following command against each journal and OSD drive. Unlike the earlier dd attempt, which only zeroes the first couple of sectors, sgdisk --zap-all also wipes the backup GPT stored at the end of each disk:

/sbin/sgdisk --zap-all -- /dev/sd[b-h]

 

or use the following script:

# zap the GPT/MBR structures on each data disk (sdb-sdh) on a host
clear_host() {
  ssh "$1" << EOF
    echo onhost connected
    sudo /sbin/sgdisk --zap-all -- /dev/sdb
    sudo /sbin/sgdisk --zap-all -- /dev/sdc
    sudo /sbin/sgdisk --zap-all -- /dev/sdd
    sudo /sbin/sgdisk --zap-all -- /dev/sde
    sudo /sbin/sgdisk --zap-all -- /dev/sdf
    sudo /sbin/sgdisk --zap-all -- /dev/sdg
    sudo /sbin/sgdisk --zap-all -- /dev/sdh
    sync
EOF
}

export -f clear_host

# run against the three Ceph nodes (172.16.60.15-17)
seq 15 17 | while read i; do
  clear_host 172.16.60.$i
done

 

Error – 6

And now we continue from where we left off:

ansible-playbook -i hosts/verb_hosts site.yml --ask-vault-pass --limit @/home/graham/site.retry

 

This brings me to the next challenge:

InstallC (44)

As you can see, this is complaining about authentication. What you can’t see is that over 12 hours have passed since I re-joined the original failed screen session. Appending --limit @/home/graham/site.retry makes the run carry on from where it failed; however, it looks as though some authentication tokens have subsequently expired.

Re-launch the installation without the --limit @/home/graham/site.retry option:

ansible-playbook -i hosts/verb_hosts site.yml --ask-vault-pass
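Given how long a full site.yml run takes, it is worth launching it inside a screen (or tmux) session so that a dropped SSH connection doesn’t kill the deployment; a minimal sketch:

screen -S hos-deploy        # start a named screen session
ansible-playbook -i hosts/verb_hosts site.yml --ask-vault-pass
# detach with Ctrl-a d; re-attach later with: screen -r hos-deploy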

 

 
