Elasticsearch: degraded instance

AWS will warn you when you have an instance running on degraded hardware. You will get a scheduled retirement date and, on this date, your instance will be switched off.

However, your hardware is likely already compromised. If you’re running Elasticsearch on this hardware you need to decommission the node and create another node.

The steps to do this are (a rough sketch of the commands follows the list):

1. get current status (from outside the box)

2. disable shard allocation

3. stop service

sudo service elasticsearch stop

4. terminate the instance via console

5. create a new instance – e.g. with Terraform

6. re-enable shard allocation

7. validate cluster health goes back to green
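
A rough sketch of those commands (steps 1, 2, 6 and 7), assuming the Elasticsearch HTTP API is reachable on localhost:9200 – adjust host and port for your cluster:

# 1 and 7: check cluster status / health
curl -XGET 'localhost:9200/_cluster/health?pretty'

# 2: disable shard allocation
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

# 6: re-enable shard allocation
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'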

 

Rant on technology and legacy institutional processes

The commercial web has been about since 1995.

  1. Why has it taken 23 years for my accounting software to finally connect to my bank to withdraw statements?

‘cos of damn institutions like the law.

2. Why do I have to click Accept Cookies on every website I go to?

‘cos of damn institutions like the law.

 

On point 2, stuff was fine until about 5 years ago when some Judge or the Queen realised that cookies existed.

 

Waiting…

Given the incredible speed of modern computers and networks I shouldn’t have to spend half my time waiting for various computer tasks to complete.

E.g. I’m simultaneously waiting on 3 separate devices for things to load.

  1. MacBook Pro Number 1
    1. for my bank web pages to load AND
    2. for this blog post dashboard to load so I can write this rant about slow computers
  2. Another MacBook Pro: for Terraform to complete
  3. My phone for Spotify to load

Then, waiting for my external Bluetooth speaker and phone to connect.

docker port

Golden rule:


port1:port2 means you’re mapping port1 on the host to port2 on the container.

i.e. host:container


Say you run:

docker container run --rm -d --name web -p 8080:80 nginx

you’re mapping port 8080 on the host to port 80 in the container.

-p => publish a container’s port to the host

docker port web

gives
80/tcp -> 0.0.0.0:8080

which means:

80 on container maps to 8080 on host

See also Tech Rant

https://docs.docker.com/engine/reference/commandline/run/#publish-or-expose-port--p---expose

https://docs.docker.com/engine/reference/commandline/port/

Tech Rant

Part of the problem with Tech is keeping so many things going in your brain at once. And having to be an expert at so many things.

Example 1

I’m trying to figure out how docker port works. i.e. with this:

docker container run --rm -d --name web -p 8080:80 nginx

is 8080 on the host or the container?

E.g. I can run this: docker port web
80/tcp -> 0.0.0.0:8080

but I’m not clear on the mapping so I check the docs:

https://docs.docker.com/engine/reference/commandline/port/#description

which shows you an example but does not explain what the line actually means.

Is it container to host or host to container?

Next doc is https://docs.docker.com/engine/reference/commandline/run/#publish-or-expose-port--p---expose

This explains that 80:8080 is host -> container. Which would mean that the initial mapping I used for nginx above is mapping 80 in the container to 8080 on the host, i.e. the other way around.

Let’s test.

1. from the VM

Assuming nginx is outputting to 80 (seems reasonable!) then I should get something back from 8080 on the host – i.e. in the VM (we haven’t even started on what’s happening on the actual host, i.e. my Mac!). So, from the VM:

curl localhost:8080

(what’s the format for using curl – is it curl localhost:8080 or curl localhost 8080? Check some more docs: https://www.unix.com/shell-programming-and-scripting/241172-how-specify-port-curl.html – the confusion is not unreasonable given that telnet doesn’t use a colon, i.e. you’d do telnet localhost 8080 – https://www.acronis.com/en-us/articles/telnet/)

which thankfully gives us some nginx output.

So, going back to:

docker port web
80/tcp -> 0.0.0.0:8080

This is saying:

80 on container maps to 8080 on host

Annoyingly, this is the other way round to the format used earlier (i.e. host to container).

If I do docker ps, the PORTS column shows:

0.0.0.0:8080->80/tcp

which even more annoyingly is the other way around! i.e. host -> container. I guess the way to remember it is that it’s host -> container unless you examine the container itself – e.g. using docker port web.

 

Some gotchas here:

  • curl localhost 8080 would give connection refused ‘cos curl defaults to port 80 – given that we’ve got the command format wrong, it’s testing port 80
  • if we’d tested using the container IP address instead – e.g.

docker container inspect web

gives "IPAddress": "172.17.0.3"

curl 172.17.0.3:80

that gives us nginx output. ‘cos we’re using the container IP address and port.

and

curl 172.17.0.3:8080 would give:
curl: (7) Failed to connect to 172.17.0.3 port 8080: Connection refused
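
Incidentally, you can pull just the IP address out of inspect with a --format template – a sketch, using the web container from above:

docker container inspect -f '{{ .NetworkSettings.IPAddress }}' web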

 

2. from the container

we need to exec from the VM into the container. Another doc page: https://docs.docker.com/engine/reference/commandline/exec/

docker exec -it web /bin/bash

and

curl localhost:80
bash: curl: command not found

So we need to install curl. More docs, ‘cos installing is different on the Mac (I use brew), on Debian (apt) and on CentOS (yum) – so first, find out which OS the container is running:

cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"

so we’re using Debian.

It should be apt-get install curl, but that fails for me.

More docs on how to install on Debian:

https://www.cyberciti.biz/faq/howto-install-curl-command-on-debian-linux-using-apt-get/

says apt install curl, which gives me the same problem.

More docs – seems like you have to run apt-get update first.

https://stackoverflow.com/questions/27273412/cannot-install-packages-inside-docker-ubuntu-image
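
i.e. inside the container, something like:

apt-get update && apt-get install -y curl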

And finally I can verify that, in the container,

curl localhost:80

outputs nginx content.

 

Note also: I’ve got Ubuntu in the VM and Debian in the container.

VM: my Vagrantfile uses  config.vm.box = "ubuntu/bionic64"

Container: docker container run --rm -d --name web -p 8080:80 nginx uses a Debian based container.

 


Finally, I write a blog post so I can remember in future how it all works without having to spend an entire morning figuring it out. I open up WordPress and it’s using Gutenberg which I’ve been struggling with. Trying to disable it is a pain. This doesn’t work:

How to Disable Gutenberg & Return to the Classic WordPress Editor

Groan. I just pasted a link and don’t want the Auto Insert content feature; however, I can’t even be bothered to try and figure out how to disable the Auto Insert behaviour.

In the end, I posted a test post and went to All Posts and clicked Classic Editor under the post.

Another rant: WordPress’ backtick -> formatted code only occasionally works – very frustrating.

 

3. To close the loop let’s test from my Mac

As we’re using 8080 on the host, let’s forward to 8081 on the Mac. Add this to the Vagrantfile:

config.vm.network "forwarded_port", guest: 8080, host: 8081

https://www.vagrantup.com/docs/networking/forwarded_ports.html
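
To pick up the change you’d then reload the VM – something like (vm1 being the machine name used below):

vagrant reload vm1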

Another rant – trying to reprovision the VM with this put it into a continuous loop. I couldn’t be bothered to debug it so I just did vagrant destroy vm1 and started again.

https://www.vagrantup.com/intro/getting-started/teardown.html

Then some more Waiting. e.g.

==> vm1: Waiting for machine to boot. This may take a few minutes…

Given how fast computers are it seems crazy how much Waiting we have to do for them. E.g. web browsers, phones, etc.

End of that rant.

 

So, testing from my Mac:

http://localhost:8081/

did not work.

I tried

http://localhost:8080/

which did work. Wtf?

I gave up here. Kind of felt that figuring out the problems here was a rabbit hole too far.

 

Example 2

You’ve got a million more important things to do but you suddenly find in your AWS console that:

Amazon EC2 Instance scheduled for retirement

Groan. This is part of an Elasticsearch cluster.

So, should be a pretty standard process.

  • disable shard allocation
  • stop elasticsearch on that node
  • terminate the instance
  • bring up another instance using terraform
  • re-enable shard allocation

but you find unassigned_shards is stuck at x.

So, now you’ve got to become an elasticsearch expert.

E.g. do a

curl -XGET localhost:9200/_cluster/allocation/explain?pretty

and work out why these shards aren’t being assigned.
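
A quick way to see which shards are stuck – a sketch, again assuming the API is on localhost:9200:

curl -XGET 'localhost:9200/_cat/shards?v' | grep UNASSIGNED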

There goes the rest of my day wading through reams of debugging output and documentation.

https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/

 

Example 3

Finding information is so slow.

E.g. you want to know why Elasticsearch skipped from versions 2.x to versions 5.x.

And whether it’s important.

So you Google. Eventually, hiding amongst the Release Notes is a StackOverflow page (https://stackoverflow.com/questions/38404144/why-did-elasticsearch-skip-from-version-2-4-to-version-5-0 ) which says go look at this 1 hour 52 minute keynote for the answer.

Unless you’re an Elasticsearch specialist, you don’t want to spend that kind of time finding out that info (the answer, btw, is in Elasticsearch: why the jump from 2.x to 5.x).

 

Example 4

Even after you’ve spent days finding a solution, the answer is complex.

E.g. let’s say you have to do a Production restore of Elasticsearch.

Can you imagine the dismay you’d feel when you have to face the complex snake’s nest contained here:

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/modules-snapshots.html#_restore

The preconditions start with:

The restore operation can be performed on a functioning cluster. However, an existing index can be only restored if it’s closed and has the same number of shards as the index in the snapshot.

and continue for page after page.

There is no simple command like: elasticsearch restore data from backup A
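
What you actually end up with is something along these lines – the repository, snapshot and index names here are made up for illustration:

# an existing index has to be closed before it can be restored
curl -XPOST 'localhost:9200/my_index/_close'

# restore that index from a snapshot in the my_backup repository
curl -XPOST 'localhost:9200/_snapshot/my_backup/snapshot_1/_restore' -H 'Content-Type: application/json' -d '{"indices": "my_index"}'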

 

Instead you have to restore an index from a snapshot. How do you work out whether a snapshot contains an index?

Easy! Just search dozens of Google results, wade through several hundred pages of Elasticsearch documentation and Stackoverflow questions for different versions of Elasticsearch and Curator. E.g.

Query Google for:

restore index from elasticsearch snapshot – https://www.google.com/search?q=restore+index+from+elasticsearch+snapshot&oq=restore+index+from+elasticsearch+snapshot&aqs=chrome..69i57j0.8028j0j4&sourceid=chrome&ie=UTF-8

and you don’t get much that’s useful.

 

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html

and

https://stackoverflow.com/questions/39968861/how-to-find-elasticsearch-index-in-snapshot

Funny – in the Stackoverflow post the answerer had the temerity to say:

I’m confused by your question as you have posted snapshot json which tells you exactly which indices are backed up in each snapshot.

I guess that exactly reflects the lack of understanding / empathy in the tech industry – not even being able to make the leap to see how the question generalises to different indices and snapshots.
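
For what it’s worth, the snapshot API will list what each snapshot contains – a sketch, assuming a repository called my_backup (each snapshot in the response includes an indices array):

curl -XGET 'localhost:9200/_snapshot/my_backup/_all?pretty'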

Example 5

Software that’s ridiculously complex to use.

E.g. take ElasticHQ – a simple web GUI.

But how do you list snapshots?

Perhaps their documentation says something.

http://docs.elastichq.org/

No.

How about a search?

That returns 2 results. One about an API and another containing the word in code.

http://docs.elastichq.org/search.html?q=snapshot&check_keywords=yes&area=default

For anyone searching, it’s under Indices > Snapshots.

But if you’ve clicked on Indices, beware ‘cos you’re now in an infinite loop waiting for Indices to return.

Example 6: the acres of gibberish problem

Let’s say you’re learning a new technology.

The first thing you usually do is some Hello World system.

So, you try this with a Kubernetes pod. E.g.
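
something like this, perhaps (the pod name hello-web and the nginx image are purely illustrative):

kubectl run hello-web --image=nginx --restart=Never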

and do
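
presumably a status check of some sort:

kubectl get pods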

All seems pretty rosy? No!

Now you’re in acres of gibberish territory.

You’ve gone from trying to do a simple Hello World app to having to figure stuff out like:

  • NodeHasSufficientDisk
  • NodeHasNoDiskPressure
  • rpc error: code = Unknown desc = Error response from daemon
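
That sort of output tends to surface when you go digging with commands like (pod name as per the sketch above):

kubectl describe pod hello-web
kubectl get events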

It’s the equivalent of learning to drive but, on trying to signal left, finding you have to replace the electrical circuitry of the car to get it working correctly.

This seemed to fix the problem:

e.g.