$ ls
links  published  categories  alts  content

$ cat links
- Home
- Author

$ cat published


$ cat categories
- AWS
- Cloud
- Reference

$ cat alts
- Gopher Mirror
- text/plain; width=40
- text/plain; width=72
- text/plain; width=120
- application/x-troff

$ cat content

                    Musings on cloud initialization
________________________________________________________________________

I spent a while reading/learning about how to set up cloud instances.
This document summarizes some of those things.


                             The Manual Way
________________________________________________________________________

This is the way I've been doing things up until now.  If I need the ma-
chine set up a certain way, then I'll just start an instance, SSH in,
and run the commands I need.  Then, if I need to start up many instances
like that, I'll make a [snapshot] and convert that to an [AMI] .

This way definitely works, but it means that if I made a mistake on the
2nd command, I have to manually re-run everything up until that point.
It also means that explaining what I did is next to impossible unless I
take proper notes.

The nice thing is that it is easy to debug and test things when you
don't know how they're going to work.  At every step along the way, I
can manually test how it's working.

I also find that it's hard to (de)compose.  The latest example is that I
set up an instance with two main features: logs get forwarded to a sin-
gle instance, and it runs an application on start up.  If I want to de-
compose this into each separate part, and maybe run a different applica-
tion with logging, or run two applications on start up but not forward
logs, then I'm at a loss.

Pros: Easy to understand.  Easy to test.

Cons: Hard to explain.  Hard to reproduce.  Hard to (de)compose.


                 Aside: Running applications on startup
________________________________________________________________________

I like to use [systemd] to do this because I understand it reasonably
well.  I also know that it will get things right with regard to
restarting my application if it crashes.

I want to be able to use systemd user services to run things because it
means that I don't have to manage everything as root.  I ran into prob-
lems where I couldn't get it to start actually running my user services
on startup, and I'm not sure why.  There might have been something wrong
with the [linger] setting or something to that effect.
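
If I ever revisit this, lingering is controlled through [loginctl] .
Here is a sketch of what I would try (the user name is a placeholder):

--------8<--------------------------------------------------------------
$ sudo loginctl enable-linger myuser
$ loginctl show-user myuser --property=Linger
-------->8--------------------------------------------------------------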

In the end, I just used regular services, which meant that I threw a
[.service] file in [/etc/systemd/system/] .  A simple one looks like
this.

--------8<--------------------------------------------------------------
[Unit]
AssertPathExists=/usr/local/bin/myapplication

[Service]
ExecStart=/usr/local/bin/myapplication --my-flags
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
-------->8--------------------------------------------------------------

The [Type=simple] setting is implied by the [ExecStart=] one.  I also
find that this is the right level of restart logic.  I had tinkered in
the past with some restart settings like [StartLimitIntervalSec=] and
[StartLimitBurst=] because they are referenced nearby in the man page,
but they are almost certainly the wrong choice for the types of applica-
tions I write.

After writing this file to [/etc/systemd/system/myapplication.service] ,
it is a simple matter of enabling and starting the service to get it to
auto-run every time the instance starts.

--------8<--------------------------------------------------------------
$ sudo systemctl enable myapplication
$ sudo systemctl start myapplication
-------->8--------------------------------------------------------------
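
Newer systemd versions also accept [systemctl enable --now] , which
does both steps in one command.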


                       Automating the manual way
________________________________________________________________________

You can automate the manual way pretty trivially.  Either via [scp &&
ssh && bash] or using a library like [paramiko] .  Furthermore, you
could use [Puppet] or [Chef] or whatever the latest tool like that is
and let it handle setting up the instance.
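
As a minimal sketch of the [scp && ssh && bash] flavor (the host and
script names here are placeholders):

--------8<--------------------------------------------------------------
$ scp setup.sh admin@my-instance:/tmp/setup.sh
$ ssh admin@my-instance 'bash /tmp/setup.sh'
-------->8--------------------------------------------------------------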

The main problem with this is that it all happens after boot, when the
entire stack is already up, and it requires SSH access.  Beyond that,
there is still the problem of having to write/run everything as one
cohesive unit, which makes things harder to debug.  One nice way to
facilitate testing is to bring up a single instance and copy each line
into a shell script as you run it.  That does work, but it makes it
hard to relay what testing was done between steps.

Pros: Automatic.  Simple to understand.

Cons: Tests aren't encoded.  Runs after boot, not on boot.


                       Automating the Docker way
________________________________________________________________________

If [Docker] is running on the instance, then it's pretty easy to push
the Docker image to a registry and then pull and run it on the
instance.  You can even set up some automation to automatically pull
the latest image, but it's not the most straightforward.
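
In sketch form (the registry and image names are placeholders):

--------8<--------------------------------------------------------------
# On the build machine:
$ docker push registry.example.com/myapplication:latest

# On the instance:
$ docker pull registry.example.com/myapplication:latest
$ docker run --detach --restart=unless-stopped \
        registry.example.com/myapplication:latest
-------->8--------------------------------------------------------------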


                        Automating the cloud way
________________________________________________________________________

This kind of problem has already been solved by cloud people who need to
install different software on different instances.  It's also been fair-
ly standardized because of the availability of many good cloud services.

The [cloud-init] approach is a YAML file that encodes many different de-
vops functions and runs through them.  These functions include: write
this content to this file, run this command in this directory, create
this user with this name, and more.
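
For example, a small cloud-config might look like this.  The
[packages] , [write_files] , and [runcmd] keys are standard
cloud-config; the specific package, file, and command are made up.

--------8<--------------------------------------------------------------
#cloud-config
packages:
  - nginx
write_files:
  - path: /etc/myapplication/config.ini
    content: |
      [main]
      debug = false
runcmd:
  - systemctl restart nginx
-------->8--------------------------------------------------------------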

I see one large flaw in this approach for general cloud instance config-
uration: it is non-trivial to have things run in a particular order or
interleaved in a particular way.

Consider: to create a systemd service, we need to first [pip install]
the application ("run a command"), then we need to create the systemd
service file ("create a file"), and finally enable the service ("run a
command").  As far as I can tell, cloud-init doesn't work like this.  It
would want to run both commands one after another and create the file
either before or after both commands.

In practice, this doesn't have to be a problem because one could create
the files using a shell script ("run a command"), but it does feel at
odds with everything else.
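
For instance, the whole interleaved sequence can be shoved into
[runcmd] , at the cost of the YAML just wrapping a shell script (this
reuses the names from the systemd example above):

--------8<--------------------------------------------------------------
#cloud-config
runcmd:
  - pip install myapplication
  - |
    cat > /etc/systemd/system/myapplication.service <<'EOF'
    [Service]
    ExecStart=/usr/local/bin/myapplication --my-flags
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl enable --now myapplication
-------->8--------------------------------------------------------------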

There is a very good story about extending cloud-config, however.  By
creating custom "part handlers," one can define exactly what they want
to happen when a certain key is found in the YAML file.  This means that
one could create something that would allow the interleaved scenario
from above to be efficiently defined.
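
A part handler is a Python file whose first line is [#part-handler]
and which defines two functions.  This is just the documented skeleton;
the mime type and the body are placeholders for the interleaving logic.

--------8<--------------------------------------------------------------
#part-handler

def list_types():
        # The mime types this handler claims.
        return ["text/x-interleaved-setup"]

def handle_part(data, ctype, filename, payload):
        # Called once with ctype "__begin__", once per matching part,
        # and once with "__end__".
        if ctype in ("__begin__", "__end__"):
                return
        # Parse `payload` here and run the steps in order.
        print("handling %s (%s)" % (filename, ctype))
-------->8--------------------------------------------------------------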

I also think there would be a lot of value in creating a Docker-to-
Cloud-Config converter.  There is a similar Docker-to-AMI converter, but
it functions like "Automating the manual way" from above.

Note: there is some nuance to the distinction between cloud-init and
cloud-config.  The cloud-init tool accepts many different types of
input: a shell script, a custom part-handler, a URL to download and
run, a cloud-config file, etc.  It also allows you to combine multiple
inputs together in one "package" which it will run through
sequentially.  The cloud-config format supports things like: running
scripts, creating users, writing to files, installing packages, etc.
In other words, cloud-config is just one of the input formats that
cloud-init understands, which makes the two easy to conflate.

In the ideal case, I would be able to send multiple cloud-config files
to cloud-init and have each file set up a different tool, but I believe
cloud-init will merge these files together, thus breaking in the way I
described above because it would interleave things.  I am unsure of
this though, and it should be tested, because that would be an easy
solution to my problem.
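
If I do test it, [cloud-init devel make-mime] looks like the right
tool for building the multi-part package, though I haven't verified
the exact invocation:

--------8<--------------------------------------------------------------
$ cloud-init devel make-mime \
        -a logging.yaml:cloud-config \
        -a application.yaml:cloud-config > user-data
-------->8--------------------------------------------------------------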

Pros: Supported by every cloud.  Can be run earlier in the system setup.

Cons: File format is a little hard to understand (compared to a shell
script).  Composability is unclear.  Not applicable to non-cloud
machines.


                               Conclusion
________________________________________________________________________

I haven't found the best way to set up a cloud instance.  Currently,
I'm leaning toward cloud-init with cloud-config, but only if the
composition story makes sense.  Otherwise, I'll go with cloud-init
running regular shell scripts, plus a framework that facilitates
easier testing in between steps.


                               Link Dump
________________________________________________________________________

[0]: https://cloudinit.readthedocs.io/en/latest/topics/format.html
Details how the cloud-init format works and how to combine different
steps into one file.

[1]: https://cloudinit.readthedocs.io/en/latest/topics/examples.html
A nice set of examples of the cloud-config format.

[2]: https://serverfault.com/a/413408
I ran into a problem where I needed to set custom environment variables
for different instances.  This answer shows one way to do this, and it's
the way I ended up using.

You can create a [/etc/systemd/system/myapplication.service.d] directory
and put [.conf] files inside.  Then systemd will read and merge all of
these files together, thus ensuring that your environment variables will
be loaded.

Note: You still need to use the right section headers.  Don't forget
to put the [Service] header before the [Environment=] settings, or it
won't work.
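
For example (the file name [env.conf] is arbitrary):

--------8<--------------------------------------------------------------
# File: /etc/systemd/system/myapplication.service.d/env.conf

[Service]
Environment=MY_SETTING=value
-------->8--------------------------------------------------------------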

[3]: https://serverfault.com/a/410438
Inside of the instance, I wanted to know what type it is (i.e.  [t2.mi-
cro] or [t2.medium] .  AWS exposes a web server that has this informa-
tion at [169.254.169.254] .

Here is a bash snippet to store some of these values as environment
variables.

--------8<--------------------------------------------------------------
keys=( $(curl -s http://169.254.169.254/latest/meta-data/) )
for key in "${keys[@]}"; do
        case "$key" in
        (*/*)
                # Skip any multi-level keys
                continue;;
        esac

        value=$(curl -s "http://169.254.169.254/latest/meta-data/$key")

        # Before: $key looks like "ami-id"
        # After: $key looks like "AMI_ID"
        key=${key^^}
        key=${key//-/_}

        # Safely use eval without any worry about escaping anything
        eval "AWS_$key=\$value"
done
-------->8--------------------------------------------------------------

Afterwards, you can use variables like [$AWS_AMI_ID] or
[$AWS_INSTANCE_TYPE] .

[4]: https://forums.aws.amazon.com/thread.jspa?threadID=250683
Apparently AWS instances take a while to disappear after terminating
them, so it's fine if they don't disappear from the console immediately.

[5]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
AWS has user-data which is sent to cloud-init, so it should be in cloud-
init formats (i.e. a shell script, a cloud-config, or many things con-
catenated).
________________________________________________________________________

I'm a big proponent of using all the built-in tools in Python.  This
manifests in weird ways: I like the http.server module; I abuse
multiprocessing.Manager; and I like using pure distutils.

The latter gives me problems sometimes because no one seems to use it
like I do.  I often find myself getting lost in Python's lackluster
documentation for this module, just trying to find the name of the
argument I need.  It's starting to get a little silly how often and
how long I spend on this problem.

Of course, one solution to this problem is "just use an IDE with auto-
completion."  Another is "just use setuptools which has better documen-
tation."  For various reasons, I prefer to use simpler editors and also
to use standard library functions.

To that end, this document is a consolidation of the things I'm usually
looking for.  It mostly serves as a reference for myself, but maybe it
will be helpful to someone else.

Before the document gets into the details, here are some links that are
helpful.

API documentation for distutils.core.setup
[0]: https://docs.python.org/3/distutils/apiref.html

Official examples for distutils.core.setup
[1]: https://docs.python.org/3/distutils/setupscript.html

List of valid classifier values
[2]: https://pypi.org/classifiers/

API documentation for pkgutil.get_data
[3]: https://docs.python.org/3/library/pkgutil.html#pkgutil.get_data


                          The Simplest Script
________________________________________________________________________

If you are creating a package called MYPACKAGE, your script would look
like this.

--------8<--------------------------------------------------------------
from distutils.core import setup

setup(
        name='MYPACKAGE',
        version='0.1.0',
        packages=[
                'MYPACKAGE',
        ],
)
-------->8--------------------------------------------------------------

You should have a directory called MYPACKAGE with at least an
__init__.py file inside.

--------8<--------------------------------------------------------------
/
  /setup.py
  /MYPACKAGE/
    /MYPACKAGE/__init__.py
-------->8--------------------------------------------------------------
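
From that directory, the usual distutils commands work, e.g. building
a source distribution or installing the package:

--------8<--------------------------------------------------------------
$ python setup.py sdist
$ python setup.py install
-------->8--------------------------------------------------------------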


                          Adding Dependencies
________________________________________________________________________

If you need certain dependencies, you can declare them with the
requires keyword.  Note that distutils only records this as metadata;
it does not install anything itself, and the version qualifier must be
wrapped in parentheses.  In this example, we require at least version
2.20.0 of requests.

--------8<--------------------------------------------------------------
from distutils.core import setup

setup(
        name='MYPACKAGE',
        version='0.1.0',
        packages=[
                'MYPACKAGE',
        ],
        requires=[
                'requests (>=2.20.0)',
        ],
)
-------->8--------------------------------------------------------------


                          Adding an Executable
________________________________________________________________________

Edit: Oops.  This one does actually depend on setuptools.  If you in-
stall the package with "pip install ."  then it will automatically use
setuptools for you, hence my confusion.

Sometimes you aren't trying to just create a library but also need an
actual executable.  Somewhat confusingly, in setuptools land, this is
called an "entry point."  If you want an executable called MYEXEC for
the package MYPACKAGE, then you'll want this.

--------8<--------------------------------------------------------------
# File: setup.py

from distutils.core import setup

setup(
        name='MYPACKAGE',
        version='0.1.0',
        packages=[
                'MYPACKAGE',
        ],
        entry_points={
                'console_scripts': [
                        'MYEXEC=MYPACKAGE.__main__:cli',
                ],
        },
)
-------->8--------------------------------------------------------------

You should have a __main__.py script with a cli function inside.  A sim-
ple __main__.py script looks like:

--------8<--------------------------------------------------------------
# File: __main__.py

from . import hello

def main(name):
        print(f'The output of hello({name!r}) is {hello(name)!r}')

def cli():
        import argparse

        parser = argparse.ArgumentParser()
        parser.add_argument('name')
        args = vars(parser.parse_args())

        main(**args)

if __name__ == '__main__':
        cli()
-------->8--------------------------------------------------------------

The corresponding __init__.py might look like:

--------8<--------------------------------------------------------------
# File: __init__.py

def hello(name):
        return name.upper()
-------->8--------------------------------------------------------------

Now after you install the package, you can use MYEXEC as a normal script
and pass it arguments like: MYEXEC George.
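
Putting it all together:

--------8<--------------------------------------------------------------
$ MYEXEC George
The output of hello('George') is 'GEORGE'
-------->8--------------------------------------------------------------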

The directory structure here is:

--------8<--------------------------------------------------------------
/
  /setup.py
  /MYPACKAGE/
    /MYPACKAGE/__init__.py
    /MYPACKAGE/__main__.py
-------->8--------------------------------------------------------------


                     Changing the Package Directory
________________________________________________________________________

Sometimes you want to put your code in a different directory than
Python expects.  By default, Python wants MYPACKAGE to be located at
./MYPACKAGE, but you can change this to look at ./src/MYPACKAGE or any
other directory instead.

--------8<--------------------------------------------------------------
# File: setup.py

from distutils.core import setup

setup(
        name='MYPACKAGE',
        version='0.1.0',
        packages=[
                'MYPACKAGE',
        ],
        package_dir={
                'MYPACKAGE': 'src/MYPACKAGE',
        },
)
-------->8--------------------------------------------------------------

The directory structure now looks like:

--------8<--------------------------------------------------------------
/
  /setup.py
  /src/
    /src/MYPACKAGE/
      /src/MYPACKAGE/__init__.py
-------->8--------------------------------------------------------------


                       Including Extra Data Files
________________________________________________________________________

You may want to include some extra files with your package.  For in-
stance, I like to include an index.html page with my web server pack-
ages.

--------8<--------------------------------------------------------------
# File: setup.py

from distutils.core import setup

setup(
        name='MYPACKAGE',
        version='0.1.0',
        packages=[
                'MYPACKAGE',
        ],
        package_data={
                'MYPACKAGE': [
                        'static/*',
                ],
        },
)
-------->8--------------------------------------------------------------

Now any file in the static directory in your package will be included.

--------8<--------------------------------------------------------------
/
  /setup.py
  /MYPACKAGE/
    /MYPACKAGE/__init__.py
    /MYPACKAGE/static/
      /MYPACKAGE/static/index.html
-------->8--------------------------------------------------------------

To retrieve this file at runtime, you can use pkgutil.get_data.

--------8<--------------------------------------------------------------
# File: __init__.py

import pkgutil

index_html = pkgutil.get_data('MYPACKAGE', 'static/index.html')

print(type(index_html))
#  => <class 'bytes'>
-------->8--------------------------------------------------------------


                         A More Complete Script
________________________________________________________________________

There are lots of parameters, but I suspect that the ones most useful
for actually distributing a package are as follows.

--------8<--------------------------------------------------------------
# File: setup.py

from distutils.core import setup

long_description = """\
This is the long description for my package.

It can be pretty long.

PyPI renders this as reStructuredText by default.  Markdown requires
the setuptools-only long_description_content_type argument.
"""

setup(
        name='MYPACKAGE',
        version='0.1.0',
        description="One line description",
        long_description=long_description,
        author='John Smith',
        author_email='johnsmith@example.com',
        url='https://github.com/example/MYPACKAGE',
        license='MIT',
        keywords=[
                'cool',
                'useful',
                'whatever',
        ],
        classifiers=[
                'Development Status :: 1 - Planning',
                'Programming Language :: Python :: 3.8',
        ],
        packages=[
                'MYPACKAGE',
        ],
        requires=[
                'requests (>=2.20.0)',
        ],
)
-------->8--------------------------------------------------------------