Blog

  • Working With EBS

    For the web project I am currently working on, I started needing more disk space on my AWS instance than the 8 GB root volume that comes with t2 instances. Eventually I should probably move to S3 to host static assets like these, but for now I took the opportunity to learn how to attach EBS volumes to my EC2 instances.

    I was surprised at how much patching needed to be done on top of Terraform to properly mount EBS volumes. The relevant resources are aws_instance, aws_ebs_volume, and aws_volume_attachment.

    Basically, aws_ebs_volume represents an EBS volume, and aws_volume_attachment associates that volume with an aws_instance. This in and of itself is not hard to grasp, but there are several gotchas.

    When defining aws_instance, it is important to specify not only the region (set on the provider) but also the specific availability zone:

    resource "aws_instance" "instance" {
      availability_zone =  "us-east-1a"
      ...
    }
    
    resource "aws_ebs_volume" "my_volume" {
      availability_zone =  "us-east-1a"
      size              = 2
      type = "gp2"
    }
    

    This is because an EBS volume can only be attached to an instance in the same availability zone. Now, if you don’t care about what’s on the EBS volume and can blow it away each time the EC2 instance changes, then this is not an issue, and you can simply compute the availability zone each time:

    resource "aws_instance" "instance" {
      ...
    }
    
    resource "aws_ebs_volume" "my_volume" {
      availability_zone = "${aws_instance.instance.availability_zone}"
      ...
    }
    

    This is a long way of saying that if you need an EBS volume that persists across instance re-creations, you must specify the availability zone explicitly.

    Another gotcha is that even after attaching the EBS volume to an instance, the volume must actually be mounted before it can be used. It turns out that you have to mount the drive yourself (and add it to fstab so the mount survives reboots), and you have to do it after the volume attachment finishes. A remote provisioner can be used here:

    resource "aws_volume_attachment" "attachment" {
      device_name = "/dev/sdh"
      skip_destroy = true
      volume_id   = "${aws_ebs_volume.my_volume.id}"
      instance_id = "${aws_instance.instance.id}"
    
      provisioner "remote-exec" {
        script = "mount_drives.sh"
        connection {
          user = "deploy_user"
          private_key = "${file("~/.ssh/id_rsa")}"
          host = "${aws_instance.instance.public_ip}"
        }
      }
    }
    

    I picked the name /dev/sdh for my volume, but Ubuntu/Debian maps this name to /dev/xvdh, and the mapping, though consistent within a Linux distro, will differ between distros. Amazon Linux AMIs apparently create symbolic links so that the name you chose for the volume is preserved. In any case, mount_drives.sh just needs to format the device on first use, mount it, and add an fstab entry so the mount survives reboots.
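
    A minimal sketch, assuming the device shows up as /dev/xvdh and the mount point is /images:

    #!/usr/bin/env bash
    # mount_drives.sh (sketch): format the volume on first use, mount it, and
    # persist the mount in fstab. DEVICE and MNTPOINT here are assumptions.
    set -e

    DEVICE=/dev/xvdh
    MNTPOINT=/images

    # Only create a filesystem if the device does not already have one.
    if ! sudo blkid "$DEVICE"; then
      sudo mkfs -t ext4 "$DEVICE"
    fi

    sudo mkdir -p "$MNTPOINT"
    sudo mount "$DEVICE" "$MNTPOINT"

    # nofail keeps the instance bootable even if the volume is missing.
    echo "$DEVICE $MNTPOINT ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab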

    In this case, we take the device name (mapped to /dev/xvdh on my Ubuntu instance) and mount it at MNTPOINT, which for me is /images. This ensures that after this provisioner is run, we will have usable space at /images.

    Which brings us to our last gotcha: since Terraform doesn’t know anything about mounting, we also have to unmount the volume ourselves when the instance goes away, which brings us to one more piece of configuration on the instance. If the EC2 instance gets destroyed, we make sure to unmount the volume first.

    resource "aws_instance" "instance" {
      ...
      provisioner "remote-exec" {
        when    = "destroy"
        inline = ["sudo umount -d /dev/xvdh"] # see aws_volume_attachment.attachment.device_name. This gets mapped to /dev/xvdh
        connection {
          user = "deploy_user"
          private_key = "${file("~/.ssh/id_rsa")}"
        }
      }
    }
    

    I wish that Terraform supported these kinds of use cases out of the box, but fortunately it is flexible enough that the workarounds can be implemented fairly easily.


  • Setting up logwatch

    Part of managing Linux instances is understanding the state of the machine so that problems can be diagnosed. I really needed to get log output from my machine, so I set out to learn a very well known log summarization tool called logwatch.

    There are three parts to any logwatch ‘service’. I put this term in quotes because I haven’t defined it yet, but also because it is different from the Unix concept of a service. Generally it encompasses the type of logs you wish to summarize.

    1. A logfile configuration (located in <logwatch_root_dir>/logfiles/mylog.conf)
    2. A service configuration (located in <logwatch_root_dir>/services/mylog.conf)
    3. A filter script (located in <logwatch_root_dir>/scripts/services/mylog)

    These three pieces together are what form a service. What really helped me understand was to go through an example of one of these services. Let’s take, for example, the http-error service.

    Before we continue, a note about <logwatch_root_dir>. Logwatch looks for these configuration files in several directories. On Ubuntu, the lookup order is /usr/share/logwatch/default.conf > /usr/share/logwatch/dist.conf > /etc/logwatch/, the idea being that each successive directory overrides parameters from the previous location. This is covered really well here.
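
    In practice that means local tweaks go under /etc/logwatch. As a sketch (assuming the Ubuntu layout above), overriding one of the shipped logfile groups is just a copy and an edit:

    # Copy the shipped definition into the override location, then edit the copy;
    # parameters set there take precedence over the defaults.
    sudo cp /usr/share/logwatch/default.conf/logfiles/http-error.conf \
            /etc/logwatch/conf/logfiles/http-error.conf
    sudoedit /etc/logwatch/conf/logfiles/http-error.conf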

    Logfile Configuration

     1  # /usr/share/logwatch/default.conf/logfiles/http-error.conf
     2  ########################################################
     3  #   Define log file group for httpd
     4  ########################################################
     5  
     6  # What actual file?  Defaults to LogPath if not absolute path....
     7  LogFile = httpd/*error_log
     8  LogFile = apache/*error.log.1
     9  
    10  [ ... truncated ]
    11  
    12  # If the archives are searched, here is one or more line
    13  # (optionally containing wildcards) that tell where they are...
    14  #If you use a "-" in naming add that as well -mgt
    15  Archive = archiv/httpd/*error_log.*
    16  Archive = httpd/*error_log.*
    17  Archive = apache/*error.log.*.gz
    18  
    19  [ ... truncated ]
    20  
    21  # Expand the repeats (actually just removes them now)
    22  *ExpandRepeats
    23  
    24  # Keep only the lines in the proper date range...
    25  *ApplyhttpDate

    Lines 7-8 are basically file globs controlling which files from the log root logwatch will feed into your service. This is a pretty great idea, because you could potentially build a custom summary out of many different kinds of logs. For example, your custom log could incorporate the number of HTTP access errors encountered by your server in a given time period. If absolute paths are not given, paths are relative to the default log root, /var/log/.

    Lines 15-17 show that you can also search archive files for the same log information.

    Line 22 seems to be some leftover unused code, but was meant to expand the logs when standard syslog files have the message “Last Message Repeated n Times”. As the comment indicates, it now just removes the repeats.

    Line 25 is interesting. The * tells the logwatch perl script to apply a filter function to all lines of this file. At <logwatch_root_dir>/scripts/shared/applyhttpdate, we can see that this filters the dates in the logs, assuming a certain header format for the lines in the file. Logwatch provides a number of standard filters with intuitive names like onlycontains, remove, etc.

    Service Configuration

    So now we know how logwatch finds the logs that might be of interest to us. What does it do with these files? For that, we have to look at the service configuration file:

    1  # /usr/share/logwatch/default.conf/services/http-error.conf
    2  Title = http errors
    3  
    4  # Which logfile group...
    5  LogFile = http-error
    6  
    7  Detail = High

    The directive on Line 2 is straightforward - what should this log be named? When the log output is generated, this is what goes in the headers.

    Line 5, confusingly, tells logwatch which logfile “group” it’s interested in. This is simply the logfile configuration we looked at earlier, minus the .conf extension. However, just as the logfile configuration can filter logs with different extensions and names, the service configuration can incorporate multiple logfile groups.

    Filter Script

    Finally, logwatch runs the output of all of the logs gathered by the configurations through a script with the same name as the service configuration, but in <logwatch_root_dir>/scripts/services/<servicename>. Most bundled scripts are perl scripts, but the great thing is that you can pretty much use any scripting language.

    I won’t actually go through /usr/share/logwatch/scripts/services/http-error here, one because it’s pretty long, and two, I don’t understand perl and can’t explain it very well : ) However, the gist of it is that it takes the lines of all the gathered logs on stdin, summarizes them, and writes the result to stdout.
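
    Since the contract is just lines in on stdin and a summary out on stdout, you can poke at a filter script by hand. A rough example (the log path is an assumption, and the bundled scripts also read environment variables that logwatch normally sets, so the output can differ from a real run):

    cat /var/log/apache2/error.log | /usr/share/logwatch/scripts/services/http-error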


    My custom logwatch service doesn’t actually watch any logs, but I still need to write these three files. This was my final setup:

    Logfile conf
    
    # /etc/logwatch/conf/logfiles/customlogger.conf
    # This is actually a hack - I ask for log files, but I never actually use them.
    LogFile = *.log
    *ExpandRepeats

    Service conf
    
    # /etc/logwatch/conf/services/customlogger.conf
    Title = customlogger
    LogFile = customlogger

    Script
    
    # /etc/logwatch/scripts/services/customlogger
    #!/usr/bin/env bash
    
    # I just need to know what the memory usage is like at this point in time.
    top -o %MEM -n 1 -b | head -n 20
    free -m

    To test, I just ran:

    $ logwatch --service customlogger
    
     --------------------- customlogger Begin ------------------------
    
     top - 21:36:06 up  4:16,  1 user,  load average: 0.12, 0.05, 0.01
     Tasks: 144 total,   1 running, 143 sleeping,   0 stopped,   0 zombie
     %Cpu(s):  2.4 us,  0.6 sy,  0.0 ni, 96.8 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st
     KiB Mem :   497664 total,    43716 free,   194456 used,   259492 buff/cache
     KiB Swap:        0 total,        0 free,        0 used.   280112 avail Mem
    
       PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     13373 root      20   0  202988  44448  12796 S  0.0  8.9   0:00.76 uwsgi
     11868 root      20   0  762376  41940   8192 S  0.0  8.4   1:04.90 dockerd
     13389 root      20   0  202988  35224   3568 S  0.0  7.1   0:00.00 uwsgi
     13390 root      20   0  202988  35220   3564 S  0.0  7.1   0:00.00 uwsgi
     13305 root      20   0   47780  15720   6432 S  0.0  3.2   0:02.16 supervisord
      9024 root      20   0  290696  13876   3268 S  0.0  2.8   0:02.55 fail2ban-se+
     31127 ubuntu    20   0   36412  11060   4316 S  0.0  2.2   0:00.07 logwatch
     11872 root      20   0  229844   9460   2536 S  0.0  1.9   0:00.28 containerd
      1162 root      20   0  266524   7372    488 S  0.0  1.5   0:00.02 snapd
     27837 root      20   0  101808   6844   5868 S  0.0  1.4   0:00.00 sshd
       417 root      20   0   62048   6356   3732 S  0.0  1.3   0:01.52 systemd-jou+
         1 root      20   0   55208   5796   3936 S  0.0  1.2   0:04.99 systemd
     13430 root      20   0   67824   5636   4896 S  0.0  1.1   0:00.02 sshd
                   total        used        free      shared  buff/cache   available
     Mem:            486         190          42           7         253         273
     Swap:             0           0           0
    
     ---------------------- customlogger End -------------------------
    
    
     ###################### Logwatch End #########################
    

    So if you need this to run every other hour, the last thing to do is to set up a cron job for it. Pretty nifty, I think.

    0 */2 * * * /usr/sbin/logwatch --service customlogger --output mail --mailto <your-email>
    

  • Deploying Selenium

    I had the misfortune of trying to use Selenium in one of my upcoming projects. Actually, Selenium is a pretty amazing tool for automating website testing, but the dependencies can be tricky to nail.

    Installing on OSX is pretty straightforward:

    pip install selenium
    brew install chromedriver
    

    But this became a huge nightmare for me when installing remotely. Fortunately, Selenium releases a Docker image that you can run with this one-liner:

    docker run -d -p 4444:4444 --name selenium --shm-size=2g selenium/standalone-chrome:3.8.1-bohrium
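
    To sanity-check that the container came up, you can hit the grid’s status endpoint (this assumes port 4444 is published on localhost):

    curl http://localhost:4444/wd/hub/status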
    

    This is what you need to do in python:

    from selenium import webdriver
    
    # my_docker_host is usually localhost, but with Docker Toolbox it is the IP of
    # the virtual machine
    selenium_server_url = 'http://my_docker_host:4444/wd/hub'
    
    options = webdriver.ChromeOptions()
    options.set_headless(True)
    capabilities = options.to_capabilities()
    
    driver = webdriver.Remote(desired_capabilities=capabilities,
                              command_executor=selenium_server_url)
    
    # then just use the driver as you would normally
    driver.get(some_url)
    

  • Virtualenv Workflow

    Over the past couple of days, I have gained an appreciation for the tooling that has developed around the Python ecosystem for package management, from having to develop and deploy several Python applications.

    I’d like to share some guides and tips to maintaining python environments.

    pyenv

    I highly recommend using pyenv to manage your Python environments (and while we’re on that topic, rvm for Ruby, nenv for Node). What these tools have in common is that they make sure you can maintain a separate set of dependencies for different projects.

    The canonical example of where this is useful is if you have two projects, A and B, that both depend on a library, let’s call it unicorn. While working on project A, you realize you need the absolute latest release of the unicorn library. But when you upgrade, project B breaks, because some of its code made assumptions about the previous version of unicorn. This is a problem if there is a single global installation that both projects share, and it is why “global” installation of dependencies can be dangerous.

    After installing pyenv, switching python versions becomes as easy as:

    pyenv shell anaconda-2.4.0
    

    I also highly recommend the pyenv-virtualenv plugin. This allows a virtual environment to be activated based on the directory you’re in. The syntax looks like this:

    # this creates a new virtualenv managed by pyenv-virtualenv
    pyenv virtualenv <python-version> <env-name>
    
    # this creates a .python-version file in the local
    # directory, which will instruct the pyenv-virtualenv plugin to activate
    # the env whenever you switch to this directory
    pyenv local <env-name>
    

    Based on this, I recommend the following naming convention for the env name: <python-version>_<project-name>. This way BOTH the Python version and the project name are captured. When using pyenv virtualenv, one of the downsides is that all the envs are stored next to each other, so they somehow have to be namespaced.

    So for example, I would do:

    pyenv virtualenv 3.5.1 3_5_1_myproject # assuming python 3.5.1
    pyenv local 3_5_1_myproject
    

    Now the local .python-version file can be committed to source control, and if someone else needs to recreate the environment, they just need to make sure it’s done in the context of a Python 3.5.1 install.
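
    Recreating it on another machine then looks roughly like this (myproject and requirements.txt are stand-ins for your own project layout):

    pyenv install 3.5.1
    pyenv virtualenv 3.5.1 3_5_1_myproject
    cd myproject                      # pyenv-virtualenv activates the env via .python-version
    pip install -r requirements.txt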

    pipenv

    I hadn’t heard about pipenv until several days ago, when I had to deploy a Python application. I had become so used to npm’s --save-dev option that I wondered: what is the equivalent for Python?

    All virtual environments come bundled with the Python package manager pip. Pip is pretty wonderful, but it has its limitations, one of which is that it has no concept of dev-only packages. I wanted to install jupyter notebook for development but not for deployment, so this was a dealbreaker.

    Enter pipenv. Pipenv allows for specifying dependencies and locking them, like the tooling in most other languages (package.json/package-lock.json, Podfile/Podfile.lock, etc.). Running:

    pipenv install selenium
    pipenv install --dev jupyter
    

    gives us this Pipfile:

    [[source]]
    
    url = "https://pypi.python.org/simple"
    verify_ssl = true
    name = "pypi"
    
    
    [packages]
    
    selenium = "*"
    
    
    [dev-packages]
    
    jupyter = "*"
    

    which is really neat. Furthermore, if you have an environment set up using pyenv, pipenv will happily use it.

    Running pipenv install on its own installs only the production dependencies, which greatly simplifies deployment.
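
    In other words, the split looks something like this:

    # on a development machine: everything, including [dev-packages]
    pipenv install --dev

    # on the server: production packages only
    pipenv install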


  • That thing in Hugo

    Hugo continues to make occasional splashes on the front page of Hacker News, and like many others who were a little tired of how long Jekyll took to render even small sites, I took the leap, and would like to share some of my experiences doing so.

    The first caveat I should mention before you read ANY further: the biggest downside of Hugo, in my opinion, is that it does not come batteries-included with regards to SASS processing. All the blog posts that mention how blazingly fast it is (and they are right) don’t mention this fact. IMHO, writing CSS in the modern day always involves a CSS preprocessor. Fortunately, SASS compilation is handled by many build systems, and I will share my setup at the end of this post.

    I didn’t have much luck with the hugo import jekyll command; it just ended up creating empty directories in the new project.

    Inspired by https://thatthinginswift.com, here are some basic translations that might help someone migrating from Jekyll. All entries are listed as Jekyll => Hugo.

    Helpful functions

    {{ page.title | relative_url }} => {{ .Title | relURL }}
    {{ page.description | escape }} => {{ .Description | safeHTML }}
    {{ page.date | date: '%B %d, %Y' }} => {{ .Date | dateFormat "January 2, 2006" }}
    

    Page titles are simply passed in with the root context, so no reference to page is necessary.

    Rendering list of items

    # Jekyll - /blog.html
    {% for post in site.posts %}
      <li>
          <h2 class="post-title-home">
            {{ post.title | escape }}
          </h2>
        {{ post.content }}
      </li>
    {% endfor %}
    
    # Hugo - /layouts/blog/list.html
    {{ range .Data.Pages }}
        <h2 class="post-title-home">
          {{ .Title | safeHTML }}
        </h2>
      {{ .Render "li"}}
    {{ end }}
    

    A couple of things to note here:

    • The .Data.Pages variable is populated automatically depending on which section you are part of.
    • “li” is a content view template that lives in layouts/blog/li.html and tells a page how to render itself.

    For more, see:

    Template inheritance

    
    # Jekyll
    
    # /_layouts/default.html
    <html>
      {% include head.html %}
      <body>
        {{ content }}
      </body>
    </html>
    
    # /_includes/head.html
    <meta>...</meta>
    <meta>...</meta>
    <meta>...</meta>
    
    # /_layouts/post.html
    <div>
      <h1> {{ page.title }} </h1>
      {{ content }}
    </div>
    
    # Hugo
    
    # /layouts can be substituted for /themes/themename, see documentation below.
    
    # /layouts/_default/baseof.html
    <html>
      {{ partial "head.html" . }}
      <body>
        {{ block "main" . }}
        {{ end }}
      </body>
    </html>
    
    # /layouts/partials/head.html
    <meta>...</meta>
    <meta>...</meta>
    <meta>...</meta>
    
    # /layouts/post/single.html
    {{ define "main" }}
    <div>
      <h1> {{ .Title }} </h1>
      {{ .Content }}
    </div>
    {{ end }}
    

    SASS compilation

    My old Jekyll theme heavily used Bootstrap, so I needed a way to compile SASS files. I ended up hacking together an npm script to do this:

    {
      "name": "brightredchilli-website",
      "version": "0.0.1",
      "description": "Preprocessing code for a hugo site",
      "main": "index.js",
      "scripts": {
        "css:build": "node-sass --source-map true --output-style compressed './themes/brightredchilli/sass/main.scss' --glob -o ./themes/brightredchilli/static/css/",
        "css:watch": "onchange './themes/brightredchilli/sass/' -- npm run css:build",
        "build": "npm run css:build",
        "prewatch": "npm run build",
        "watch": "parallelshell 'npm run css:watch' 'hugo server --buildDrafts --verbose'",
        "start": "npm run watch",
        "deploy": "hugo --baseURL='https://www.yingquantan.com'"
      },
      "author": "Ying",
      "license": "MIT",
      "devDependencies": {
        "node-sass": "^4.7.2",
        "onchange": "^1.1.0",
        "parallelshell": "^3.0.2",
      },
      "dependencies": {}
    }
    

    The relevant part is the fact that I use a theme called brightredchilli, and put my sass files inside of the sass directory. Note that I don’t use static/sass, because that would cause hugo to copy the files over to the publish directory, which I don’t want. The ignoreFiles directive in config.yml didn’t seem to work for me.

    There are scripts that watch for changes, and node-sass compiles the changed files and puts the output into the static/css directory. Note that in the watch script I start a hugo server with verbose flags.

    Another gotcha was that the hugo server serves pages from memory. This means that there is nothing on disk you can use to inspect the layout and output of the content, whether or not files copied over successfully, etc. I found myself periodically just running the hugo command to generate the site into the publish directory (public by default).
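
    A one-off build to disk looks like this (the same flags the dev server uses, minus the server):

    hugo --buildDrafts --verbose   # writes the rendered site into public/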

    I found this reading extremely helpful for the migration process - it goes through how to create a minimal theme, which really helps you learn how Hugo renders content.

    I ended up killing a day doing this, and not everything migrated over properly, but I’m glad I did it.


  • Nginx Certbot

    Finally, after a week of headbanging, I managed to figure out certbot’s installation process with nginx and how to deploy python apps with uwsgi.

    1. Make sure that existing nginx is not listening on port 443

    There was a tiny trick with certbot’s nginx plugin - if you have a virtual host, say, in /etc/nginx/sites-available/test.com, it must not listen on port 443. In other words, make sure that:

      server {
    
        # MAKE SURE THESE TWO LINES ARE COMMENTED
        # listen 443 ssl default_server;
        # listen [::]:443 ssl default_server;
    
        ...
      }

    Otherwise, you might get a ‘connection reset by peer’ error. This is because something is already listening on 443 via the default server block.
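
    If you want to double-check what is currently bound to 443, something along these lines works:

    sudo ss -tlnp | grep ':443'
    sudo nginx -T | grep -n 'listen.*443'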

    2. Make sure that server_name is defined

    Nginx needs a virtual host to be configured, so if you don’t include a configuration file with the server_name directive set to your domain, it will not work:

      server {
    
        # This does not work -  this is the default configuration
        server_name _;
    
        # Works
        server_name test.com www.test.com;
    
        ...
    
      }
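
    With both of those in place, running the plugin itself is a one-liner (using test.com as a stand-in for your domain):

    sudo certbot --nginx -d test.com -d www.test.com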

  • Calling map on NodeList

    This blog post sums it up beautifully. I was initially confused about how a NodeList can be tricked into having map called on it; the answer is that NodeList is not a subclass of Array at all, only array-like, so it never inherits the map function. Array.prototype.map, however, is generic (it only needs a length and indexed elements on its this value), which is why you can .call it with a NodeList.

    For reference, if the link goes dead, here is how to call map on NodeList:

    
    Array.prototype.map.call(nodelist, n => {
      console.log(n.innerHTML)
    })