Blog

  • Some notes on matplotlib

    Matplotlib has two modes, ‘implicit’ and ‘object-oriented’. Most code I know seems to use the ‘implicit’ mode, which is to say, it uses

    import matplotlib.pyplot as plt
    

    Now, plt.figure is nice, but it creates an implicit Axes object, and it’s kind of hard to get that object back. You might want the Axes, for example, to get the plot’s range in data coordinates. So instead of using figure, use subplots:

    fig, ax = plt.subplots(figsize=(10,8))
    
    # plot some stuff here
    
    ax.viewLim.height # => the vertical extent of the plot, in data units.
    

    I needed this to reposition some labels offset from a bar chart:

    # Add the text 'up' to days when close >= open, and vice versa
    # ax.viewLim returns bounding box in data units. This allows us to get a normalized padding
    padding = 0.01 * ax.viewLim.height
    for p in up_bars.patches:
      # get_x() is the bar's left edge; add half the width to center the label
      plt.text(p.get_x() + p.get_width() / 2, p.get_y() + p.get_height() + padding, 'up',
               color='green',
               fontsize='large',
               horizontalalignment='center',
               verticalalignment='bottom')
    

    It turns out this tomfoolery is not actually needed. matplotlib’s Annotation (the Text subclass behind annotate) allows us to specify, in addition to an xy, an xytext and a textcoords, which is the ‘coordinate system that xytext is given in’. This allows us to do:

    for p in up_bars.patches:
        plt.annotate('up',
                     (p.get_x() + p.get_width() / 2, p.get_y() + p.get_height()),  # bar center, bar top
                     xytext=(0, 5),
                     textcoords='offset points',
                     color='green',
                     fontsize='large',
                     horizontalalignment='center',
                     verticalalignment='bottom')
    

    Now, this is nice. xytext is specified in ‘offset points’, so no crazy conversion needed. But notice that this function is matplotlib.pyplot.annotate, not matplotlib.pyplot.text!

    And the signatures are completely different! matplotlib.pyplot.text is (x, y, s, **kwargs) and matplotlib.pyplot.annotate is (text, xy, **kwargs). Go figure.

    Also, it turns out that 3.4.0 added bar_label, so depending on the matplotlib version you’re on, you may be able to just use that (sketched below). In any case, the deep dive into the source was fun.
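
    A minimal bar_label sketch, assuming matplotlib >= 3.4 (the data here is made up):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    bars = ax.bar(['Mon', 'Tue', 'Wed'], [3, 5, 2])
    # one 'up' label per bar, offset 5 points above the bar top
    ax.bar_label(bars, labels=['up'] * 3, padding=5, color='green', fontsize='large')
    plt.show()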


  • A little bit about gamma

    Gamma is a kind of compression algorithm. It exploits the fact that our perception of brightness, as with many other sensory stimuli, follows a power law. Our eyes are much better at detecting the difference between two darker values than between two brighter ones. From an evolutionary standpoint, this makes sense: it is much more advantageous to be able to see in the dark than to accurately tell apart two really bright shades.

    Gamma encoding exploits this so that each bit in the image data is used as efficiently as possible. Instead of storing the camera sensor data linearly, the data is encoded so that more bit space is given to darker ranges, and less to lighter ranges, mimicking our perception. This is why gamma encoded images are said to be perceptually linear.

    At its very simplest, gamma encoding uses a single value, gamma, which is typically 2.2. I cannot find any good explanation for where this value comes from, though my MacBook Pro monitor shows a gamma setting of 2.4. To gamma encode, we simply take pow(input, 1/2.2), and to decode, we take pow(input, 2.2) (with input normalized to the 0 to 1 range).

    Why is this important?

    Image algorithms are made to work on linear values, not perceptually linear values.

    Say we wanted the average of two pixels, one black, one white. If we simply average their RGB values, we get 128. But 128 appears only about 22% as bright as white, which is not what we expect: we want 50% brightness. So the correct way of doing this is to gamma decode the values first, take the average, then re-encode.

    These two Python functions describe how one might encode and decode a value.

    import math

    def encode(x):
        # linear 0-255 -> gamma-encoded 0-255 (more code values go to the darks)
        return math.pow(x / 255.0, 1 / 2.2) * 255

    def decode(x):
        # gamma-encoded 0-255 -> linear 0-255
        return math.pow(x / 255.0, 2.2) * 255
    
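    Using these, the pixel-averaging example works out like this (a quick sketch; values rounded):

    # naive average of black (0) and white (255), done in gamma space:
    (0 + 255) / 2                            # => 127.5, which *looks* only ~22% bright

    # correct average: decode to linear, average, re-encode:
    encode((decode(0) + decode(255)) / 2)    # => ~186, which looks ~50% bright
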

    Photography

    As a bonus, if you are a photography nerd and have messed around with tonal curves, then you already understand gamma intuitively. A higher slope in the tonal curve causes the values in the input range to be redistributed to a wider output range, resulting in more contrast. Gamma encoding is like a tonal curve applied to the entire input range.

    Helpful links:

    http://blog.johnnovak.net/2016/09/21/what-every-coder-should-know-about-gamma

    https://www.cambridgeincolour.com/tutorials/gamma-correction.htm


  • Concurrency vs Parallelism in Ruby Apps

    A thing that came up a few weeks ago and confused me is whether languages like Python and Ruby are multithreaded. This is my attempt to explain to myself how it works in Ruby, and I hope it helps you too. Firstly, we need to distinguish concurrency from parallelism; both are conflated with multithreading, but they are not the same. Concurrency can be thought of as interleaving: if two jobs are switched back and forth very quickly, there is a sense that both are being done ‘at the same time’, but they are merely being done concurrently. For example, you may be eating food and drinking beer. Take a bite, then a sip, then a bite, then a sip, so you are concurrently drinking beer and eating food. But you are not literally drinking beer and eating food at the same time; that would require you to have both the cup and the fork at your lips at once (in parallel), which is not possible. To do so you’d need two mouths.

    Second, Ruby has several implementations, the most popular of which is MRI (Matz’s Ruby Implementation), named after Ruby’s creator, Yukihiro Matsumoto. This is the canonical ‘Ruby’ that everyone refers to when they say ‘Ruby, the programming language’. In MRI, there is something called the GIL (Global Interpreter Lock) that ensures only one thread is ever running at once. Why the GIL is there in the first place is a rabbit hole for another time. This means that when you call Thread.new in Ruby and schedule a job on it, the job isn’t really running in parallel, because the GIL is locking Ruby code. There are other Ruby implementations, like JRuby, that do not have a GIL, and on those implementations true parallelism is possible. On the Python side, the story is the same, with CPython (the default implementation) having a GIL.

    However, Ruby threads are also native threads (true only as of Ruby 1.9). This means that every Ruby thread is backed by an OS thread. When a Ruby application blocks on I/O, the Ruby runtime can switch to allow another thread to continue running, because this blocking happens outside of the GIL. For example, if your Ruby application makes a network request and is waiting for the network to respond, it can release its lock on the GIL and allow another thread to serve an incoming request. When the network contents are fetched, the OS interrupts the blocked thread and allows it to resume. So it can be said that some amount of parallelism is happening here! However, this only happens for I/O operations. In contrast, if you had two threads handling incoming web requests and two requests came in at the same time, then whether the first thread handles both requests or each is handled by a different thread, the GIL will ensure that only one thread is executing Ruby code at any given time. In short: no parallelism during compute-only operations. This is good news for Ruby and web applications, since web applications are by nature I/O bound: most of the time, a Ruby application is blocked waiting for the database or the network. As discussed above, this blocking happens outside of the GIL, so if a request arrives while the current thread is blocked on I/O, Ruby can execute the thread that serves that request.
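
    Since the story is the same in CPython, here is a minimal Python sketch of this difference (the timings in the comments are roughly what to expect on a machine with a few free cores):

    import threading
    import time

    def io_bound():
        # time.sleep releases the GIL, just like blocking on a socket
        time.sleep(1)

    def cpu_bound():
        # pure interpreter work: the GIL serializes this across threads
        sum(i * i for i in range(10_000_000))

    def timed(target, n=4):
        threads = [threading.Thread(target=target) for _ in range(n)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

    print(timed(io_bound))   # ~1s: the four sleeps overlap
    print(timed(cpu_bound))  # ~4x a single run: no compute parallelism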

    Now, what about Puma? Doesn’t that enable parallelism? Yes, parallelism happens with Puma, but through a different mechanism. Puma forks multiple OS processes, creating multiple copies of your app in memory (multiprocessing). As a reminder, a process provides the resources needed to execute a program; processes are isolated from each other by the OS and have their own virtual address space, executable code, environment variables, process identifier, and at least one thread of execution (the main thread). A thread, on the other hand, is an entity within a process that can be scheduled for execution, but it shares the process’s virtual address space and system resources.

    So when Puma starts 5 worker processes, there are 5 copies (processes) of the Rails app running, isolated from each other by the OS. These copies live in memory (and thus take up RAM), have their own DB connection pools, and so on. However, if there are fewer than 5 CPU cores on the machine, our 5 workers would not be able to achieve full (compute) parallelism under peak load. Each worker also transparently schedules additional threads to serve the Ruby application, so that the application doesn’t need to think about it.
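
    In Python terms again, Puma’s model is closest to the multiprocessing module: each worker is a separate interpreter with its own GIL, so compute really does run in parallel (a sketch, assuming at least 4 free cores):

    import multiprocessing as mp
    import time

    def cpu_bound(_):
        # the same compute-only work as in the threading sketch above
        return sum(i * i for i in range(10_000_000))

    if __name__ == '__main__':
        start = time.time()
        # like Puma workers: separate processes, each with its own GIL
        with mp.Pool(processes=4) as pool:
            pool.map(cpu_bound, range(4))
        print(time.time() - start)  # roughly the time of a single run, not 4x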

    To come back to the original question: does Ruby support multithreading? If we simply define multithreading as having a thread primitive, then Ruby and Python are definitely multithreaded. But that doesn’t mean those threads run in parallel. Even so, multithreading in Ruby speeds up web applications, because they are I/O heavy!

    Thanks to Tom Clark for reviewing drafts of this.


  • Tyranny of Virtualization

    Last Christmas I finally upgraded my 7-year-old MacBook. I backed up all my pictures, code, and dotfiles. But I left one very major thing out: I forgot to copy over my private keys. These were tied to all my AWS deploy keys, deployed with Terraform over the past couple of years, which meant I couldn’t get into my server to renew my SSL certs, which (for some awful reason I could not debug) did not renew automatically. So I locked myself out of my own fortress, great. But I realized in retrospect that my whole setup to host this blog had been perhaps too close to the metal. I was interested in learning, and so led myself down that path, but now I have no intention of continuing down it.

    To be clear, to host this site previously, I had:

    1. Provisioned EC2 instances, IAM roles, EBS volumes, etc., via Terraform. I had to debug issues with servers not being provisioned properly, and to make sure I understood how everything worked, I would destroy and recreate the entire setup several times until it was absolutely turnkey.

    2. Bootstrapped the server by crafting Ansible files that would install nginx and spam-blocking tools like fail2ban, and automated cert renewal with Let’s Encrypt. I fought with Ubuntu versions dropping out of support mid-development, and with the differences between them.

    I haven’t even gone into some of my projects, which included bootstrapping a similar EC2 instance running a Docker daemon with Selenium containers, all struggling on a t2.nano instance because I was too cheap to pay for more. I didn’t feel like running apps in 2018 ought to cost more than a couple of bucks a month. I still think that’s true, but in return I had to move higher up the virtualization stack.

    I had already been relying on a lot of AWS infrastructure, but up till now I still felt in control of what was happening on my boxes, and felt that, if needed, I could move my compute to another cloud provider and everything would still work. With the shift to S3 and CloudFront, that changed.

    This site is generated with Hugo, and there is no reason a static site should need to be served by nginx. :) However, I was learning, and none of that was wasted. I’m lucky that AWS S3 makes website hosting super easy. However, it does not support HTTPS, which is a showstopper for me. To get HTTPS to work, I need to provision a CloudFront distribution that serves the S3 bucket. Naturally, this was all done through Terraform.
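
    The shape of that configuration is roughly the following sketch (names and values made up, in the same pre-0.12 Terraform syntax as the Jupyter post below; a custom domain would need an ACM certificate instead of the default CloudFront one):

    resource "aws_s3_bucket" "site" {
      bucket = "my-blog-bucket"
      acl    = "public-read"

      website {
        index_document = "index.html"
      }
    }

    resource "aws_cloudfront_distribution" "site" {
      enabled             = true
      default_root_object = "index.html"

      origin {
        # the S3 *website* endpoint only speaks http, so use a custom origin
        domain_name = "${aws_s3_bucket.site.website_endpoint}"
        origin_id   = "s3-site"

        custom_origin_config {
          http_port              = 80
          https_port             = 443
          origin_protocol_policy = "http-only"
          origin_ssl_protocols   = ["TLSv1.2"]
        }
      }

      default_cache_behavior {
        allowed_methods        = ["GET", "HEAD"]
        cached_methods         = ["GET", "HEAD"]
        target_origin_id       = "s3-site"
        viewer_protocol_policy = "redirect-to-https"
        min_ttl                = 0
        default_ttl            = 3600
        max_ttl                = 86400

        forwarded_values {
          query_string = false

          cookies {
            forward = "none"
          }
        }
      }

      restrictions {
        geo_restriction {
          restriction_type = "none"
        }
      }

      viewer_certificate {
        cloudfront_default_certificate = true
      }
    }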

    I don’t know what my AWS bill is going to look like, but I expect it to be zero, since this site does not get nearly enough traffic to exceed the free tier. Out the window went all of my nginx config files and AWS policies that I had created earlier. My site is cheaper and much faster, but in the process I’m sucked a little deeper into the vortex of the AWS ecosystem…what do they call it, causal pleasure?


  • Hoisting JS

    I’m going through Joel Martin’s amazing teaching tool, mal, which teaches you how to write your own Lisp from scratch. I wanted to get a little better at Javascript, and so chose it as my implementation language.

    I’m really glad I’m writing Javascript in the realm of ES6, but a bit of what looked like hoisting definitely took me by surprise.

    
    function EVAL(ast, env) {
      ...
      switch (ast[0]) {
        case "def!":
          // note: no let/const, so this assigns to implicit globals!
          [__, first, second] = ast
          let value = EVAL(second, env)
          ...
          return value
        case "let*":
          // the same destructuring here clobbers those same globals
          [__, first, second] = ast
          ...
          return EVAL(second, newEnv)
        ...
    
    ...
    

    It turns out that even with array destructuring, assigning without let or const does not create block-scoped variables: the variables leak into the global scope (implicit globals, rather than hoisting proper). I only detected this issue with a def that was nested in a let. In those situations, the variable ‘second’ would be overwritten by the nested call, so that it had actually changed by the time control returned to the caller.

    If only I had remembered to enable strict mode, which turns these silent global assignments into a ReferenceError.


  • Poetry at SFPC

    Last year I was part of an independent art school in NYC called SFPC. A lot of people have since asked me what it is, and what it means to be in such a school. The site has a blurb on what they do, but that description still leaves me and the people I explain it to slightly puzzled.

    So what is SFPC? It is a school that teaches computation in the service of art. It strives to give students the tools to express themselves in the medium of computation. I think the choice of the word computation is deliberate. It encompasses more than software, and to me, is wider than the word ‘algorithm’, but it also implies some sort of mechanization. It also teaches students to look at technology critically.

    So what really is SFPC? I think for artists, who are used to expression, it can be a way of learning the ‘engineering’ aspects of working with hardware and software, much like how photographers might eventually have to learn the intricacies of zone metering systems. For engineers like myself, it was about learning how artists see the world, and how to evaluate things qualitatively as opposed to quantitatively.

    The artist Jer Thorp once said to me, “I’m allergic to outcomes”. I had asked him how he knew his St. Louis Map Room project was successful, and what metrics he was interested in for measuring its success. I highly recommend checking out his blog post on it. I think artists believe, rightly or wrongly, that the process of asking questions will eventually lead to the right answers, the caveat being that the right questions are asked, by the right people. Implicit in that is that there is no Right answer. This can be pretty hard for an engineer to stomach.

    10 weeks is a short time to unlearn some deeply ingrained ways of thinking.


  • Jupyter in the Cloud

    I recently read Joe Feeney’s amazing guide on how to get Jupyter set up in the cloud. Having suffered through trying to optimize models on my laptop, I was really excited about being able to do this, but automated, of course.

    I would recommend two small additions on top of that post:

    1. Use the Amazon Linux Deep Learning AMIs, so that most deep learning frameworks (Keras + TensorFlow, Theano, numpy) and low-level libraries (like CUDA) are already installed, with no need to waste precious time installing Anaconda. I haven’t investigated this thoroughly, but it appears that the Deep Learning AMIs come with 30 GB of free storage, much more than the 8 GB default that comes with the Ubuntu AMIs.

    2. Actually secure the server. Fortunately, this is really easy to do with Ansible Roles.

    If you are new to Ansible and Terraform, this might not be the best post to start with, as I will only cover the broad strokes.

    Provision the server

    The relevant parts here are opening an incoming port on the server so that the Jupyter notebook server can listen on it, in addition to the default SSH port that needs to be exposed for Ansible. I had already set up an AWS key pair and a security group enabling outbound access and opening the SSH port. As you can see here, I also use Cloudflare to provision an A record so that we can set up SSL.

    Note that I also write out a local file that is configured to be my Ansible hosts file. You can make an ansible.cfg file to do this, as sketched below.
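
    A minimal ansible.cfg for this layout might look like this (the inventory path matches the local_file resource at the bottom of the Terraform config):

    # ansible.cfg
    [defaults]
    inventory = ./ansible/hosts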

    # config.tf
    provider "aws" {
      access_key = "${var.aws_access_key}"
      secret_key = "${var.aws_secret_key}"
      region = "${var.region}"
    }
    
    provider "cloudflare" {
      email = "${var.cloudflare_email}"
      token = "${var.cloudflare_api_key}"
    }
    
    resource "aws_security_group" "notebook_access" {
      name        = "jupyter_access"
      description = "Allow access on Jupyter default port"
    
      ingress {
        from_port   = 8888
        to_port     = 8888
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }
      tags {
        Name = "allow_notebook_access"
      }
    }
    
    data "aws_security_group" "default_security_group" {
      id = "${var.aws_default_security_group_id}"
    }
    
    resource "aws_instance" "chestnut" {
      ami           = "${lookup(var.deep_learning_amis, var.region)}"
      instance_type = "p2.xlarge"
      key_name = "deployer-key" # already existing through other configuration
      security_groups = ["${data.aws_security_group.default_security_group.name}", "${aws_security_group.notebook_access.name}"]
      count = "${var.count}"
    }
    
    resource "cloudflare_record" "chestnut" {
      domain = "${var.cloudflare_domain}"
      name   = "chestnut"
      value  = "${aws_instance.chestnut.public_ip}"
      type   = "A"
    }
    
    resource "local_file" "ansible_hosts" {
      filename = "${path.module}/ansible/hosts"
      content = <<EOF
    [web]
    ${cloudflare_record.chestnut.hostname}
    EOF
    }
    

    Configure notebook

    Using a playbook, we can do the SSL certificate issuance and the notebook config updates in one fell swoop.

    ---
    - hosts: web
      gather_facts: no
      remote_user: ec2-user
      vars:
        domain: "mydomain.com"
        notebook_config_path: "~/.jupyter/jupyter_notebook_config.py"
        certbot_install_from_source: yes
        certbot_auto_renew: yes
        certbot_auto_renew_user: "{{ ansible_user }}"
        certbot_auto_renew_minute: 20
        certbot_auto_renew_hour: 5
        certbot_admin_email: "{{ email }}"
        certbot_create_if_missing: yes
        certbot_create_standalone_stop_services: []
        certbot_create_command: "{{ certbot_script }} certonly --standalone --noninteractive --agree-tos --email {{ cert_item.email | default(certbot_admin_email) }} -d {{ cert_item.domains | join(',') }} --debug"
        certbot_certs:
         - domains:
           - "{{ domain }}"
      roles:
        - role: geerlingguy.certbot
          become: yes
      tasks:
        - name: Enable daily security updates
          become: yes
          package:
            name: yum-cron-security.noarch
            state: present
    
        - name: Ensure that cert keys can be read
          become: yes
          file:
            path: /etc/letsencrypt/live
            mode: a+rx
            recurse: yes
    
        - name: Ensure that archive is readable too
          become: yes
          file:
            path: /etc/letsencrypt/archive
            mode: a+rx
            recurse: yes
    
        - name: Update certfile
          replace:
            path: "{{ notebook_config_path }}"
            regexp: '.*c.NotebookApp\.certfile.*'
            replace: "c.NotebookApp.certfile = '/etc/letsencrypt/live/{{ domain }}/fullchain.pem'"
    
        - name: Update keyfile
          replace:
            path: "{{ notebook_config_path }}"
            regexp: '.*c.NotebookApp\.keyfile.*'
            replace: "c.NotebookApp.keyfile = '/etc/letsencrypt/live/{{ domain }}/privkey.pem'"
    
        - name: Configure notebook to bind to all ips
          replace:
            path: "{{ notebook_config_path }}"
            regexp: '.*c.NotebookApp\.ip.*'
            replace: "c.NotebookApp.ip = '*'"
    
        - name: Don't open browser by default
          replace:
            path: "{{ notebook_config_path }}"
            regexp: '.*c.NotebookApp\.open_browser.*'
            replace: "c.NotebookApp.open_browser = False"
    

    Some interesting things to point out here:

    1. Let’s Encrypt support for Amazon Linux AMIs is still in development, so I had to essentially copy over certbot_create_command and add the --debug flag.
    2. certbot_create_standalone_stop_services has to be set to [] here, since the role assumes nginx is running by default, and the script fails when it is not.
    3. You might need to install the geerlingguy.certbot role if you haven’t already
      • ansible-galaxy install geerlingguy.certbot

    The rest is straightforward, and the same pattern can be extended to set more options in the config file!

    With that done, all that is left is to SSH into the server, source the right environment, and run the notebook (with a command like jupyter notebook). I guess this could be daemonized, but I like to stay SSHed in for confirmation that the notebook is still alive. I ran into an issue trying to debug this on a t2.nano instance, where the notebook would continually crash, and it was good to see some output.

    I had to stop going down the rabbit hole, but it would be trivial to run fail2ban on the server for good measure. Right now we also still need to copy the token from stdout when the server starts, but the config file could be modified to do away with that.
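
    For example (a sketch, not part of the setup above), the notebook config accepts a hashed password, which removes the need to copy the token at all:

    # run locally; the notebook package ships passwd() for hashing
    from notebook.auth import passwd

    print(passwd('my-secret-password'))  # => 'sha1:...'

    # then, on the server, in ~/.jupyter/jupyter_notebook_config.py:
    # c.NotebookApp.password = 'sha1:...'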