Abacus
I'm Daniel Kronovet, a data scientist living in New York City.
http://kronosapiens.github.io/
Wed, 29 Mar 2017 12:54:53 +0000
Jekyll v3.4.3

<h1>Objective Functions in Machine Learning</h1>
<p>Machine learning can be described in many ways. Perhaps the most useful is as a type of optimization. Optimization problems, as the name implies, deal with finding the best, or “optimal”, solution to some type of problem, generally mathematical.</p>
<p>In order to find the optimal solution, we need some way of measuring the quality of any solution. This is done via what is known as an <strong>objective function</strong>, with “objective” used in the sense of a goal. This function, taking data and model parameters as arguments, can be evaluated to return a number. Any given problem contains some parameters which can be changed; our goal is to find values for these parameters which either maximize or minimize this number.</p>
<p>The objective function is one of the most fundamental components of a machine learning problem, in that it provides the basic, formal specification of the problem. For some objectives, the optimal parameters can be found exactly (known as the analytic solution). For others, the optimal parameters cannot be found exactly, but can be approximated using a variety of iterative algorithms.</p>
<p>Put metaphorically, we can think of the model parameters as a ship in the sea. The goal of the algorithm designer is to navigate the space of possible values as efficiently as possible to guide the model to the optimal location.</p>
<p>For some models, the navigation is very precise. We can imagine this as a boat on a clear night, navigating by the stars. For others, the ship is stuck in a fog, able to make only small, local jumps without reference to a greater plan.</p>
<p>Let us consider a concrete example: finding an average. Our goal is to find a value, <script type="math/tex">\mu</script>, which is the best representation of the “center” of some set of <script type="math/tex">n</script> numbers. To find this value, we define an objective: the sum of the squared differences between this value and our data:</p>
<script type="math/tex; mode=display">\mu = argmin_{\mu} \sum_{i=1}^n (x_i - \mu)^2</script>
<p>This is our objective function, and it provides the formal definition of the problem: to <strong>minimize an error</strong>. We can analyze and solve the problem using calculus. In this case, we rely on the foundational result that the minimum of a function is reliably located at the point where the derivative of the function takes on a zero value. To solve the function, we take the derivative, set it to 0, and solve for <script type="math/tex">\mu</script>:</p>
<script type="math/tex; mode=display">\frac{d}{d\mu} \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n -2(x_i - \mu) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n (x_i - \mu) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n x_i = n\mu</script>
<script type="math/tex; mode=display">\frac{\sum x_i}{n} = \mu</script>
<p>And so we see that the value which minimizes the squared error is, in fact, the mean. This elementary example may seem trite, but it is important to see how something as simple as an average can be interpreted as a problem of optimization. Note how the value of the average changes with the objective function: the mean is the value which minimizes the sum of squared error, but it is the median which minimizes the sum of <em>absolute error</em>.</p>
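<p>This distinction is easy to check numerically. Here is a small sketch (using NumPy, with made-up data) that brute-forces each objective over a grid of candidate “centers”:</p>

```python
import numpy as np

# Made-up data for illustration.
data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

def squared_error(mu, x):
    return np.sum((x - mu) ** 2)

def absolute_error(mu, x):
    return np.sum(np.abs(x - mu))

# Brute-force search over a grid of candidate "centers".
candidates = np.linspace(0, 10, 1001)
best_sq = candidates[np.argmin([squared_error(m, data) for m in candidates])]
best_abs = candidates[np.argmin([absolute_error(m, data) for m in candidates])]

print(best_sq, np.mean(data))     # squared error is minimized by the mean
print(best_abs, np.median(data))  # absolute error is minimized by the median
```

<p>The two objectives genuinely disagree: the outlier at 10 pulls the squared-error optimum toward it, while the absolute-error optimum stays at the median.</p>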
<p>In this example, the problem could be solved analytically: we were able to find the exact answer, and calculate it in linear time. For other problems, the objective function does not permit an analytic or linear-time solution. Consider the logistic regression, a classification algorithm whose simplicity, flexibility, and robustness have made it a workhorse of data teams. This algorithm iterates over many possible classification boundaries, each iteration yielding a more discriminating classifier. Yet the true optimum is never found: the algorithm simply terminates once the solution has reached relative stability.</p>
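<p>To give a flavor of the iterative approach (a simplified sketch, not the exact procedure a library would use), here is gradient ascent on the log-likelihood of a one-weight logistic model, with hypothetical data, terminating once the solution is relatively stable:</p>

```python
import numpy as np

# Hypothetical data: one feature, binary labels generated from a known weight.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-2.0 * x))).astype(float)

# Gradient ascent on the (average) log-likelihood of a one-weight logistic model.
w = 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-w * x))     # predicted probability of y = 1
    gradient = np.mean((y - p) * x)  # d/dw of the average log-likelihood
    w += 0.5 * gradient
    if abs(gradient) < 1e-8:         # stop at relative stability
        break

print(w)  # close to, but never exactly, the optimum
```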
<p>There are other types of objective functions that we might consider. In particular, we can conceive of the <em>maximizing of a probability</em>.</p>
<p>Part of the power of probability theory is the way in which it allows one to reason formally (with mathematics) about that which is fundamentally uncertain (the world). The rules of probability are simple: events are assigned probabilities, and the probabilities must all add to one, because <em>something</em> has to happen. The way we represent these probabilities, however, is somewhat arbitrary – a list of real numbers summing to 1 will do. In many cases, we use functions.</p>
<p>Consider flipping a coin. There are two possible outcomes: heads and tails. The odds of heads and the odds of tails must add to 1, because one of them must come up. We can represent this situation with the following <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">equation</a>:</p>
<script type="math/tex; mode=display">p^x(1-p)^{1-x}</script>
<p>Here <script type="math/tex">x</script> represents the coin: <script type="math/tex">x = 1</script> means heads and <script type="math/tex">x = 0</script> means tails, and <script type="math/tex">p</script> is the probability of coming up heads. We see that if the coin is heads, the value is <script type="math/tex">p</script>, the chance of heads. If the coin is tails, the value is <script type="math/tex">1-p</script>, which by necessity is the chance of tails. We call this equation <script type="math/tex">P(x)</script>, and it is a probability distribution, telling us the probability of various outcomes.</p>
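<p>As a quick sanity check, the equation can be written directly as a small function:</p>

```python
# The Bernoulli distribution from the text: P(x) = p^x * (1-p)^(1-x).
def bernoulli(x, p):
    return p ** x * (1 - p) ** (1 - x)

print(bernoulli(1, 0.7))  # heads: returns p
print(bernoulli(0, 0.7))  # tails: returns 1 - p
```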
<p>Now, not all coins are fair (meaning that <script type="math/tex">p = 1-p = 0.5</script>). Some may be unfair – with heads, perhaps, coming up more often. Say we flipped a coin a few times, and were curious as to whether the coin was biased. How might we discover this? Via the likelihood equation. Intuitively, we seek a value of <script type="math/tex">p</script> which gives the <em>maximum likelihood</em> to the coin flips we saw.</p>
<p>The word maximum should evoke our earlier discussion: we are again in the realm of optimization. We have a function and are looking for an optimal value: except now, instead of minimizing an error, we want to <strong>maximize a likelihood</strong>. Calculus helped us once before – perhaps it can again?</p>
<p>Here is the <em>joint likelihood distribution</em> of our series <script type="math/tex">x</script> of n coin flips (now <script type="math/tex">x</script> represents many flips, each individual flip subscripted <script type="math/tex">x_1</script>, etc):</p>
<script type="math/tex; mode=display">P(x) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}</script>
<p>The thing to note here is that the joint probability of what we call <em>independent</em> events (i.e. events where one does not give us knowledge about the other) is the product of the probabilities of the events separately. In this case, the coin flips are <em>conditionally independent</em> given the heads probability <script type="math/tex">p</script>. One consequence is that <script type="math/tex">P(x) \in (0, 1)</script>, and generally much closer to 0 than to 1.</p>
<p>The <em>logarithm</em> is a remarkable function. When introduced in high school, the logarithm is often presented as “the function which tells you the power to which you would need to raise a base to get back the original argument”, or put more succinctly, the degree to which you would need to exponentiate a base. This exposition obscures the key applications of the logarithm:</p>
<ol>
<li>It makes small numbers big, and big numbers small.</li>
<li>It turns multiplication into addition.</li>
<li>It increases monotonically (if <script type="math/tex">x</script> gets bigger, <script type="math/tex">log(x)</script> gets bigger).</li>
</ol>
<p>The first point helps motivate the use of “log scales” when presenting data of many types. Humans (and computers) are comfortable reasoning about magnitudes along certain types of scales; others, such as exponential scales, are less intuitive. The logarithm allows us to interpret events of incredible magnitude in a more familiar way. This property, conveniently, also comes in handy when working with very small numbers – such as those involved in joint probability calculations, in which the probability of any particular complex event is nearly 0. The logarithm takes very small positive numbers and converts them to more comfortable, albeit negative, numbers – much easier to think about (and, perhaps more importantly, compute with).</p>
<p>The second point comes in handy when we attempt the actual calculus. By turning multiplication into addition, the function is more easily differentiated, without resorting to cumbersome applications of the product rule.</p>
<p>The third point provides the essential guarantee that the optimal solution for the log function will be identical with the optimal solution for the original function. This means that we can optimize the log function and get the right answer for the original.</p>
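<p>All three properties can be verified numerically; a small sketch:</p>

```python
import math

# 1. Small positive numbers become comfortable (negative) magnitudes.
print([math.log(v) for v in (1e-10, 1e-5, 0.5)])

# 2. Multiplication becomes addition.
assert abs(math.log(0.2 * 0.3) - (math.log(0.2) + math.log(0.3))) < 1e-12

# 3. Monotonicity: the ordering (and hence the argmax) is preserved.
values = [0.1, 0.7, 0.2]
argmax = max(range(3), key=lambda i: values[i])
log_argmax = max(range(3), key=lambda i: math.log(values[i]))
assert argmax == log_argmax == 1
```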
<p>Taking the logarithm of the joint likelihood function, we get the <strong>log likelihood</strong>:</p>
<script type="math/tex; mode=display">log(P(x)) = \sum_{i=1}^n x_ilog(p) + (1-x_i)log(1-p)</script>
<p>What can we do with this? In this problem, we can use it to find the optimal value for p. Taking the derivative of this function with respect to p (recall that the derivative of <script type="math/tex">log(x)</script> is <script type="math/tex">1/x</script>), and setting to 0, we have:</p>
<script type="math/tex; mode=display">\frac{d}{dp}log(P(x)) = \sum_{i=1}^n \frac{x_i}{p} - \frac{1-x_i}{1-p} = 0</script>
<p>We can solve for p:</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \frac{x_i(1-p)}{p} - (1-x_i) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n (\frac{x_i}{p}-x_i) - (1-x_i) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n \frac{x_i}{p} - 1 = 0</script>
<script type="math/tex; mode=display">\frac{\sum x_i}{p} = n</script>
<script type="math/tex; mode=display">\frac{\sum x_i}{n} = p</script>
<p>And so again, the optimal value for the probability p of heads is, for this particular definition of optimal, the ratio of observed heads to total observations. We see how our intuition (“the average!”) is made rigorous by the formalism.</p>
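<p>We can check this closed-form answer numerically, scoring a grid of candidate values of <script type="math/tex">p</script> against the log likelihood of some made-up flips:</p>

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical observations: 7 heads in 10

def log_likelihood(p, xs):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

# Score every candidate p on a fine grid and keep the best.
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: log_likelihood(p, flips))

print(best_p)                   # matches the analytic answer below
print(sum(flips) / len(flips))  # the observed fraction of heads
```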
<p>This example is a model of a simple object. More advanced objects (such as a constellation of interdependent events) require more advanced models (such as a Hidden Markov Model), for which the optimal solution involves many variables and as a consequence more elaborate calculations. In some cases, as with the logistic regression, the exact answer cannot ever be known, only <a href="http://kronosapiens.github.io/blog/2015/11/22/understanding-variational-inference.html">iteratively approached</a>.</p>
<p>In all of these cases, however, the log of the likelihood function remains an essential tool for the analysis. We can use it to calculate a measure of quality for an arbitrary combination of parameters, as well as use it (in a variety of ways) to attempt to find optimal parameters in a computationally efficient way. Further, while the examples given above are possibly the two simplest non-trivial examples of these concepts, they capture patterns of derivation which recur in more complex models.</p>
Tue, 28 Mar 2017 00:00:00 +0000
http://kronosapiens.github.io/blog/2017/03/28/objective-functions-in-machine-learning.html
machine learning, optimization, math, blog

<h1>Master's Thesis</h1>
<p>I am very happy to announce that I have completed a master’s thesis. Titled “An Analysis of Pairwise Preference”, it describes a theory of decision-making which allows for the formal analysis of subjective preferences. As an academic work, it is an exercise in developing a theory from first principles, demonstrating a grasp of method and rigor. The experience of writing it was excellent.</p>
<p><strong>Abstract:</strong></p>
<blockquote>
<p>Human coordination can be thought of as a problem in information processing. The subjective interiors of the entities involved creates challenges for formal analysis of preference. This work develops theoretical foundations for analysis of subjective preference, and develops and evaluates a number of algorithms for undertaking such analyses.</p>
</blockquote>
<p>The thesis consists of <a href="https://github.com/kronosapiens/thesis/tree/master/data">data</a>, <a href="https://github.com/kronosapiens/thesis/tree/master/code">code</a>, and <a href="http://nbviewer.jupyter.org/github/kronosapiens/thesis/blob/master/tex/thesis.pdf">tex</a>. Feedback welcome.</p>
Mon, 06 Feb 2017 00:00:00 +0000
http://kronosapiens.github.io/blog/2017/02/06/thesis.html
math, computer science, economics, blog

<h1>A Basic Computing Curriculum</h1>
<p>Over the summer, I developed a <a href="https://github.com/kronosapiens/computing-basics"><strong>basic curriculum</strong></a> for those looking to become more proficient with computers. This project emerged from work done as a TA for QMSS G4063: Data Visualization. While I was assisting with that course, it became clear that students were struggling with basic computing tasks only tangentially related to the material of the course. As a result, much time was spent teaching students basic computing and programming concepts: the ambient knowledge that those comfortable with software take for granted.</p>
<p>I received positive reviews from students and faculty, and was hired to develop a simple, introductory curriculum for incoming students. The curriculum covers the following topics:</p>
<ol>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/0-overview">Overview of computer architecture and interfaces</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/1-development">Software development concepts</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/2-python">Introduction to Python</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/3-r">Introduction to R</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/4-web">Networking and the web</a></li>
</ol>
<p>The goal of this content is not to be comprehensive, but rather succinct, accessible, and intuitive, without sacrificing precision and accuracy. The curriculum is designed to be interacted with: the hosting on GitHub and the included code templates are intended to provide entry points, encouraging students to interact with files as developers and to gain comfort with software development interfaces.</p>
<p>This is a work in progress. Suggestions and feedback would be welcomed :)</p>
Tue, 04 Oct 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/10/04/a-basic-computing-curriculum.html
programming, education, blog

<h1>The Problem of Information II</h1>
<h4 id="1">1</h4>
<p>In Part I, we established the data processing inequality and used it to conclude that no analysis of data can increase the amount of information we have about the world, beyond the information provided by the data itself:</p>
<script type="math/tex; mode=display">X \rightarrow Y \rightarrow \hat{X}</script>
<script type="math/tex; mode=display">\Rightarrow</script>
<script type="math/tex; mode=display">I(\hat{X};X) \leq I(Y;X)</script>
<p>We’re not quite done. The fundamental problem is not simply <em>learning about the world</em>, but rather <em>human learning about the world</em>. The full model might look something like this:</p>
<script type="math/tex; mode=display">\text{world} \rightarrow \text{measurements} \rightarrow \text{analysis} \rightarrow [perception] \rightarrow \text{understanding}</script>
<p>Incorporating the human element requires a larger model and additional tools.</p>
<h4 id="2">2</h4>
<p>A <strong>channel</strong> is the medium by which information travels from one point to another:</p>
<script type="math/tex; mode=display">X \rightarrow [channel] \rightarrow Y</script>
<p>At one end, we have information, encoded as some sort of representation. We send this representation through the channel. A receiver at the other end receives some signal, which they reconstruct into some sort of representation. Hopefully, this reconstruction is close to the original representation.</p>
<p>No (known) channel is perfect. There is too much uncertainty in the underlying physics and the mechanics of their actual construction. Mistakes are made. Bits are flipped. We say the amount of information that a channel can reliably transmit is that channel’s <strong>capacity</strong>. For a given channel, capacity is denoted <script type="math/tex">C</script>, and for input variable <script type="math/tex">X</script> and output variable <script type="math/tex">Y</script> it is defined like this (<script type="math/tex">\triangleq</script> means “defined as”):</p>
<script type="math/tex; mode=display">C_{channel} \triangleq max_{P(X)} I(X; Y)</script>
<p>This means that the capacity is equal to the maximum mutual information between <script type="math/tex">X</script> and <script type="math/tex">Y</script>, over all distributions on <script type="math/tex">X</script>. Using a well-known identity, we can rewrite this equation as follows:</p>
<script type="math/tex; mode=display">I(X; Y) = H(Y) - H(Y|X)</script>
<p>This shows us that capacity is a function of both the entropy of <script type="math/tex">Y</script> and the conditional entropy of <script type="math/tex">Y</script> given <script type="math/tex">X</script>. The conditional entropy represents the uncertainty in <script type="math/tex">Y</script> given <script type="math/tex">X</script> – in other words, the quality of the channel (for a perfect channel, this value would be <script type="math/tex">0</script>). <script type="math/tex">H(Y)</script> is a function of <script type="math/tex">H(X)</script> and is what we try to maximize when determining capacity.</p>
<p>Observe that capacity <script type="math/tex">C</script> is a function of both the channel and the randomness of the input. For a fixed channel, capacity is a function of the input. For a fixed input, capacity is a function of the channel (here it is known as “distortion”).</p>
<p>Below is an example of what is known as a “<a href="https://en.wikipedia.org/wiki/Binary_symmetric_channel">Binary Symmetric Channel</a>” – a channel with two inputs and outputs (hence binary), and a symmetric probability of error <script type="math/tex">p</script>. This diagram should be interpretable given what we’ve discussed above.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Binary_symmetric_channel_%28en%29.svg/800px-Binary_symmetric_channel_%28en%29.svg.png" alt="binary symmetric channel" /></p>
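<p>For the binary symmetric channel specifically, the capacity has a well-known closed form: <script type="math/tex">C = 1 - H(p)</script>, where <script type="math/tex">H</script> is the binary entropy function. A small sketch:</p>

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Standard result: a BSC with crossover probability p has capacity 1 - H(p).
def bsc_capacity(p):
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.0))  # 1.0: a perfect channel carries one full bit per use
print(bsc_capacity(0.5))  # 0.0: pure noise carries nothing
print(bsc_capacity(0.1))  # somewhere in between
```

<p>Note the two extremes: a noiseless binary channel carries exactly one bit per use, while a channel that flips bits half the time carries nothing at all.</p>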
<p>The major result in information theory concerning channel capacity goes like this:</p>
<script type="math/tex; mode=display">P_e^{(n)} \rightarrow 0 \Rightarrow C \geq H(X)</script>
<p>What this says is that for any transmission scheme where the probability of error (<script type="math/tex">P_e^{(n)}</script>) goes to zero (as block length <script type="math/tex">n</script> increases), the capacity of the channel is greater than or equal to the entropy of the input. This is true even for perfect channels (with no error) – meaning that <script type="math/tex">H(X)</script>, the uncertainty inherent in the source, is a <strong>fundamental limit</strong> in communication.</p>
<p>More plainly, we observe that successful transmission of information requires a channel that is less uncertain than the source you’re trying to transmit. This should be intuitively satisfying. If the channel is more chaotic than what you’re trying to communicate, the output will be more a result of that randomness than whatever message you wanted to send. The flip interpretation, which is less intuitive, is that the more random the source, the more tolerant you can be of noisy channels.</p>
<p>Finally, the converse tells us that any attempt to send a high-entropy source through a low-capacity channel is guaranteed to result in high error.</p>
<h4 id="3">3</h4>
<p>With that established, we can now consider the question of <strong>human</strong> communication:</p>
<script type="math/tex; mode=display">stimulus \rightarrow [perception] \rightarrow impression</script>
<p>Let’s consider the metaphor and see if it holds. We want to say that the process of communication is exposing those around us to stimulus (ourselves, media, etc), having that stimulus transmitted through the channels of perception, and ultimately represented in the mind as some sort of impression (such as an understanding or feeling). On a first impression, this seems reasonable and general.</p>
<p>What is <em>not</em> present here is the concept of <strong>intention</strong>. In our communication, we may at various points be trying to teach, persuade, seduce, amuse, mislead, or learn. What is also absent is the concept of “creativity”, or receiving an impression somehow greater than the stimulus. We will return to these questions later and see if we can address them.</p>
<p>Let’s consider a simple case: the teacher trying to teach. We can assume good intention and an emphasis on the transfer of information. We model as follows:</p>
<script type="math/tex; mode=display">teaching \rightarrow [perception] \rightarrow learning</script>
<p>The “capacity” of human perception is then:</p>
<script type="math/tex; mode=display">C_{perception} = max_{P(teaching)} I(teaching; learning)</script>
<script type="math/tex; mode=display">I(teaching; learning) = H(learning) - H(learning|teaching)</script>
<p>This allows us to consider both the randomness of the source (teaching), and the uncertainty in the transmission (perception). We seem justified in proposing the following:</p>
<ol>
<li>The challenge of teaching is in maximizing the information the student has about the subject.</li>
<li>A subject is “harder” if there is more complexity in the subject matter.</li>
<li>A subject is also “harder” if it is difficult to convey the material in an understandable way.</li>
<li>A “good” teacher is one who can present the material in a way that is appropriate for the students.</li>
<li>A “good” student is one who can make the most sense of the material that was presented.</li>
</ol>
<p>Let’s begin with (4), the idea of material being tailored to the student, or the input being tailored to the channel. Intuitively, we would like to say that a good teacher can change the teaching (the stimulus) they present to the student in order to maximize the student’s learning.</p>
<p>First, consider that material may be too advanced for some students. We would like to say then that the capacity of that student was insufficient for the complexity of the material. To say this, we must first consider the relationship between randomness with complexity.</p>
<h4 id="4">4</h4>
<p>The language of information theory is the language of randomness and uncertainty. In teaching, it is more comfortable to speak in the language of complexity, difficulty, or challenge. Can these be equivalent?</p>
<p>Entropy is a measure of randomness, and entropy is a function of both 1) the number of possible outcomes of a random process, and 2) the likelihood of the various outcomes. A 100-sided fair die is more random than a 10-sided fair die, while a 1000-sided die that always came up 7 is not really random at all.</p>
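<p>The dice comparison is easy to make concrete. The entropy of a fair <script type="math/tex">n</script>-sided die is <script type="math/tex">log_2(n)</script> bits, while a die that always lands the same way has zero entropy, regardless of how many sides it has:</p>

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1 / 10] * 10))    # fair 10-sided die: log2(10), about 3.32 bits
print(entropy([1 / 100] * 100))  # fair 100-sided die: log2(100), about 6.64 bits
print(entropy([1.0]))            # always comes up 7: zero randomness
```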
<p>Complexity, on the other hand, can be understood as the number and nature of the relationships among various parts of a system. We can perhaps formalize this as the number of pathways by which a change in one part of the system can affect the overall state of the system.</p>
<p>To argue equivalence, we assert that there is always some degree of uncertainty in any system, or in any field of study. In math, these are formalized as variables. In history, these can be the motivations of various actors. The more complex a system, the larger the number of possible outcomes and the richer the relationships between its components. In the language of probability, we say there are more possible outcomes, and that due to the complex relationships between parts, many different outcomes carry significant probability.</p>
<p>Consider the example of teaching math. Arithmetic is simpler than geometry, in that the expression</p>
<script type="math/tex; mode=display">2 + 2</script>
<p>contains fewer conceptual “moving pieces” than the expression</p>
<script type="math/tex; mode=display">\sin(45°)</script>
<p>Understanding arithmetic requires the student to keep track of the concept of “magnitude” and be able to relate magnitudes via relations of joining (addition and subtraction) and scaling (multiplication and division). It requires the abstract concept of negative numbers.</p>
<p>Understanding geometry requires more tools. It requires students to be able to deal with points in space, and understand how to use the Cartesian plane to represent the relationship between points and numbers. It introduces the idea of “angle” as a new kind of relationship, on top of arithmetic’s “bigger” and “smaller”.</p>
<p>Put another way, arithmetic requires only a line, while geometry requires a plane. More concepts means more possible relationships between objects, which means more possible dimensions of uncertainty, which means more complexity.</p>
<p>We conclude at least a rough equivalence between complexity and uncertainty.</p>
<h4 id="v">5</h4>
<p>Returning to the teaching example, we can now speak in terms of complexity of the material instead of randomness of the source. If material is too complex for the student (<script type="math/tex">% <![CDATA[
C < H(X) %]]></script>), then the material cannot be taught to that student (yet).</p>
<p>Observe that the channel (the student) is not fixed, but is able to handle increasingly complex subjects over time.</p>
<p>… to be continued?</p>
Thu, 19 May 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/05/19/the-problem-of-information2.html
information-theory, machine-learning, blog

<h1>The Problem of Information</h1>
<h4 id="1">1</h4>
<p>The Data Processing Inequality is one of the first results in information theory.</p>
<p>It can be stated as follows:</p>
<p><em>No transformation of measurements of the world can increase the amount of information available about that world.</em></p>
<p>In formal language, it goes like this:</p>
<p>Given a first-order Markov chain</p>
<script type="math/tex; mode=display">X \rightarrow Y \rightarrow \hat{X}</script>
<p>such that <script type="math/tex">\hat{X}</script> depends only on <script type="math/tex">Y</script>, which depends only on <script type="math/tex">X</script>, then</p>
<script type="math/tex; mode=display">I(\hat{X};X) \leq I(Y;X)</script>
<p>The measure <script type="math/tex">I(A;B)</script> is known as the <a href="https://en.wikipedia.org/wiki/Mutual_information">mutual information</a>, a measure of how much information one variable gives us about another.</p>
<p>What this says is that the information <script type="math/tex">\hat{X}</script> tells us about <script type="math/tex">X</script> cannot be more than the information we already had from <script type="math/tex">Y</script>. In other words, that <strong>processing</strong> data adds no new information.</p>
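<p>To make this concrete, here is a small numerical sketch (the channel model and flip probability are made up for illustration): a fair coin <script type="math/tex">X</script> is observed through a channel that flips the bit with some probability, and the observation is then reprocessed through a second such channel. For a uniform input sent through a flip-with-probability-<script type="math/tex">q</script> channel, the mutual information with the input is <script type="math/tex">1 - H(q)</script>:</p>

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Mutual information between a fair-coin input and the output of a
# channel that flips the bit with probability q.
def mutual_info(q):
    return 1.0 - h2(q)

flip = 0.1  # illustrative noise level
# Reprocessing Y through a second such channel compounds the noise:
composed = flip * (1 - flip) + (1 - flip) * flip  # net flip probability 0.18

I_YX = mutual_info(flip)         # I(Y; X)
I_XhatX = mutual_info(composed)  # I(X-hat; X)

print(I_YX, I_XhatX)
assert I_XhatX <= I_YX  # the data processing inequality, numerically
```

<p>The second stage of processing strictly loses information about <script type="math/tex">X</script>, just as the inequality demands.</p>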
<h4 id="2">2</h4>
<p>Let’s consider the problem of learning from data. Let’s put it in the framework:</p>
<script type="math/tex; mode=display">\text{the world} \rightarrow \text{some measurements} \rightarrow \hat{\text{your analysis}}</script>
<p>which implies</p>
<script type="math/tex; mode=display">I(\hat{\text{your analysis}};\text{the world}) \leq I(\text{some measurements};\text{the world})</script>
<p>In other words, analysis doesn’t tell you anything new. What it <strong>does</strong> do, though, is make the information you already have more easily digestible. It puts it in forms you can work with. Think averages and odds. Think dashboards. Less information, but more actionable.</p>
<h4 id="3">3</h4>
<p>Let’s take this a bit further. Think of your analysis as a function, <script type="math/tex">G</script>, of your data. This gives us:</p>
<script type="math/tex; mode=display">X \rightarrow Y \rightarrow G(Y)</script>
<p>We can then formulate the learning problem as a search over the space of possible functions <script type="math/tex">G</script>. In order to assess the quality of one <script type="math/tex">G</script> over another, we must use some sort of measure of “expressiveness”. Call this <script type="math/tex">E</script>, such that <script type="math/tex">E[G(Y)]</script> is some measurement of the expressiveness of the analysis <script type="math/tex">G(Y)</script>.</p>
<p>Our goal becomes finding an optimal function <script type="math/tex">G^*</script> such that:</p>
<script type="math/tex; mode=display">E[G^*(Y)] \geq E[G(Y)], \forall G</script>
<p>In other words, that <script type="math/tex">G^*</script> maximizes the expressive power of the data <script type="math/tex">Y</script>. Our choice of <script type="math/tex">E</script> drives the exploration of the space of possible <script type="math/tex">G</script>.</p>
<p>This is the general formulation. To see how this general formulation maps to practice, let’s take <script type="math/tex">G</script> to be some sort of classification or regression model and <script type="math/tex">E</script> to be the log likelihood or squared error. Note how we have described the typical machine learning setting. To see how this formulation helps frame different problems, let’s take <script type="math/tex">G</script> to be a causal graph – what then should <script type="math/tex">E</script> be? How could one select an <script type="math/tex">E</script> to drive exploration of the space of causal graphs?</p>
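<p>As a toy sketch of this “search over <script type="math/tex">G</script>, scored by <script type="math/tex">E</script>” framing (the candidate functions and data here are made up):</p>

```python
# Hypothetical (x, y) observations.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# A tiny space of candidate functions G.
candidates = {
    "identity": lambda x: x,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}

# One possible choice of E: negative squared error (larger is better).
def expressiveness(g):
    return -sum((g(x) - y) ** 2 for x, y in data)

best = max(candidates, key=lambda name: expressiveness(candidates[name]))
print(best)  # the data were generated near y = 2x, so "double" wins
```

<p>In practice the space of <script type="math/tex">G</script> is continuous and vast, and the search is driven by optimization rather than enumeration, but the structure of the problem is the same.</p>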
<h4 id="4">4</h4>
<p>If our goal is to understand the world, then it would seem as though we have two opportunities for growth.</p>
<p>First, in our measurements. The world is of infinite dimension, and any measurement is a finite reflection. Measurements are choices, and the dimensions along which we choose to measure will place the upper bound on our usable knowledge.</p>
<p>Second, in our analysis. Given a finite set of measurements, <script type="math/tex">Y</script>, our goal is to transform this into a different representation that expresses the information necessary to a given task, with “expressiveness” itself given by some measure. If that task is prediction or classification (core learning problems), then expressiveness will almost certainly be measured either via the likelihood of the analysis or the smallness of the error. But there can be other tasks and other measures of expression.</p>
<p>Which, at this time, is our limiting factor? Are we limited by our analysis, unable to make sense of what we know? Or are we limited by our measurements, trying to navigate with skewed vision?</p>
<p>Do you know?</p>
<h4 id="5">5</h4>
<p><strong>Proof:</strong></p>
<script type="math/tex; mode=display">I(\hat{X};X)</script>
<p>Definition of mutual information:</p>
<script type="math/tex; mode=display">= H(X) - H(X|\hat{X})</script>
<p>Conditioning reduces entropy:</p>
<script type="math/tex; mode=display">\leq H(X) - H(X|\hat{X}, Y)</script>
<p>By Markov property:</p>
<script type="math/tex; mode=display">= H(X) - H(X|Y)</script>
<p>Voila:</p>
<script type="math/tex; mode=display">= I(X;Y)</script>
<p><strong>Note:</strong> <script type="math/tex">H(A)</script> denotes the <strong><a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a></strong> of the random variable <script type="math/tex">A</script>, a measure of uncertainty in <script type="math/tex">A</script>. Given that more information can’t hurt, the following is always true:</p>
<script type="math/tex; mode=display">H(A|B) \leq H(A)</script>
Sat, 16 Apr 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/04/16/the-problem-of-information.html
information-theory machine-learning blog

Elements of Modern Computing

<p>The goal of this guide is to explain in a high-level but useful way the core concepts of modern computing. This guide is aimed at those who have never interacted with software as more than an end-user of graphical applications, but who for whatever reason have a desire for more flexible and precise control over their computer and its software.</p>
<h2 id="understanding-the-filesystem">Understanding the Filesystem</h2>
<p>The most important thing to remember when doing any sort of programming is that every command is run in the context of some <strong>location</strong>. Your Desktop is a location. Your Documents folder is a location. Everything on your computer has a location, and everything exists in relation to everything else. The whole thing is called a <strong>filesystem</strong>. Here is an illustration of the typical Mac OSX filesystem (Windows filesystems are fairly similar). Indentation implies nesting:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/
    Applications/
        Chess.app
        Rstudio.app
        iTunes.app
        ...
    System/
        Library/
    Users/
        Guest/
        Shared/
        <username>/
            Desktop/
                file.txt
            Documents/
            Downloads/
            ...
    bin/
        pwd
        ls
        chmod
        ...
    var/
        log/
        tmp/
        ...
    etc/
    ...
</code></pre>
</div>
<p>The key takeaway here is that every file in your computer has a location, and this location can be described by the full, or “absolute” path. For example, the <code class="highlighter-rouge">file.txt</code> file on the desktop can be described in the following way:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/Users/<username>/Desktop/file.txt
</code></pre>
</div>
<p>No matter where you are in your computer, this path will always reference the same file. However, it would be tedious to have to type this verbose path every time you needed to reference a file.</p>
<p>Fortunately, there are shortcuts. One is that the tilde (<code class="highlighter-rouge">~</code>) character stands for <code class="highlighter-rouge">/Users/<username>/</code>, where <code class="highlighter-rouge"><username></code> is the current logged-in user (i.e. you). Using the tilde, you can reference the same <code class="highlighter-rouge">file.txt</code> as:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>~/Desktop/file.txt
</code></pre>
</div>
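<p>You can watch the shell perform this expansion using the <code class="highlighter-rouge">echo</code> program (covered below), which simply prints its arguments back to the screen:</p>

```shell
# The shell expands ~ before echo ever sees it.
echo ~                   # prints your home directory, e.g. /Users/<username>
echo ~/Desktop/file.txt  # prints the full, expanded path to file.txt
```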
<p>As many of the files you’ll be working with are stored inside your user directory, this shortcut will often be helpful.</p>
<p>Another major shortcut is to use something called a “relative” path. To learn more, read on.</p>
<h2 id="understanding-the-command-line">Understanding the Command Line</h2>
<p><strong>Really, watch <a href="https://www.youtube.com/watch?v=tc4ROCJYbm0">the video</a>.</strong></p>
<p>To understand the command line, it’s valuable to first understand the various “layers” that make up a computer.</p>
<p>At the very bottom, there’s the <strong>hardware</strong>: chips, memory, and electricity. These are the fruits of electrical engineering and can do simple things very, very quickly. Programming these directly is very tedious. As a result, we wrap the hardware in a core piece of software, known as a <strong>kernel</strong>. The kernel is software that controls the hardware and the basic resources (CPU power, memory) of the computer. To get the computer to do things, we talk to the kernel. Note how, by “abstracting” away the computer internals, the problem of managing complexity just got a little bit easier.</p>
<p>For many people, interaction with a computer takes place via a graphical user interface, otherwise known as a “GUI”. Icons on your desktop, double-clicks, drag-and-drop – all of these are GUI operations. The GUI is a program, like any other, which puts things on the screen and interprets keystrokes and trackpad activity. The GUI talks to the kernel and turns your clicks and keystrokes into actions.</p>
<p>A GUI is a very sophisticated program, and GUI-based computers have been around only for the last twenty or so years. Before graphical interfaces were popular (or even possible), computing took place via a much simpler interface. That interface was known as the <strong>command line</strong>, otherwise known as the <strong>shell</strong>.</p>
<p>Why shell? Because the shell was a program that <em>wrapped around</em> (get it?) the kernel and provided a convenient way to run commands. A shell is also a program, much simpler than a GUI, which provides a text-based user interface.</p>
<p>Why would someone use a shell over a more user-friendly GUI? Principally, for control. The primary drawback of a GUI is that it can only do what it was programmed to do. It is very hard to program a GUI, and the interfaces popular on modern computers are virtually impossible to modify. A shell, on the other hand, is a simple program that can do almost anything. If a GUI is intuitive but inflexible, a command line is less intuitive (at first), but extremely powerful and flexible. For programmers, data scientists, and others for whom work involves the organization and manipulation of information, this power and flexibility is crucial. This is why people use the command line.</p>
<h2 id="working-with-the-shell">Working with the shell</h2>
<p>A shell is just a program. There are many kinds of shell. On Mac OSX, the default is the <a href="https://en.wikipedia.org/wiki/Bash_(Unix_shell)">bash shell</a>. On Windows, there is <a href="https://en.wikipedia.org/wiki/Windows_PowerShell">PowerShell</a>. On OSX, you can open a bash shell by opening the “Terminal” application. On Windows, there is a PowerShell application.</p>
<p>Firing up the shell will bring you to a boring-looking screen which looks something like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Hi! Welcome to the shell!
$ _
</code></pre>
</div>
<p>Not much to look at. You type some things and hit enter. Something happens. Rinse and repeat until your computer crashes or you’re a billionaire. Things get more interesting when you realize that every command you type into the shell looks like the following:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>
$ <program> <arguments>
</code></pre>
</div>
<p>There is always one program, and an arbitrary number of arguments. This is how every command works. Your computer comes with several dozen built-in programs, which do very simple but useful things. We will review them shortly. First, however, we must discuss the concept of a “working directory”.</p>
<p>Recall that any file can be fully described by its absolute path. When interacting with files (a common activity), you could in theory specify the full path of every file. This would be extremely tedious, especially given that, for any given task, related files are generally close together. In OSX and Windows, the Finder and Explorer programs let you browse through folders; these GUIs accomplish the same goal.</p>
<p>The <strong>working directory</strong> is your shell’s current “location” inside the filesystem. You can “navigate” the filesystem by running commands which cause the shell to move up or down directory hierarchies. This is analogous to double-clicking on a folder on your desktop.</p>
<p>When you are in a directory, every file argument is evaluated as though the current working directory were prepended to the argument. Let’s see an example, taking place in the context of the following filesystem:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/
    dir1/
        file1.py
</code></pre>
</div>
<p>Here’s the example:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/$ python dir1/file1.py
'Hello world'
/$ cd dir1
/dir1$ python file1.py
'Hello world'
/dir1$ cd ..
/$ python file1.py
/usr/bin/python: can't open file 'file1.py': [Errno 2] No such file or directory
</code></pre>
</div>
<p>Here, we saw two programs run across five commands.</p>
<ol>
<li>First, we ran the <code class="highlighter-rouge">python</code> program on argument <code class="highlighter-rouge">dir1/file1.py</code>, a Python file.</li>
<li>Then, we ran the <code class="highlighter-rouge">cd</code> (change directory) program on argument <code class="highlighter-rouge">dir1</code>, also a file (in Unix, <a href="https://en.wikipedia.org/wiki/Everything_is_a_file">everything is a file</a>, even a directory.) Note how the command prompt changes to reflect the fact that we moved through the filesystem.</li>
<li>We ran the <code class="highlighter-rouge">python</code> program on argument <code class="highlighter-rouge">file1.py</code>, the same python file.</li>
<li>We ran the <code class="highlighter-rouge">cd</code> program on <code class="highlighter-rouge">..</code>, representing the <em>parent directory</em> of the current directory.</li>
<li>We ran the <code class="highlighter-rouge">python</code> program on argument <code class="highlighter-rouge">file1.py</code>, but received an error, because there is no such file in our current location.</li>
</ol>
<p>Hopefully this example illustrates the nature of using a shell to navigate a filesystem and execute commands within it.</p>
<h2 id="command-reference">Command Reference</h2>
<p><em>For Linux-style command line interfaces</em></p>
<p>Note that in this program reference, filenames and directories can be given as either absolute or relative paths.</p>
<p><code class="highlighter-rouge">pwd</code> – print working directory</p>
<p><code class="highlighter-rouge">ls</code> – list files in current directory</p>
<p><code class="highlighter-rouge">touch <filename></code> – makes a new file</p>
<p><code class="highlighter-rouge">rm <filename></code> – delete a file</p>
<p><code class="highlighter-rouge">cd <directory></code> – change current directory to <code class="highlighter-rouge"><directory></code></p>
<p><code class="highlighter-rouge">python <filename>.py</code> – run a Python file</p>
<p><code class="highlighter-rouge">mkdir <directory></code> – create a new directory</p>
<p><code class="highlighter-rouge">rmdir <directory></code> – delete a directory</p>
<p><code class="highlighter-rouge">mv <filename1> <filename2></code> – move <code class="highlighter-rouge"><filename1></code> to <code class="highlighter-rouge"><filename2></code></p>
<p><code class="highlighter-rouge">cp <filename1> <filename2></code> – copy <code class="highlighter-rouge"><filename1></code> to <code class="highlighter-rouge"><filename2></code></p>
<p><code class="highlighter-rouge">cat <filename></code> – print the entire file to the screen</p>
<p><code class="highlighter-rouge">head <filename></code> – print the first few lines of a file to the screen</p>
<p><code class="highlighter-rouge">tail <filename></code> – print the last few lines of a file to the screen</p>
<p><code class="highlighter-rouge">man <program></code> – show additional information about the program</p>
<p><code class="highlighter-rouge">echo <string></code> – print <code class="highlighter-rouge"><string></code> to the screen (as in, a string of letters)</p>
<p><code class="highlighter-rouge">ps</code> – print information about currently-running processes (instances of programs)</p>
<p>Note that programs often accept additional optional arguments. Consider:</p>
<p><code class="highlighter-rouge">tail -n 20 <filename></code> – print the last 20 lines of a file to the screen</p>
<p><code class="highlighter-rouge">ls -lah</code> – list all files in the current directory in a detailed, human-readable format</p>
<p><code class="highlighter-rouge">ps -aux</code> – print a lot of information about currently-running processes</p>
<p><code class="highlighter-rouge">kill <integer></code> – kill the process with process id <code class="highlighter-rouge"><integer></code></p>
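<p>To see how a few of these commands fit together, here is a short session that creates a scratch directory, writes a file into it, inspects it, and cleans up (the names are invented for illustration; <code class="highlighter-rouge">></code> redirects a program’s output into a file):</p>

```shell
mkdir scratch             # create a new directory
cd scratch                # move into it
echo "hello" > notes.txt  # write echo's output into a new file
cat notes.txt             # prints: hello
ls                        # prints: notes.txt
cd ..                     # move back to the parent directory
rm scratch/notes.txt      # delete the file
rmdir scratch             # delete the now-empty directory
```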
<h2 id="stdin-stdout-processes-piping">Stdin, Stdout, Processes, Piping</h2>
<p>By default, every program takes input from one place, “Standard Input” (<code class="highlighter-rouge">stdin</code>), and sends output to one place, “Standard Output” (<code class="highlighter-rouge">stdout</code>). In general, <code class="highlighter-rouge">stdin</code> is the keyboard/trackpad. <code class="highlighter-rouge">stdout</code> is the screen. A surprisingly large amount of programming boils down to routing the output of one program into the input of another (either via some common data store like a file or database, or directly via a <strong>pipe</strong>).</p>
<p>When you execute a command in the shell, the program “takes control” of the terminal while it is running. When it finishes, it returns control of the shell. While the program is running, it may request input from stdin (such as asking for a password). It may also send output to stdout (for example, updating you on the program’s progress).</p>
<p>A <strong>process</strong> is an instance of a running program. To think of it another way, a program is just a bunch of zeroes and ones sitting in memory; a process is that program being executed, step by step, on the computer’s CPU. The same program can be run as many processes. The fundamental rule about computers is that processes aren’t allowed to mess with each other’s memory. The kernel makes sure of this (remember the kernel?).</p>
<p>It is possible to “background” a process, which simply means that you don’t let that process take control of your shell. This lets you type in more commands while the process is running. At times this behavior may be desirable. On Unix-like systems (Linux, OSX), you can background a process using the ampersand, like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/$ <program> <arg1> <arg2> &
</code></pre>
</div>
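<p>For example (the five-second timer below is just a stand-in for any long-running program):</p>

```shell
sleep 5 &   # the & backgrounds the process; you get your prompt back immediately
ps          # the sleep process appears in the list of running processes
```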
<p>Also, processes can (and often do) create (“spawn”) other processes. The Chrome process spawns Tab processes, and so on. Processes can control other processes. There are basically no limits, except that processes can only interact via a specific interface. Processes can change their working directory, or spawn subprocesses in other directories.</p>
<p>Finally, <strong>piping</strong> is a technique for routing the output of one program into the input of another:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/$ <program1> <args> | <program2>
</code></pre>
</div>
<p>Here, the flow of information would go as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>stdin -> program1 -> program2 -> stdout
</code></pre>
</div>
<p>This is useful as it allows you to combine simple programs to create more complicated programs, which <a href="https://en.wikipedia.org/wiki/Unix_philosophy">some people think is a good idea</a>.</p>
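<p>Two concrete pipes, using programs from the reference above plus <code class="highlighter-rouge">tr</code>, a small program that translates characters:</p>

```shell
# echo's output becomes tr's input, which uppercases it.
echo "hello world" | tr 'a-z' 'A-Z'   # prints: HELLO WORLD

# ls writes one name per line; wc -l counts the lines it receives,
# giving the number of files in the current directory.
ls | wc -l
```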
<h2 id="debugging-error-messages">Debugging Error Messages</h2>
<p>Debugging is a large topic. Here we will discuss the most important principle for debugging anything. That principle is: <strong>simplify and isolate the problem.</strong></p>
<p>Software systems are complex, with many moving parts. Managing this complexity is a big part of software engineering. When trying to fix something, the best way to do it is to isolate the problem.</p>
<p>Imagine you are building an app that streams data from the internet, parses it, and loads it into a custom GUI. The app is currently broken. At this point in time, the problem could be with the internet, the parser, or the GUI. Your first job is to figure out where the problem is. This means isolation. Some things to consider:</p>
<ol>
<li>Replacing streaming data with static dummy data</li>
<li>Feeding the GUI static dummy data</li>
<li>Testing the parser on dummy data</li>
</ol>
<p>By testing each of these pieces in a controlled environment, it becomes possible to find and fix problems. There are many tools to help in this process (debuggers being a major one). If these tools are not available to you, then a sure-fire approach is to cut away and simplify your program as much as possible until it works, and then carefully rebuild from there.</p>
<p>Please read your error messages. They are in words and they mean things. If you read them they will usually tell you what the problem is so you can fix it. This is true a surprising amount of the time.</p>
<p>Reading error messages is intimidating at first, but will become easier over time as you develop a better sense of <em>what</em> kinds of things tend to go wrong, as well as what information error messages are able to convey.</p>
<p>If you see an error message that you don’t understand, do the following:</p>
<ol>
<li>Copy the error</li>
<li>Paste it into a Google searchbar</li>
<li>Look over the top couple of answers</li>
</ol>
<p>This will work extremely frequently. <a href="http://stackoverflow.com/">StackOverflow</a> deserves much thanks for this.</p>
<h2 id="a-kitchen-metaphor">A Kitchen Metaphor</h2>
<p>As a former student of the controversial linguist <a href="https://en.wikipedia.org/wiki/George_Lakoff">George Lakoff</a>, it would be bad form not to at least attempt one grand metaphor.</p>
<p>Think of your computer as a professional kitchen. Imagine shelves of recipe books, each containing many instructions for how to make certain dishes. Imagine teams of chefs and sous-chefs, working away at various dishes. Raw ingredients are transformed into delicious meals. The resources consumed – gas for the oven, water for the sink – are accessed via oven ranges and sink spouts.</p>
<p>The recipes are programs – they sit idle until some chefs are asked to prepare them. The actual cooking of a dish is a process – in which the kitchen’s resources are organized and devoted to preparing a dish. Ingredients and dishes are files – the objects on which we operate. The ovens and sinks are the kernel – the interface abstracting away the underlying resources, making them easier to work with. One recipe can contain many components (sub-recipes), which need to be spawned off and prepared separately.</p>
<p>You can have a million recipes, but only cook two dishes. You can have five recipes, but make each one a thousand times. You can cook one dish at a time, and leave most of your resources unused, or turn on every single oven.</p>
<p>The owner of the restaurant is <a href="https://en.wikipedia.org/wiki/Sudo">sudo</a> – able to overrule even the head chef, and set whatever rules she wants.</p>
<h2 id="a-practical-example">A Practical Example</h2>
<p>Imagine you want to do some R programming using RStudio. You begin by double-clicking the RStudio icon on your dock, which brings up the RStudio GUI. You write some R code inside of the GUI, and run it. Let’s think about what happened:</p>
<ol>
<li>Double-clicking the RStudio icon read the RStudio program and created an RStudio process. Most (but not all) of this process is the graphical user interface.</li>
<li>While starting up, RStudio checked its settings file to see what its current working directory should be (<code class="highlighter-rouge">~</code> by default).</li>
<li>When you ran your code in the GUI, RStudio spawned a new process in the context of that working directory, which executed your code and returned the result. This is known as a REPL (a “Read-Evaluate-Print” Loop).</li>
</ol>
<p>Now, imagine your code references a data file – say, a JSON file containing a bunch of tweets. This file is stored in some location. When you’re writing your R code, the location you give <em>is resolved relative to the working directory, not the file in which it is written.</em></p>
<p>For example, if your working directory is <code class="highlighter-rouge">~</code>, and your file, <code class="highlighter-rouge">code/script.R</code>, looks like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># script.R
tweets = parseTweets('tweets.json')
</code></pre>
</div>
<p>Then RStudio will look for the file at <code class="highlighter-rouge">~/tweets.json</code> (not <code class="highlighter-rouge">~/code/tweets.json</code>). If the file is elsewhere, you’ll get an error.</p>
<h2 id="git-and-github">Git and GitHub</h2>
<p>The last topic in this guide is Git. Git is not related to the command line per se; rather, it is an important tool for the development of software. In many fields, but most of all in software, version matters. Software products are in constant states of change. Software projects almost always require the collaboration of multiple people. Managing all of this activity can be challenging. Enter Git.</p>
<p>Git is a tool for recording change over time, making it easier for people working on software not only to make sure they have the correct, up-to-date version of their code, but also to go back and see where and why changes were made. Known as source control, this bookkeeping is now fundamental in modern software development.</p>
<p>The core unit of Git is the <strong>repository</strong>. A repository can be thought of as the unit being remembered. Work on a repository is saved in chunks known as <strong>commits</strong>. A commit can be thought of as a unit of memory. Developing software is then a process of committing changes to a repository representing the software project.</p>
<p>Some benefits of using Git:</p>
<ol>
<li>Easy to restore files if they have been lost or damaged</li>
<li>Easy to get changes to files or folders without having to re-download the entire file or folder</li>
</ol>
<p>There are more, but these two are so useful that, rather than enumerate the rest, it may be best to pause and reflect on these two. Git, like many things, is a program.</p>
<p>GitHub, on the other hand, is a website which makes it easy to collaborate with others via the internet. GitHub uses Git as a foundation, and builds on it. This distinction is subtle but worth knowing.</p>
<p>Using Git involves understanding a handful of commands. Here are most of them:</p>
<p><code class="highlighter-rouge">git clone <url to repo></code> – clone a repository from GitHub to your computer</p>
<p><code class="highlighter-rouge">git add .</code> – prepare files for the next commit</p>
<p><code class="highlighter-rouge">git commit -am "<commit message>"</code> – make a commit, with the message <code class="highlighter-rouge"><commit message></code></p>
<p><code class="highlighter-rouge">git push</code> – “push” new changes up to GitHub</p>
<p><code class="highlighter-rouge">git pull</code> – “pull” new changes from GitHub</p>
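<p>Put together, a minimal local session might look like this (the repository and file names are invented; <code class="highlighter-rouge">git init</code> and <code class="highlighter-rouge">git config</code>, which create a new repository and tell Git who you are, go beyond the commands listed above):</p>

```shell
mkdir demo-repo && cd demo-repo
git init                           # turn this directory into a repository
git config user.name "Your Name"   # identify yourself to Git
git config user.email "you@example.com"
echo "first draft" > notes.txt     # create a file to track
git add .                          # stage it for the next commit
git commit -m "Add notes"          # record the commit
git log --oneline                  # shows the history: one commit so far
```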
<p>To get started:</p>
<ol>
<li><a href="https://github.com/">Make a GitHub Account</a>. Everyone who writes programs for anything has one. It’s like having an email address. If you don’t have one people will think you’re weird and won’t hire you.</li>
<li><a href="https://git-scm.com/book/en/v2/Getting-Started-Installing-Git">Install Git</a>. You can install just the command-line program or the fancy GUI.</li>
</ol>
<p>If you are taking a class taught via GitHub, one fairly effective flow is the following:</p>
<ol>
<li>“Fork” the course repository. What this basically means is that you’re going to copy the course repository to your own account.</li>
<li>Clone the forked repository to your computer.</li>
<li>Create an “upstream remote” pointing to the original course repository.</li>
<li>Point RStudio’s default working directory to the course repository.</li>
<li>Avoid many problems.</li>
</ol>
<p>Whenever you want to sync your repository with the course repository, run <code class="highlighter-rouge">git pull upstream master</code> to pull changes from the upstream repository into your fork. What this will allow you to do is to build on top of the course materials, while making it easy to synchronize with updated material as necessary.</p>
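<p>The fork-and-upstream setup can be sketched entirely on your own machine, with a local directory standing in for GitHub (every name below is made up; in a real class, the clone and upstream URLs would point at github.com):</p>

```shell
# Fabricate a tiny "course" repository to stand in for the instructor's.
# (-b master names the branch explicitly; requires Git 2.28+.)
git init -q -b master course
cd course
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Course materials"
cd ..

git clone -q course my-fork         # step 2: clone your "fork"
cd my-fork
git remote add upstream ../course   # step 3: register the upstream remote
git remote                          # lists both remotes: origin, upstream
git pull -q upstream master         # sync with the course repository
```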
Thu, 03 Mar 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/03/03/elements-of-modern-computing.html
unix command-line filesystem blog

Blockchain as Talmud

<h1 id="the-talmud">The Talmud</h1>
<p><a href="http://kronosapiens.com/2013/01/25/the-religious-mindset/">The Jewish Talmud is a remarkable object</a>. It is the product of hundreds of years of intense, rigorous, and highly formal debate and scholarship. It has served as the backbone of the Jewish people. The trunk of the tree.</p>
<p>The Talmud has a very interesting property, the inspection of which will prove illuminating.</p>
<p>First, one of the fundamental rules of Talmudic scholarship is that a recent scholar cannot contradict, reject, or overrule an older scholar. If a recent scholar takes issue with an historical analysis, the only avenue available to them is to reinterpret the intention of the older scholar. This chain of interpretation goes back, in an unbroken chain, to the first commentaries and ultimately to the old testament and the ten commandments.</p>
<p>As such, one can in theory always trace a current value, decision, or opinion back through history. Further, a concept or idea cannot be introduced arbitrarily, but must be rooted and stem from an existing concept or idea. Similarly, once an idea has been accepted, it can never be fully rejected, only reinterpreted.</p>
<p>The quality of the interpretations, as well as the intention of the interpreters, is a subject of ongoing debate and, well, interpretation. But this property in general holds.</p>
<p>This has several important implications.</p>
<p><strong>First</strong>, there is no possibility of complete revolution. A “revolution” in Judaism, defined as a complete rejection of what has come before and the attempt to institute a new faith on entirely new foundations, could never occur. If such a thing were attempted, those individuals would be seen as a new sect, ultimately disconnected from primary Judaism. The core of Judaism, defined as those who adhere to the teachings of the Talmud and associated texts, is fundamentally connected to this canon. <em>Jews hold the Talmud as the primary authority</em>. No one is forcing the Jews to respect the Talmud; it is simply that the study of the Talmud and its instruction is the common denominator for the Jewish identity. Any individual Jew is, at any time, free to completely reject the entirety of the Talmud. Such a person, however, would cease to be accepted by their community. In this way, the Talmud coordinates the self-identifying community of Jews.</p>
<p><strong>Second</strong>, the faith is capable of substantial dynamism. Interpretations can be fanciful and radical. Although new work must be based on historical scholarship, it is often the case that scholars will not agree with their contemporaries. In this way, the faith subdivides into movements, each respecting a particular strand of interpretation.</p>
<p><strong>Third</strong>, there is no need for a central authority. The extent to which any individual person holds themselves accountable to these laws and interpretations is, of course, a personal decision. Orthodox Jews take these laws very literally, and yet various schools of the orthodoxy have varying interpretations of some of the more ambiguous implementations of the faith. The Conservative, Reform, Reconstructionist, and other more progressive flavors permit more liberal general interpretations – interpretations which are, of course, rejected by the orthodoxy. The key here is that this is a single canon, to which all Jews can be seen as being in relation. Particularly relevant is that while inter-movement relationships are undefined and may be nonexistent or even hostile, the overall coordination of the movements is implicitly achieved.</p>
<p><strong>Fourth</strong>, the faith as a whole can never be destroyed, as the Talmud functions also as a memory. Regardless of what may occur to some or even a majority of the adherents of the faith, the survivors will be able to rebuild the community to within an arbitrary precision. This implication is well-documented by historical experience.</p>
<p><strong>Fifth</strong>, the Talmud is “bigger” and “wiser” than any individual. Considering Plato’s “Philosopher King”, we observe that individual humans are insufficient for the task. A shared, dynamic history of thought, however, might be. As the product of a history of reason and debate, the Talmud represents a cultural history orders of magnitude larger than any individual person. It has a fundamentally different, unique, and very functionally-relevant ontology.</p>
<p><strong>Sixth</strong> and relatedly, the Talmud then exhibits “maximal intelligence”, in the sense that the unbroken chain of interpretation represents more overall experience than any record which allowed for erasure and editing.</p>
<h1 id="democracy">Democracy</h1>
<p>Let us briefly consider the similarities and differences between Talmud-based governance and the kind of Democratic governance exemplified by the United States.</p>
<p>In both cases, we have the principle that decisions must occur within set boundaries. In the case of the Talmud, those boundaries are historical scholarship and foundational texts. In the case of the United States, those boundaries are the Constitution and Bill of Rights.</p>
<p>In both cases, there are established mechanisms for making changes. In the case of the Talmud, changes are made via extrapolation and interpretation of past work. In the case of the United States, changes are made via a legislative process.</p>
<p>A salient difference is that in the case of the United States, future changes are disconnected from past changes; if the basic boundaries are respected, then anything goes. It is technically possible to “reinterpret” the basic boundaries by amending the constitution, but this seems highly unlikely.</p>
<p>As such, we see the Talmudic process as having more gradual changes, while the US process sways more easily in changing political winds.</p>
<p>Of course, much of this difference can be rooted in the fact that the United States must secure territory, and relies on the use of force to enforce rules. As a faith, the Jews permit less well-defined borders. The United States and the Jewish people, like all states and religious communities, are entities of a fundamentally different nature. As such, they seem to necessitate fundamentally different approaches to change and control. Religion is currently (although not historically) an opt-in experience; citizenship typically is not.</p>
<p>Yet, we see some shared principles in both forms of governance. The differences appear to be in large part necessary differences coming from the fundamentally different natures of these entities, in particular with regards to group membership. As such, we should not necessarily feel obligated to reconcile them.</p>
<h1 id="the-blockchain">The Blockchain</h1>
<p>The Talmud, its properties, and role in Jewish life provides a crucial case study for those interested in effective means of coordinating large groups of people absent a central authority. The fundamental mechanic is the strict requirement that future change emerges from past work, prohibiting both the introduction of the completely novel and the rejection of any history. The historical experience of the Jews has shown that, if put in motion upon adequate foundations, such a mechanic is sufficient for the decentralized coordination of the activity of millions of people across time and space.</p>
<p>Recently, we have seen the emergence of technology which shares this property. The Blockchain, first described in the 2008 paper “<a href="https://bitcoin.org/bitcoin.pdf">Bitcoin: A Peer-to-Peer Electronic Cash System</a>”, is in essence a decentralized public ledger, in which anything can be recorded and made publicly available. The principal mechanic of the Blockchain is that future entries must build upon past entries, and that any entry in the chain, once accepted, can never be deleted. In the words of the author, Satoshi Nakamoto:</p>
<blockquote>
<p>The only way to confirm the absence of a transaction <a href="http://www.econlib.org/library/Essays/hykKnw1.html">is to be aware of all transactions</a>. In the mint based model, the mint was aware of all transactions and decided which arrived first. To accomplish this without a trusted party, transactions must be publicly announced, and we need a system for participants to agree on a single history of the order in which they were received.</p>
</blockquote>
<p>The Blockchain is thus a decentralized authority in which all new changes must be a continuation of past work. As such, the example of the Talmud suggests that such a tool could be used for the effective decentralized coordination of large groups of people across time and space, without the need for a central authority or any force.</p>
<p>In order for the Blockchain to be used in this way, it will be necessary for a large group of people to regard the Blockchain as an authority on a wide variety of issues. As with the Jews and the Talmud, the Talmud is an authority because Jews <em>see it as an authority</em>. In an important sense, this is arbitrary. Fortunately, this arbitrariness suggests that there is no fundamental shortcoming which prevents a Blockchain from serving a similar purpose.</p>
<p>This leads us to some interesting questions:</p>
<ul>
<li>
<p>What kind of information should this Blockchain contain?</p>
</li>
<li>
<p>How should this Blockchain be updated?</p>
</li>
<li>
<p>What content, if any, should be placed at the base of the Blockchain?</p>
</li>
</ul>
<p>As a first pass, it would seem as though this Blockchain should function as a repository of social values. As the community defined by the Blockchain adapts, and external circumstances change, these values would be updated. At any point, any member of the community could inspect the history of these values and the reasoning for their changes. As time passed, this record would become a deep and strong foundation for that community, and a trusted authority on the values and purpose of that community. Coordination without an authority. Updates to this Blockchain would occur on a rough consensus basis, allowing for the possibility of a split of the Blockchain at some point if there emerged a major disagreement over the direction of the community.</p>
<p>Given the earlier discussion of Democracy, it does not seem at this time that the Blockchain can be used effectively to govern a state; issues of control and security seem to preclude the rational, gradual, consensus-based change process we are discussing. However, it does seem as though the Blockchain could serve as an effective authority and memory for a self-selected community with shared values and without borders.</p>
<p>One could point to other self-selecting communities with rough consensus-based decision-making processes, and observe that they succeed while employing very different decision-making systems. The Python community, with its BDFL (Benevolent Dictator For Life) and PEP-based system of improvement, has been quite effective. Yet a series of disconnected proposals is ontologically dissimilar to an unbroken chain of decisions. The importance of this dissimilarity remains to be determined.</p>
<p>It would be very exciting to be a part of such a community.</p>
<h1 id="update-feb-1">Update (Feb 1)</h1>
<p>I shared this post with a few friends of mine with relevant domain knowledge. They gave valuable feedback and raised additional questions. Their responses are replicated below.</p>
<p><strong>From a Professor of Political Economy:</strong></p>
<blockquote>
<p>Interesting read–but I wonder about practicality. The beauty of many societies is being able to effect rapid change. The way you lay this out, suggests that this may be a lot more difficult. It may be a way to insure calmer legislation, etc. But it doesn’t move really seem to me to lead to faster movement of anything.</p>
</blockquote>
<p><strong>From a Rabbi:</strong></p>
<blockquote>
<p>This was fascinating. Thanks for sharing.
You know, I think the question I’m sitting with is, given decentralized systems of law or commerce, what role does the organizer or covener have, and how much authority is in that role. The Talmud, for example, is a collection of many voices, but someone did the collecting. The work of that someone, the editor (or likely, editors) is what academics are particularly fascinated with these days. So, too, with any wiki. There is someone who hosts it. How much control do they have? And I imagine that there are also decisionmakers with a Blockchain. What does it mean, then, to be at the center of a decentralizes system?</p>
</blockquote>
Mon, 11 Jan 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/01/11/blockchain-as-talmud.html
http://kronosapiens.github.io/blog/2016/01/11/blockchain-as-talmud.htmlblockchaingovernmentblogUnderstanding Variational Inference<p>In one of my courses this semester, <a href="http://www.columbia.edu/~jwp2128/Teaching/E6892/E6892Fall2015.html">Bayesian Models for Machine Learning</a>, we’ve been spending quite a bit of time on a technique called “<a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">Variational Inference</a>”. I’ve spent the last few days working on an assignment using this technique, so I thought this would be a good occasion to test my knowledge by attempting to describe the method. Much credit to <a href="http://www.columbia.edu/~jwp2128/">John Paisley</a> for teaching me all this in the first place. The Columbia faculty are really top-notch.</p>
<h2 id="high-level-overview">High-level Overview</h2>
<p>First, a brief overview of Bayesian statistics. We begin with a distribution on our data <script type="math/tex">x</script>, parameterized by <script type="math/tex">\theta</script>:</p>
<script type="math/tex; mode=display">p(x | \theta)p(\theta)</script>
<p>Using basic rules of joint and conditional probability, we derive the theorem:</p>
<script type="math/tex; mode=display">p(\theta, x) = p(x, \theta)</script>
<script type="math/tex; mode=display">p(\theta | x)p(x) = p(x | \theta)p(\theta)</script>
<script type="math/tex; mode=display">\underbrace{
p(\theta | x) = \frac{p(x | \theta)p(\theta)}{p(x)}
}_{\text{Bayes Theorem}}</script>
<p>For those unfamiliar with Bayes Theorem, our goal is to use the data <script type="math/tex">x</script> to learn a <em>better</em> distribution on <script type="math/tex">\theta</script>. In other words, Bayes Theorem gives us a formal way to update our predictions, given our experience of the world. In particular,</p>
<script type="math/tex; mode=display">p(\theta | x)</script>
<p>is known as the <em>posterior</em> distribution of <script type="math/tex">\theta</script>. This distribution is our goal.</p>
<p>Now, the formulas given above are in terms of probability distributions. If we actually look under the hood at what these probabilities look like… well, it looks like a lot of calculus. Using Bayes Theorem involves working with integrals: in particular, the marginal <script type="math/tex">p(x)</script> in the denominator is an integral over <script type="math/tex">\theta</script>. Fortunately, over the years statisticians have developed a pretty sophisticated body of knowledge around manipulating these probability distributions (conjugate priors, in particular), so that if you make smart choices about which distributions you pick, you can skip basically all of the calculus. This is convenient.</p>
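<p>To make those “smart choices” concrete, here is a small sketch (my own, not from the post) of the classic conjugate case: a Beta prior on a coin’s bias with Bernoulli observations. The posterior update is pure arithmetic; no integrals are evaluated.</p>

```python
# Conjugate Beta-Bernoulli update: with a Beta(a, b) prior on a coin's
# bias and k heads in n flips, the posterior is Beta(a + k, b + n - k).
def beta_bernoulli_posterior(a, b, k, n):
    return a + k, b + (n - k)

# Uniform prior Beta(1, 1), then 7 heads in 10 flips:
a_post, b_post = beta_bernoulli_posterior(1.0, 1.0, 7, 10)
posterior_mean = a_post / (a_post + b_post)  # (a + k) / (a + b + n)
```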
<p>However, for more complicated models, things aren’t always guaranteed to work out so nicely. Sometimes, when we try to model something, we find that calculating the posteriors directly is impossible. What can we do?</p>
<p>Fortunately, statisticians have developed techniques for handling this. Variational Inference is one of those techniques.</p>
<p>Now, we will derive the VI master equation.</p>
<p>Recall some basic rules of probability:</p>
<script type="math/tex; mode=display">p(\theta | x)p(x) = p(x, \theta)</script>
<script type="math/tex; mode=display">p(x) = \frac{p(x, \theta)}{p(\theta | x)}</script>
<script type="math/tex; mode=display">lnp(x) = lnp(x, \theta) - lnp(\theta | x)</script>
<p>Now, we introduce an entirely new distribution, <script type="math/tex">q(\theta)</script>, and take the expectation with respect to this distribution:</p>
<script type="math/tex; mode=display">E_{q(\theta)}[lnp(x)] = E_{q(\theta)}[lnp(x, \theta)] - E_{q(\theta)}[lnp(\theta | x)]</script>
<script type="math/tex; mode=display">\int q(\theta) lnp(x) d\theta = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta</script>
<p>Observing that the left-hand term is constant with respect to <script type="math/tex">\theta</script>:</p>
<script type="math/tex; mode=display">lnp(x) \int q(\theta) d\theta = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta</script>
<script type="math/tex; mode=display">lnp(x) = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta</script>
<p>We then add and subtract the <a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution">entropy</a> of <script type="math/tex">q(\theta)</script>:</p>
<script type="math/tex; mode=display">lnp(x) = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta
+ \int q(\theta) lnq(\theta) d\theta - \int q(\theta) lnq(\theta) d\theta</script>
<p>And reorganize:</p>
<script type="math/tex; mode=display">lnp(x) = \int q(\theta) (lnp(x, \theta) - lnq(\theta))d\theta - \int q(\theta) (lnp(\theta | x) - lnq(\theta)) d\theta</script>
<script type="math/tex; mode=display">lnp(x) =
\underbrace{
\int q(\theta) ln\frac{p(x, \theta)}{q(\theta)} d\theta
}_{L}
+ \underbrace{
\int q(\theta) ln\frac{q(\theta)}{p(\theta | x)} d\theta
}_{KL(q||p)}</script>
<p>Let’s take a moment to understand what was just derived. We have shown that the log probability of the random variable <script type="math/tex">x</script> is equal to the involved-looking equation on the right-hand side. This right-hand term is the sum of two terms. The first, which we call <script type="math/tex">L</script>, we will refer to as the “Variational Objective Function”. The second is the equation of something known as the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback–Leibler</a> divergence, or KL divergence for short.</p>
<p>Recall that our goal is to learn</p>
<script type="math/tex; mode=display">p(\theta | x)</script>
<p>We see that this term appears in the KL divergence, next to this new distribution we are calling <script type="math/tex">q(\theta)</script>. Conveniently, the KL divergence is a measure of the difference between two distributions. When the distributions are equal, the KL divergence equals 0. The more the distributions differ, the larger the term becomes. We’re not sure what this <script type="math/tex">q</script> distribution is, but let’s assume that we can control it. The closer it comes to approximating the posterior, the smaller the KL divergence will become.</p>
<p>We see also that the left-hand term, <script type="math/tex">lnp(x)</script>, is a constant (just the probability of the data). So we have a constant equal to an equation plus a term we want to minimize. Therefore, if we can find a way to <strong>maximize</strong> the <script type="math/tex">L</script> term, we will necessarily <strong>minimize</strong> the KL divergence. Therefore, the problem becomes one of finding a <script type="math/tex">q(\theta)</script> distribution which maximizes <script type="math/tex">L</script>!</p>
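<p>This decomposition can be checked numerically. The sketch below (with illustrative numbers of my own, assuming a discrete <script type="math/tex">\theta</script> taking three values) verifies that <script type="math/tex">lnp(x) = L + KL</script> holds for an arbitrary <script type="math/tex">q</script>:</p>

```python
import numpy as np

# Discrete sanity check of ln p(x) = L + KL(q || p(theta | x)).
# The joint values below are illustrative, not from any real model.
joint = np.array([0.10, 0.25, 0.05])     # p(x, theta) over 3 theta values
p_x = joint.sum()                        # marginal p(x)
posterior = joint / p_x                  # p(theta | x)

q = np.array([0.2, 0.5, 0.3])            # an arbitrary distribution on theta
L = np.sum(q * np.log(joint / q))        # variational objective
KL = np.sum(q * np.log(q / posterior))   # KL(q || posterior), always >= 0
# L + KL recovers ln p(x); when q equals the posterior, KL = 0 and L = ln p(x)
```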
<h2 id="in-context">In Context</h2>
<p>Let’s consider the model in the context of a very interesting problem. Say we have a non-linear function (such as <script type="math/tex">sin(x), sinc(x)</script> or similar) and we would like to approximate this function via Bayesian Linear Regression. Approximating a non-linear function via a naive linear regression is not really feasible. However, if we expand the data into higher dimensions, it may be possible to learn a linear regression in the higher-dimensional space that corresponds to a non-linear function in the original dimension. This is what we will attempt in this problem.</p>
<p>We will project <script type="math/tex">x_1, \ldots, x_n \in R^{2}</script> into <script type="math/tex">R^{n}</script>, by projecting each <script type="math/tex">x_i</script> into a vector of distances defined by the Gaussian kernel. In other words, every <script type="math/tex">x_i</script> will become a vector representing that point’s distance from every other point in the set – and <script type="math/tex">X</script>, the data, becomes an <script type="math/tex">n \times n</script> matrix of distances, with <script type="math/tex">X_{ii} = 1</script> and <script type="math/tex">X_{ij} \leq 1</script> for all <script type="math/tex">j \neq i</script>. Adding a column of ones to represent any intercept term, we can now interpret the problem as a regression. Our goal is then to find a coefficient vector <script type="math/tex">w</script> which, given the vector of distances, maps the point to its correct location in the original space, <script type="math/tex">R^{2}</script>.</p>
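<p>A minimal sketch of this kernel expansion, assuming a hypothetical <code>bandwidth</code> parameter for the Gaussian kernel (the post does not specify one):</p>

```python
import numpy as np

def gaussian_kernel_features(X, bandwidth=1.0):
    """Map n points to an n x (n+1) design matrix: a column of ones
    (the intercept) followed by Gaussian-kernel similarities, where
    K[i, j] = exp(-||x_i - x_j||^2 / bandwidth). X has shape (n, d)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / bandwidth)
    return np.hstack([np.ones((len(X), 1)), K])

X = np.random.randn(50, 2)
Phi = gaussian_kernel_features(X)
# The kernel part has ones on its diagonal (each point's similarity
# to itself) and all entries at most 1, as described above.
```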
<p>This transformation is quite profound. We have many tools for working with linear functions (i.e. linear algebra), but fewer for working with non-linear functions. We found ourselves facing a problem, and rather than attempt to solve the problem using a limited toolset, we simply transformed the problem into one that we can approach skillfully. For any Ender’s Game fans out there, this is an “enemy’s gate is down” kind of moment. Anyway.</p>
<p>Complicating the problem further, we would like to encourage sparsity in <script type="math/tex">w</script> – in other words, we would like to identify a small subset of <script type="math/tex">X</script> which are sufficiently discriminative to allow us to correctly place the other points. We can think of these points as “vantage points”, and they are especially good at creating distance between points and illuminating their differences.</p>
<p>With that setup, we turn to the actual model, which looks like this:</p>
<script type="math/tex; mode=display">y_i \sim N(x_i^Tw, \lambda^{-1})</script>
<script type="math/tex; mode=display">w_i \sim N(0, diag(\alpha_1, ..., \alpha_d)^{-1})</script>
<script type="math/tex; mode=display">\lambda \sim Gamma(e_0, f_0)</script>
<script type="math/tex; mode=display">\alpha_k \sim Gamma(a_0, b_0)</script>
<p>The joint probability distribution is as follows:</p>
<script type="math/tex; mode=display">p(x,y,w, \lambda, \alpha)
= \prod_{i=1}^n p(y_i | x_i, w, \lambda)p(\lambda | e_0, f_0)p(w | \alpha)\prod_{k=1}^dp(\alpha_k| a_0, b_0)</script>
<p>And the log joint probability is as follows:</p>
<script type="math/tex; mode=display">lnp(x,y,w, \lambda, \alpha)
= \sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(\lambda | e_0, f_0) + lnp(w | \alpha) + \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)</script>
<p>We model the <script type="math/tex">q</script> distribution as a joint probability of independent distributions, one per variable:</p>
<script type="math/tex; mode=display">q(w, \lambda, \alpha) = q(w)q(\lambda)q(\alpha)</script>
<p>Now, with the model defined, we plug these values into the VI master equation we derived earlier:</p>
<script type="math/tex; mode=display">lnp(y, x) =
\int q(w)q(\lambda)q(\alpha) ln \frac{p(x,y,w, \lambda, \alpha)}{q(w)q(\lambda)q(\alpha)} dw d\lambda d\alpha
+ \int q(w)q(\lambda)q(\alpha) ln \frac{q(w)q(\lambda)q(\alpha)}{p(w, \lambda, \alpha | x, y)} dw d\lambda d\alpha</script>
<h2 id="learning-q">Learning q</h2>
<p>Recall, our goal is to maximize:</p>
<script type="math/tex; mode=display">L = \int q(w)q(\lambda)q(\alpha) ln \frac{p(x,y,w, \lambda, \alpha)}{q(w)q(\lambda)q(\alpha)} dw d\lambda d\alpha</script>
<p>We will do this by finding better values for <script type="math/tex">q(w)q(\lambda)q(\alpha)</script>. To show how this is done, let’s first consider <script type="math/tex">q(w)</script> in isolation (the process will be the same for each variable). First, we reorganize the equation to remove all terms which are constant with respect to <script type="math/tex">w</script> (in other words, which won’t change regardless of <script type="math/tex">w</script>, so aren’t important when it comes to maximizing the equation with regards to <script type="math/tex">w</script>):</p>
<script type="math/tex; mode=display">L = \int q(w) q(\lambda)q(\alpha) ln p(x,y,w, \lambda, \alpha) d\lambda d\alpha dw
- \int q(w) ln q(w) dw
- \text{const w.r.t } w</script>
<p>Observe next that we can interpret the first integral as an expectation, where <script type="math/tex">E_{-q(w)} = E_{q(\lambda)q(\alpha)}</script>, the expectation over all other variables.</p>
<script type="math/tex; mode=display">L = \int q(w) E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)] dw
- \int q(w) ln q(w) dw
- \text{const w.r.t } w</script>
<p>We will now pull off some slick math. First, observe that <script type="math/tex">ln e^x = x</script>. Now:</p>
<script type="math/tex; mode=display">L = \int q(w) ln \frac{e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}}{q(w)} dw
- \text{const w.r.t } w</script>
<p>This is looking an awful lot like our friend the KL divergence. If only the numerator were a probability distribution! Fortunately we can make it one, by introducing a new term <script type="math/tex">Z</script>:</p>
<script type="math/tex; mode=display">Z = \int e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]} dw</script>
<p>Here, Z can be interpreted as the normalizing constant for the distribution <script type="math/tex">e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}</script>. By adding and subtracting <script type="math/tex">lnZ</script>, which is constant with respect to <script type="math/tex">w</script>, we witness some more slick math:</p>
<script type="math/tex; mode=display">L = \int q(w) ln \frac{e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}}{q(w)} dw
- \text{const w.r.t } w + lnZ - lnZ</script>
<script type="math/tex; mode=display">L = \int q(w) ln \frac{\frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}}{q(w)} dw
- \text{const w.r.t } w</script>
<p>Where are we now? We have successfully transformed the integral into a KL divergence between <script type="math/tex">q(w)</script>, our distribution of interest, and <script type="math/tex">\frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}</script>, which is an expression involving terms we know. Specifically, we have:</p>
<script type="math/tex; mode=display">-KL(q(w)\|\frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]})</script>
<p>We want to maximize this expression, which is equivalent to <em>minimizing</em> the KL divergence. The KL divergence reaches its minimum of zero when the two distributions are equal. Therefore, we know that:</p>
<script type="math/tex; mode=display">q(w) = \frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}</script>
<p>Sweet! Now we just need to solve for the left hand term. It’s worth pausing and noting how much fancy math it took to get us here. We relied on properties of logarithms, expectations, KL divergence, and the mechanics of probability distributions to derive this expression.</p>
<p>To actually evaluate this expression and figure out what <script type="math/tex">q(w)</script> should be, we’ll rewrite things to remove terms not involving <script type="math/tex">w</script>, by absorbing them in the normalizing constant. To see why this is the case, let’s first rewrite the expectation:</p>
<script type="math/tex; mode=display">E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]</script>
<p>Recalling the log joint probability we derived earlier:</p>
<script type="math/tex; mode=display">E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda)
+ lnp(\lambda | e_0, f_0)
+ lnp(w | \alpha)
+ \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)]</script>
<script type="math/tex; mode=display">lnp(\lambda | e_0, f_0) + \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)
+ E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]</script>
<p>Putting this back into context, we can rewrite the distribution:</p>
<script type="math/tex; mode=display">\frac{
e^{lnp(\lambda | e_0, f_0) + \sum_{k=1}^dlnp(\alpha_k| a_0, b_0)}
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}
}{
\int e^{lnp(\lambda | e_0, f_0) + \sum_{k=1}^dlnp(\alpha_k| a_0, b_0)}
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]} dw
}</script>
<p>Bringing outside of the integral all terms constant with respect to <script type="math/tex">w</script>:</p>
<script type="math/tex; mode=display">\frac{
e^{lnp(\lambda | e_0, f_0) + \sum_{k=1}^dlnp(\alpha_k| a_0, b_0)}
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}
}{
e^{lnp(\lambda | e_0, f_0) + \sum_{k=1}^dlnp(\alpha_k| a_0, b_0)}
\int e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]} dw
}</script>
<p>And then cancelling:</p>
<script type="math/tex; mode=display">\frac{
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}
}{
\int e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]} dw
}</script>
<p>All that is left to do is to evaluate the expression</p>
<script type="math/tex; mode=display">e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}</script>
<p>to learn the distribution. We will not go through the specific derivation here, which involves evaluating the expectation of the log of the distributions on <script type="math/tex">y_i</script> and <script type="math/tex">w</script>; instead we will skip to the final result and claim that:</p>
<script type="math/tex; mode=display">q(w) \sim N(\mu, \Sigma)</script>
<p>With:</p>
<script type="math/tex; mode=display">\Sigma = (E_{q(\alpha)}[diag(\alpha)] + E_{q(\lambda)}[\lambda] \sum_{i=1}^n x_i x_i^T)^{-1}</script>
<script type="math/tex; mode=display">\mu = \Sigma(E_{q(\lambda)}[\lambda] \sum_{i=1}^n y_i x_i)</script>
<p>The <strong>key</strong> observation to make here is that <script type="math/tex">q(w)</script> involves the expected values of the <em>other</em> model variables. This will be true for the other variables as well. To show this, here are <script type="math/tex">q(\lambda)</script> and <script type="math/tex">q(\alpha)</script>:</p>
<script type="math/tex; mode=display">q(\lambda) \sim Gamma(e, f)</script>
<script type="math/tex; mode=display">e = e_0 + \frac{n}{2}</script>
<script type="math/tex; mode=display">f = f_0 + \frac{1}{2} \sum_{i=1}^n [(y_i - E_{q(w)}[w]^T x_i)^2 + x_i^T Var_{q(w)}[w] x_i]</script>
<script type="math/tex; mode=display">q(\alpha_k) \sim Gamma(a, b_k)</script>
<script type="math/tex; mode=display">a = a_0 + \frac{1}{2}</script>
<script type="math/tex; mode=display">b_k = b_0 + \frac{1}{2} E_{q(w)}[ww^T]_{kk}</script>
<p>Finally, we give the expectations:</p>
<script type="math/tex; mode=display">E_{q(w)}[w] = \mu</script>
<script type="math/tex; mode=display">Var_{q(w)}[w] = \Sigma</script>
<script type="math/tex; mode=display">E_{q(w)}[ww^T] = \Sigma + \mu\mu^T</script>
<script type="math/tex; mode=display">E_{q(\lambda)}[\lambda] = \frac{e}{f}</script>
<script type="math/tex; mode=display">E_{q(\alpha_k)}[\alpha_k] = \frac{a}{b_k}</script>
<p>Now we are prepared to discuss the value of this technique. Note how each <script type="math/tex">q()</script> distribution is a function of the expectations of the other random variables, <em>with respect to their <script type="math/tex">q()</script> distributions</em>. This means that as one distribution changes, the others change… causing the first to change, causing the others to change, over and over again in a loop. The insight is that each change brings the <script type="math/tex">q()</script> distributions closer to the true posteriors we are trying to approximate. In other words, each iteration through this update loop gives us a better set of <script type="math/tex">q()</script>, as improved values for one give improved values for the others. Also note that since we have solved for the various <script type="math/tex">q()</script> distributions solely in terms of the data <script type="math/tex">y_i, x_i</script> and the expectations <script type="math/tex">E[w], E[\lambda], E[\alpha]</script>, we can implement the algorithm efficiently using only basic arithmetic operations, without having to do any calculus or derive anything!</p>
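<p>The update loop can be sketched as follows. This is my own illustrative implementation of the <script type="math/tex">q()</script> updates given above; the hyperparameter defaults are guesses, and convergence monitoring via <script type="math/tex">L</script> is omitted for brevity.</p>

```python
import numpy as np

def variational_inference(X, y, a0=1.0, b0=1.0, e0=1.0, f0=1.0, iters=100):
    """Coordinate-ascent VI for the regression model above.
    X is (n, d), y is (n,). Defaults are illustrative; tiny a0, b0
    (e.g. 1e-16, as discussed later) would encourage sparsity in w."""
    n, d = X.shape
    E_lambda = e0 / f0                 # E[lambda] under the prior
    E_alpha = np.full(d, a0 / b0)      # E[alpha_k] under the prior
    XtX = X.T @ X                      # sum_i x_i x_i^T
    Xty = X.T @ y                      # sum_i y_i x_i
    e = e0 + n / 2.0                   # fixed across iterations
    a = a0 + 0.5                       # fixed across iterations
    for _ in range(iters):
        # q(w) = N(mu, Sigma)
        Sigma = np.linalg.inv(np.diag(E_alpha) + E_lambda * XtX)
        mu = Sigma @ (E_lambda * Xty)
        # q(lambda) = Gamma(e, f); sum((X @ Sigma) * X) = sum_i x_i^T Sigma x_i
        resid = y - X @ mu
        f = f0 + 0.5 * (resid @ resid + np.sum((X @ Sigma) * X))
        E_lambda = e / f
        # q(alpha_k) = Gamma(a, b_k), with E[w w^T]_kk = Sigma_kk + mu_k^2
        b = b0 + 0.5 * (np.diag(Sigma) + mu ** 2)
        E_alpha = a / b
    return mu, Sigma, (e, f), (a, b)
```

<p>Each pass updates <script type="math/tex">q(w)</script>, then <script type="math/tex">q(\lambda)</script>, then each <script type="math/tex">q(\alpha_k)</script>, always using the latest expectations of the other variables.</p>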
<h2 id="assessing-convergence">Assessing Convergence</h2>
<p>The last thing we will need to do is discuss the process for assessing convergence – calculating how much and how quickly our <script type="math/tex">q</script> distributions are closing in on the true posteriors. To do this, we will have to evaluate the entire <script type="math/tex">L</script> equation using the new <script type="math/tex">q</script> distributions. Recall the equation:</p>
<script type="math/tex; mode=display">L = \int q(w)q(\lambda)q(\alpha) ln \frac{p(x,y,w, \lambda, \alpha)}{q(w)q(\lambda)q(\alpha)} dw d\lambda d\alpha</script>
<p>Which can be written as follows:</p>
<script type="math/tex; mode=display">L = \int q(w)q(\lambda)q(\alpha) ln p(x,y,w, \lambda, \alpha) dw d\lambda d\alpha
- \int q(w) ln q(w) dw
- \int q(\lambda) ln q(\lambda) d\lambda
- \int q(\alpha) ln q(\alpha) d\alpha</script>
<p>And interpreted as a sum of expectations:</p>
<script type="math/tex; mode=display">L = E_{q(w, \lambda, \alpha)}[ln p(x,y,w, \lambda, \alpha)]
- E_{q(w)}[ln q(w)]
- E_{q(\lambda)}[ln q(\lambda)]
- E_{q(\alpha)}[ln q(\alpha)]</script>
<script type="math/tex; mode=display">L = E_{q(w, \lambda, \alpha)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(\lambda | e_0, f_0) + lnp(w | \alpha) + \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)]
- E_{q(w)}[ln q(w)]
- E_{q(\lambda)}[ln q(\lambda)]
- E_{q(\alpha)}[ln q(\alpha)]</script>
<p>Which breaks down as follows:</p>
<script type="math/tex; mode=display">L =
\sum_{i=1}^n E_{q(w, \lambda)}[lnp(y_i | x_i, w, \lambda)]
+ E_{q(\lambda)}[lnp(\lambda | e_0, f_0)]
+ E_{q(w, \alpha)}[lnp(w | \alpha)]
+ \sum_{k=1}^d E_{q(\alpha)}[lnp(\alpha_k| a_0, b_0)]</script>
<script type="math/tex; mode=display">- E_{q(w)}[ln q(w)]
- E_{q(\lambda)}[ln q(\lambda)]
- E_{q(\alpha)}[ln q(\alpha)]</script>
<p>There is an important subtlety in evaluating these expectations. To understand this subtlety, let’s look at two of the terms:</p>
<script type="math/tex; mode=display">E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] - E_{q(\lambda)}[ln q(\lambda)]</script>
<p>Observe how both are expectations over <script type="math/tex">q(\lambda)</script>. However the probability distributions are not the same. To see how this works out, let’s write out the actual log probabilities (both distributions are Gamma).</p>
<script type="math/tex; mode=display">E_{q(\lambda)}[lnp(\lambda | e_0, f_0)]
= E_{q(\lambda)}[(e_0lnf_0 - ln\Gamma(e_0)) + (e_0 - 1)ln\lambda - f_0\lambda]</script>
<script type="math/tex; mode=display">E_{q(\lambda)}[ln q(\lambda)]
= E_{q(\lambda)}[(e lnf - ln\Gamma(e)) + (e - 1)ln\lambda - f\lambda]</script>
<p>Now, passing the expectations through:</p>
<script type="math/tex; mode=display">(e_0lnf_0 - ln\Gamma(e_0)) + (e_0 - 1)E_{q(\lambda)}[ln\lambda] - f_0E_{q(\lambda)}[\lambda]</script>
<script type="math/tex; mode=display">(e lnf - ln\Gamma(e))+ (e - 1)E_{q(\lambda)}[ln\lambda] - fE_{q(\lambda)}[\lambda]</script>
<p>Writing in terms of the difference (note that the sign changes in the second expectation):</p>
<script type="math/tex; mode=display">(e_0lnf_0 - ln\Gamma(e_0)) + (e_0 - 1)E_{q(\lambda)}[ln\lambda] - f_0E_{q(\lambda)}[\lambda]
- (e lnf - ln\Gamma(e)) - (e - 1)E_{q(\lambda)}[ln\lambda] + fE_{q(\lambda)}[\lambda]</script>
<p>And combining terms:</p>
<script type="math/tex; mode=display">(e_0lnf_0 - ln\Gamma(e_0)) - (e lnf - ln\Gamma(e))
+ (e_0 - e)E_{q(\lambda)}[ln\lambda] - (f_0 - f)E_{q(\lambda)}[\lambda]</script>
<p>Notice how for both distributions, <script type="math/tex">E_{q(\lambda)}[\lambda]</script> is identical. This is because <script type="math/tex">E_{q(\lambda)}[\lambda]</script> is a function of the distribution with which we are taking the expectation, <script type="math/tex">q(\lambda)</script>. Therefore, even though the parameters for <script type="math/tex">p(\lambda)</script> don’t change (they remain <script type="math/tex">e_0, f_0</script> always), <script type="math/tex">p(\lambda)</script> evaluates to a different result as <script type="math/tex">q(\lambda)</script> changes. For <script type="math/tex">q(\lambda)</script>, on the other hand, we always use the latest values of <script type="math/tex">f, e</script>.</p>
<p>At first I found it counterintuitive that <script type="math/tex">e_0, f_0</script> should be constant through every iteration – the Bayesian insight is that priors are constantly being updated as information comes in. The reason why, in this case, <script type="math/tex">e_0, f_0</script> are constant (and this is true for the priors on the other distributions as well) is that the entire Variational Inference algorithm is meant to approximate a <strong>single</strong> Bayesian update. Thus, it is wrong to interpret the <script type="math/tex">q()</script> distribution learned from iteration <script type="math/tex">t</script> as the new prior on the model variables for iteration <script type="math/tex">t+1</script>. In the context of the single update (regardless of current iteration), <script type="math/tex">p()</script> is always the initial prior, and <script type="math/tex">q()</script> is the best posterior-so-far. The VI concept is that we can iteratively improve the posteriors <script type="math/tex">q()</script>, but always in the context of a <em>single</em> Bayesian update.</p>
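<p>Evaluating these Gamma terms in code requires the standard identities <script type="math/tex">E_{q(\lambda)}[\lambda] = e/f</script> and <script type="math/tex">E_{q(\lambda)}[ln\lambda] = \psi(e) - lnf</script> (the latter, involving the digamma function, is not derived in the post). A sketch, assuming SciPy is available:</p>

```python
import numpy as np
from scipy.special import gammaln, digamma

def gamma_elbo_terms(e0, f0, e, f):
    """Evaluate E_q[ln p(lambda | e0, f0)] - E_q[ln q(lambda)], the two
    Gamma terms of L combined above, where q(lambda) = Gamma(e, f) with
    rate parameterization. Uses E_q[lambda] = e/f and
    E_q[ln lambda] = digamma(e) - ln(f)."""
    E_lam = e / f
    E_ln_lam = digamma(e) - np.log(f)
    term_p = e0 * np.log(f0) - gammaln(e0) + (e0 - 1) * E_ln_lam - f0 * E_lam
    term_q = e * np.log(f) - gammaln(e) + (e - 1) * E_ln_lam - f * E_lam
    return term_p - term_q

# When q matches the prior (e = e0, f = f0), the difference is zero;
# otherwise it equals -KL(q || p) and is negative.
```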
<h2 id="final-thoughts">Final Thoughts</h2>
<p>We have presented Variational Inference, in a hopefully accessible manner. It is a very slick technique that I am excited to continue to gain skill in applying. VI is the inference technique which underlies <a href="http://www.columbia.edu/~jwp2128/Teaching/E6892/papers/LDA.pdf">Latent Dirichlet Allocation</a>, a very popular learning algorithm developed by David Blei (now at Columbia!), Andrew Ng, and Michael Jordan, all Machine Learning heavyweights, while the former was at Cal (Go Bears)!</p>
<p>Earlier, we mentioned that we wanted to encourage sparsity in <script type="math/tex">w</script>. This can be accomplished (so I am told) by setting the priors on <script type="math/tex">\alpha_k</script> to <script type="math/tex">a_0, b_{0k} = 10^{-16}</script>. Tiny priors here will limit the dimensions of <script type="math/tex">w</script> which are significantly non-zero. I’m not entirely sure why (something I still need to look into), but I can assure you that my model was super sparse :).</p>
<p>Variational Inference is a fairly sophisticated technique (the most complex algorithm I have encountered, but that might not count for much), and allows for the formal definition and learning of complex posteriors otherwise intractable using normal Bayesian methods.</p>
Sun, 22 Nov 2015 00:00:00 +0000
http://kronosapiens.github.io/blog/2015/11/22/understanding-variational-inference.html
http://kronosapiens.github.io/blog/2015/11/22/understanding-variational-inference.htmlbayesstatisticsmachine-learningdata-scienceblogOpsWorks, Flask, and Chef<p><em>This post is part two of a two-part series about deploying a Flask app on an Amazon EC2 instance running Ubuntu. You can read part one <a href="http://kronosapiens.github.io/blog/2015/08/11/understanding-unix-permissions.html">here</a></em>.</p>
<p>Today we conquered the kitchen.</p>
<p>We have used AWS OpsWorks and Chef to successfully configure and deploy a Flask app, and to redeploy the app after every GitHub commit. Now <a href="http://thena.io">Thena</a> can develop at significantly higher speeds.</p>
<h2 id="setup">Setup</h2>
<p>Let us consider the pieces in play:</p>
<ol>
<li>Thena, a <a href="http://flask.pocoo.org/">Flask</a>-based web application.</li>
<li><a href="https://aws.amazon.com/opsworks/">AWS OpsWorks</a>, an “application management service”.</li>
<li><a href="https://www.chef.io/">Chef</a>, a configuration management tool.</li>
<li>The correct environment and infrastructure for Thena, call it <script type="math/tex">\theta</script>.</li>
<li>A micro EC2 instance backed by a small Postgres database.</li>
</ol>
<p>Our goal is to configure the EC2 instance to be able to serve the app, and to automate all infrastructure such that new features can be deployed in real-time.</p>
<p>The first challenge is to successfully automate setup and deploy. This requires the interaction of five components: the EC2 instance, GitHub, the Chef process, the nginx process, and the uWSGI process.</p>
<p>We will need to do the following:</p>
<ol>
<li>Install all necessary Ubuntu packages on the EC2 instance.</li>
<li>Create (and update) all directories for application, configuration, and log files.</li>
<li>Create (and update) all necessary configuration files.</li>
<li>Set all necessary file and directory permissions.</li>
<li>Enable and start the nginx process.</li>
<li>Connect to GitHub and download the most recent application code.</li>
<li>Update any Python packages.</li>
<li>Restart the uWSGI process.</li>
<li>Restart nginx.</li>
</ol>
<p>Steps 1-5 are known as the “Configure” phase. Steps 6-9 are known as the “Deploy” phase. The Configure commands are meant to be run once, when an EC2 instance is first brought online – this configuration is not expected to change much over the life of the application. The Deploy commands are meant to be run arbitrarily often – in this case, every time code is committed to GitHub.</p>
<h2 id="enter-the-kitchen">Enter the Kitchen</h2>
<p>We use the Chef process to execute steps 1-9, both the Configure and the Deploy phases. Chef is a tool which uses “cookbooks” to learn how to bring nodes (in this case, EC2 instances) into proper alignment (formally, to create state-of-affairs <script type="math/tex">\theta</script>). In order to configure Chef, we first have to write a cookbook.</p>
<p>Our cookbook, <code class="highlighter-rouge">thena-infra</code> is structured as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>thena-infra/
attributes/
default.rb
recipes/
configure.rb
default.rb
deploy.rb
templates/
default/
thena-nginx.erb
thena-uwsgi.conf.erb
thena-uwsgi.ini.erb
metadata.rb
Berksfile
.kitchen.yml
</code></pre>
</div>
<p>These files and directories play various roles, as follows:</p>
<ul>
<li>
<p>The <code class="highlighter-rouge">attributes</code> directory holds attribute files. These files define the constants which will be used in the cookbook. Things like application paths, logfiles, ports, and the like are defined here.</p>
</li>
<li>
<p>The <code class="highlighter-rouge">recipes</code> directory contains the substance of the cookbook. This is where you specify the actual configuration that you want. With Chef, you define configuration using things called “resources”. To Chef, everything is a resource – files, processes, commands, packages, etc.</p>
</li>
<li>
<p>The <code class="highlighter-rouge">templates</code> directory contains templates for all of the specific files you want Chef to write to the node (our EC2 instance). All templates have the <code class="highlighter-rouge">.erb</code> extension, short for “Embedded Ruby” – allowing Chef to insert context-relevant data into the template before writing it to the node. You can think of these similarly to templates for web applications.</p>
</li>
<li>
<p>The <code class="highlighter-rouge">metadata.rb</code> file provides, unsurprisingly, metadata about the cookbook. This includes the name, author, and version of the cookbook, as well as the declaration of any <strong>dependencies</strong>.</p>
</li>
<li>
<p>The <code class="highlighter-rouge">Berksfile</code> is the file used by Berkshelf, Chef’s dependency manager. Berkshelf is responsible for downloading any dependency cookbooks, and Berksfile is where we define those dependencies and where Berkshelf should look for the cookbooks (in our case, we simply tell Berkshelf to check <code class="highlighter-rouge">metadata.rb</code> to see what cookbooks are needed).</p>
</li>
<li>
<p>The <code class="highlighter-rouge">.kitchen.yml</code> file is used to configure Test Kitchen, Chef’s test harness.</p>
</li>
</ul>
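<p>To make the “everything is a resource” idea concrete, here is a sketch of what a recipe can look like. The resource names and paths below are hypothetical – this is not taken from the actual <code class="highlighter-rouge">thena-infra</code> recipes, and it only runs under chef-client:</p>

```ruby
# Hypothetical sketch of Chef resources (not the actual thena-infra recipes).
# Each block declares a desired state; chef-client makes the node match it.

package 'nginx'                        # an apt package is a resource

directory '/srv/www/thena' do          # so is a directory...
  owner 'ubuntu'
  recursive true
end

template '/etc/nginx/sites-enabled/thena' do
  source 'thena-nginx.erb'             # ...or a file rendered from a template
end

service 'nginx' do                     # ...or a running process
  action [:enable, :start]
end
```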
<p>There will likely be other files in a cookbook, but we will concern ourselves with these for now.</p>
<p>Developing a cookbook is similar to developing any piece of software – it requires a tight feedback loop. To get this feedback, we used <a href="https://learn.chef.io/local-development/rhel/get-started-with-test-kitchen/">Test Kitchen</a>.</p>
<p>Test Kitchen is a tool that lets you start virtual machines locally, execute your Chef recipes on that VM, and then check the VM to confirm that everything is in order. It’s an easy-to-use tool that was very helpful when developing the <code class="highlighter-rouge">thena-infra</code> cookbook.</p>
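<p>For reference, a minimal <code class="highlighter-rouge">.kitchen.yml</code> for a cookbook like this might look roughly as follows (illustrative values, not the actual Thena file):</p>

```yaml
driver:
  name: vagrant            # boot throwaway local VMs via Vagrant

provisioner:
  name: chef_solo          # run the cookbook on the VM with chef-solo

platforms:
  - name: ubuntu-14.04     # match the OS of the target EC2 instance

suites:
  - name: default
    run_list:
      - recipe[thena-infra::default]
```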
<p>With Test Kitchen, I was able to develop the cookbook up to the point of launching a correctly-configured webserver serving a generic “Hello World” message. As the nginx process was listening on port 80, I was able to test the cookbook as follows:</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code>ƒ: kitchen converge
&lt;output logs suppressed&gt;
ƒ: kitchen login
Welcome to Ubuntu 14.04.1 LTS <span class="o">(</span>GNU/Linux 3.13.0-24-generic x86_64<span class="o">)</span>
<span class="k">*</span> Documentation: https://help.ubuntu.com/
Last login: Tue Nov 10 16:51:41 2015 from 10.0.2.2
<span class="gp">vagrant@default-ubuntu-1404:~$ </span>curl http://0.0.0.0:80/
Hello Thena
</code></pre>
</div>
<p>The cookbook works!</p>
<h2 id="to-the-cloud">To the Cloud</h2>
<p>We’ve developed a cookbook which runs great locally. Now we need to get that cookbook onto AWS OpsWorks, and to update the cookbook to pull the application code from GitHub and serve that (instead of serving the generic message).</p>
<p>The first step will be to learn how to upload cookbooks to AWS OpsWorks. This ended up being a bit tricky, since the repository structure <a href="http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-installingcustom-repo.html">that AWS expects</a> is different from the one <a href="https://docs.chef.io/chef_repo.html">Chef uses</a> for its enterprise server product.</p>
<p>Specifically, AWS expects:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>chef-repo/
cookbook_a/
cookbook_b/
...
Berksfile
</code></pre>
</div>
<p>While Chef enterprise uses:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>chef-repo/
cookbooks/
cookbook_a/
Berksfile
cookbook_b/
Berksfile
...
...
</code></pre>
</div>
<p>The differences are the location of the cookbooks and of the Berksfile(s). Having developed the <code class="highlighter-rouge">thena-infra</code> cookbook using the Chef-recommended structure, I ran into a few errors when trying to upload the cookbook to AWS. I eventually got the <a href="https://github.com/kronosapiens/chef-repo">cookbook working</a> as follows:</p>
<ol>
<li>Placed the cookbook at the top of <code class="highlighter-rouge">chef-repo/</code>, rather than nested inside <code class="highlighter-rouge">cookbooks/</code>.</li>
<li>Moved the Berkshelf file to the top of <code class="highlighter-rouge">chef-repo/</code> and hard-coded the dependencies (because it was no longer possible to reference <code class="highlighter-rouge">metadata.rb</code>).</li>
</ol>
<p>I wasn’t thrilled about change 2, since it violates single-source-of-truth and will require me to update dependencies in two places. As I add more cookbooks, I will have to record all of their dependencies in the shared Berksfile, which is also not ideal. It is sufficient for now, but in the future it may be worth returning to this.</p>
<p>With this, AWS OpsWorks now has the cookbook.</p>
<p>Now comes time to test OpsWorks’ ability to pull code from GitHub. OpsWorks provides some code you can use in your recipes to pull code from GitHub, using configuration from OpsWorks itself (which you set via the AWS Console). This isn’t something that can easily be tested using Test Kitchen, so from here on out we’re testing directly in OpsWorks, on our actual EC2 instance. OpsWorks makes it pretty easy to execute specific recipes via the console interface, which is what we’ll be doing.</p>
<p>For reference, here’s the boilerplate from AWS, meant to be copy-pasted directly into a deploy recipe.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">include_recipe</span> <span class="s">'deploy'</span>
<span class="n">node</span><span class="p">[:</span><span class="n">deploy</span><span class="p">]</span><span class="o">.</span><span class="n">each</span> <span class="n">do</span> <span class="o">|</span><span class="n">application</span><span class="p">,</span> <span class="n">deploy</span><span class="o">|</span>
<span class="n">opsworks_deploy_dir</span> <span class="n">do</span>
<span class="n">user</span> <span class="n">deploy</span><span class="p">[:</span><span class="n">user</span><span class="p">]</span>
<span class="n">group</span> <span class="n">deploy</span><span class="p">[:</span><span class="n">group</span><span class="p">]</span>
<span class="n">path</span> <span class="n">deploy</span><span class="p">[:</span><span class="n">deploy_to</span><span class="p">]</span>
<span class="n">end</span>
<span class="n">opsworks_deploy</span> <span class="n">do</span>
<span class="n">deploy_data</span> <span class="n">deploy</span>
<span class="n">app</span> <span class="n">application</span>
<span class="n">end</span>
<span class="n">end</span></code></pre></figure>
<p>And here is <a href="http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-json.html#workingcookbook-json-deploy">AWS’s guide</a> to deployment attributes.</p>
<p>Before going further, let’s take a moment to discuss how OpsWorks uses Chef. Chef is run as root, but can be configured to take specific actions as a specific user. When OpsWorks is executing deploy recipes, for example, commands are by default run as the <code class="highlighter-rouge">deploy</code> user. Chef allows you to specify, for any given resource, what user should be associated with that resource. If you want to start the uWSGI process as the <code class="highlighter-rouge">ubuntu</code> user, for example, you will specify this in the recipe.</p>
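<p>As a sketch (hypothetical command and paths, and again only runnable under chef-client), pinning a user to a single resource looks like this:</p>

```ruby
# Hypothetical: run the uWSGI start command as the ubuntu user,
# even though chef-client itself runs as root.
execute 'start uwsgi' do
  command 'uwsgi --ini thena-uwsgi.ini'
  cwd '/srv/www/thena/current/thena'
  user 'ubuntu'
end
```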
<p>Executing the cookbook for the first time yields an error in accessing GitHub. I look online and see that someone has managed to circumvent the error by telling OpsWorks to download the repo using HTTPS instead of SSH. I make the change and the code downloads without a problem. I’m planning on circling back around to this later – I anticipate the problem was that OpsWorks was trying to SSH to GitHub as the <code class="highlighter-rouge">deploy</code> user, which doesn’t have a public key set up on GitHub.</p>
<p>I run the cookbook again. Another error – uWSGI isn’t being restarted. I look at the logs and realize that Chef is trying to start uWSGI with <code class="highlighter-rouge">init</code>, the <a href="https://docs.chef.io/resource_service.html">default provider</a> for the service resource. I’m planning on using <a href="http://upstart.ubuntu.com/">Upstart</a> to manage the uWSGI process, so I add <a href="https://github.com/kronosapiens/chef-repo/blob/master/thena-infra/recipes/deploy.rb#L45">a line to the resource</a> specifying Upstart as the provider.</p>
<p>I run the cookbook again. No errors!</p>
<h2 id="debugging">Debugging</h2>
<p>Once the <code class="highlighter-rouge">deploy</code> recipe is run for the first time, I SSH into the instance and poke around. The most recent code is in <code class="highlighter-rouge">/srv/www/thena/current/thena/</code>, exactly where it should be. I check for webserver processes:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>ubuntu@aether:~$ ps -aux | grep nginx
root 30369 0.0 0.1 85880 1388 ? Ss Nov11 0:00 nginx: master process /usr/sbin/nginx
www-data 30370 0.0 0.1 86264 1816 ? S Nov11 0:00 nginx: worker process
www-data 30371 0.0 0.1 86264 1816 ? S Nov11 0:00 nginx: worker process
www-data 30372 0.0 0.2 86264 2308 ? S Nov11 0:00 nginx: worker process
www-data 30373 0.0 0.1 86264 1816 ? S Nov11 0:00 nginx: worker process
ubuntu 30809 0.0 0.0 10460 916 pts/0 S+ 00:15 0:00 grep nginx
ubuntu@aether:~$ ps -aux | grep uwsgi
ubuntu 30336 0.0 0.7 47772 7352 ? Ss Nov11 0:00 uwsgi --ini thena-uwsgi.ini
ubuntu 30351 0.0 2.9 156764 30368 ? S Nov11 0:00 uwsgi --ini thena-uwsgi.ini
ubuntu 30352 0.0 2.9 156276 29820 ? S Nov11 0:00 uwsgi --ini thena-uwsgi.ini
ubuntu 30353 0.0 3.0 156632 30500 ? S Nov11 0:00 uwsgi --ini thena-uwsgi.ini
ubuntu 30786 0.0 0.0 10460 912 pts/0 S+ 00:14 0:00 grep uwsgi
</code></pre>
</div>
<p>Excellent, the webserver is running. As a test, I kill the uWSGI processes and run the <code class="highlighter-rouge">deploy</code> command from the OpsWorks console. uWSGI is brought back up. More excellence. I hop onto Chrome to test the site.</p>
<p>Nothing. 502 errors every time. Frustrating. I check the uWSGI logs:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Traceback (most recent call last):
...
File "./app/models.py", line 44, in load_user
return User.query.get(int(user_id))
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 454, in __get__
return type.query_class(mapper, session=self.sa.session())
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/scoping.py", line 71, in __call__
return self.registry()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/_collections.py", line 988, in __call__
return self.registry.setdefault(key, self.createfunc())
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 704, in create_session
return SignallingSession(self, **options)
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 157, in __init__
bind = options.pop('bind', None) or db.engine
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 816, in engine
return self.get_engine(self.get_app())
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 833, in get_engine
return connector.get_engine()
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 496, in get_engine
self._sa.apply_driver_hacks(self._app, info, options)
File "/usr/local/lib/python2.7/dist-packages/flask_sqlalchemy/__init__.py", line 775, in apply_driver_hacks
if info.drivername.startswith('mysql'):
AttributeError: 'NoneType' object has no attribute 'drivername'
[pid: 21370|app: 0|req: 1/1] 129.236.232.24 () {44 vars in 955 bytes} [Mon Nov 9 20:52:44 2015] GET / => generated 0 bytes in 11 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 0)
</code></pre>
</div>
<p>Wat. None of these words look familiar. Seems something databasey. I plug the error into Google; the results suggest a configuration error. Everything had been running fine before! (I had previously set up the server manually, and it had been running successfully for a few months without any problems.) I know things <strong>had</strong> been working. What’s changed?</p>
<p>First, I replace <code class="highlighter-rouge">wsgi.py</code> (the module which wraps the whole Flask app in a callable that is passed to the uWSGI process) with the same hard-coded “Hello World!” I had used for local development. I restart the webserver and load http://thena.io. I see the “Hello World!” message, confirming that the issue is not with nginx or uWSGI, but rather with Flask. This is progress.</p>
<p>This smells like a user/permissions issue. I SSH into the instance to see if I can connect to the database manually. I can – so I know the database is up and accessible. It must be that the database isn’t getting configured correctly when starting uWSGI.</p>
<p>I check the Flask config file, where I see this line:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">ProductionConfig</span><span class="p">(</span><span class="n">Config</span><span class="p">):</span>
<span class="n">SQLALCHEMY_DATABASE_URI</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'DATABASE_URL'</span><span class="p">)</span></code></pre></figure>
<p>So, the database is coming from the environment. I check the environmental variables, and see that DATABASE_URL is defined just fine. Hmm. <a href="https://www.digitalocean.com/community/tutorials/how-to-read-and-set-environmental-and-shell-variables-on-a-linux-vps">I decide to learn about environmental variables</a>. I learn that environmental variables are pretty temporary. They do not persist between shell sessions (unless saved in <code class="highlighter-rouge">.bash_profile</code>, <code class="highlighter-rouge">/etc/environment</code>, or a similar location, so they can be initialized at the start of every shell session). They are passed from parents to children, but never the other way (so if you define a variable in a child process, the parent process will not have access to it). Running commands as sudo resets the environment (in the context of that one command), so commands run as sudo cannot access variables defined for the logged-in user.</p>
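<p>The inheritance rules above can be checked with a quick experiment. This is a toy Python sketch (not part of the app); the URL is a made-up placeholder:</p>

```python
import os
import subprocess
import sys

# Simulate defining DATABASE_URL in the current (parent) process.
os.environ["DATABASE_URL"] = "postgres://thena:secret@localhost/thena"

child = [sys.executable, "-c",
         "import os; print(os.environ.get('DATABASE_URL', 'unset'))"]

# A child process inherits the parent's environment...
inherited = subprocess.run(child, capture_output=True, text=True).stdout.strip()

# ...but a process started with a scrubbed environment (roughly what
# sudo or an init-managed service gets) never sees the variable.
clean = subprocess.run(child, env={}, capture_output=True, text=True).stdout.strip()

print(inherited)  # postgres://thena:secret@localhost/thena
print(clean)      # unset
```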
<p>I do some experiments, and realize that <code class="highlighter-rouge">DATABASE_URL</code> is defined for the <code class="highlighter-rouge">ubuntu</code> user, but <strong>not</strong> for <code class="highlighter-rouge">root</code>. I suspect that when Chef restarts the uWSGI process, it is running the command as <code class="highlighter-rouge">root</code> and therefore <code class="highlighter-rouge">DATABASE_URL</code> is not defined in that environment. I check this by hard-coding the database url in <code class="highlighter-rouge">config.py</code> and restarting uWSGI.</p>
<p>The site works! I am relieved. Now the challenge is to figure out an elegant way to pass the database info to uWSGI. Hard-coding it into <code class="highlighter-rouge">config.py</code> is not an option, for security reasons. It must come from the environment in some way. There should be as small a gap as possible between introducing the variable and starting the uWSGI process, to minimize the chance of the bug returning due to some minor unrelated change. I do a search on “environment variables uwsgi” and learn that you can specify environmental variables in a <code class="highlighter-rouge">.ini</code> file. I read the OpsWorks documentation and realize that environmental variables defined inside the OpsWorks console are available as attributes for Chef. My solution is to add the following line to my template <code class="highlighter-rouge">thena-uwsgi.ini.erb</code>:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">env</span> <span class="no">DATABASE_URL</span><span class="o">=</span><span class="s2">"<%= node['deploy']['thena']['environment_variables']['DATABASE_URL'] %>"</span></code></pre></figure>
<p>Seems like a good solution. Since we are always going to be starting uWSGI through Upstart, adding the variable to the Upstart script seems like the perfect location. It will be hard-coded in the file (but <em>not</em> in the cookbook), meaning the variable will be available to uWSGI <em>regardless</em> of what user is actually starting uWSGI.</p>
<p>I update the cookbook and re-deploy. It works! Hooray.</p>
<p>I click a few links on the site; everything seems to be working nicely. I go to the homepage and immediately get another 502! I refresh and the site loads fine. WHAT IS THIS? I check the uWSGI logs again and see this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Traceback (most recent call last):
...
File "./app/main/views.py", line 22, in index
num_arcs = Arc.query.count()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2735, in count
return self.from_self(col).scalar()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2504, in scalar
ret = self.one()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2473, in one
ret = list(self)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2516, in __iter__
return self._execute_and_instances(context)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2531, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 914, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
context)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
exc_info
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
context)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 450, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected
[SQL: 'SELECT count(*) AS count_1 \nFROM (SELECT arcs.id AS arcs_id, arcs.created_at AS arcs_created_at, arcs.updated_at AS arcs_updated_at, arcs.user_id AS arcs_user_id, arcs.tail AS arcs_tail, arcs.head AS arcs_head, arcs.tail_url AS arcs_tail_url, arcs.head_url AS arcs_head_url \nFROM arcs) AS anon_1']
[pid: 29957|app: 0|req: 33/225] 184.152.70.185 () {44 vars in 955 bytes} [Wed Nov 11 23:55:21 2015] GET / => generated 0 bytes in 9 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 0)
</code></pre>
</div>
<p>I google the error and learn that this is a well-known bug when using uWSGI, Flask, and psycopg2 (the Postgres python driver), involving a feature called <a href="https://en.wikipedia.org/wiki/Copy-on-write">“Copy on Write”</a>.</p>
<p>When uWSGI starts, it begins as a master process, and then forks off into some number of child processes (in my case, 5). As a memory-saving optimization, uWSGI will write the application to memory once, and then spin off child processes which all read from one shared version of the app. Only when the child process needs to actually <em>write</em> some information (as opposed to read) does it take its own space in memory. This feature saves memory by ensuring that parts of the application which are static and read-only are not duplicated unnecessarily.</p>
<p>From the uWSGI docs themselves:</p>
<blockquote>
<p>uWSGI tries to (ab)use the Copy On Write semantics of the fork() call whenever possible. By default it will fork after having loaded your applications to share as much of their memory as possible. If this behavior is undesirable for some reason, use the lazy-apps option. This will instruct uWSGI to load the applications after each worker’s fork().</p>
</blockquote>
<p>The bug was due to the fact that the database connection pool (a fixed number of connections to the database, which are requested and relinquished as needed) was being created once and then shared by all of the child uWSGI processes. I am uncertain as to the specific failure, but this sharing is unintentional and was causing these requests to trip on each other. The answer was to update the <code class="highlighter-rouge">thena-uwsgi.ini</code> to add <code class="highlighter-rouge">lazy-apps = true</code>. This setting causes each child process to load the entire app from scratch, ensuring each process has a dedicated connection pool. Slightly less memory-efficient, but more stable.</p>
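<p>The copy-on-write semantics from the quote above can be seen in a toy sketch (illustrative only; <code class="highlighter-rouge">os.fork</code> is POSIX-only). Memory writes after a fork are private to each process – but OS-level resources like the pool's open sockets remain shared file descriptors across the fork, which is why a pool built pre-fork misbehaves and why building the app after the fork avoids the problem:</p>

```python
import os

# Built once in the "master" process, before forking --
# analogous to an app (and its pool) loaded pre-fork.
pool = {"connections": ["conn-0"]}

pid = os.fork()
if pid == 0:
    # The child starts with a copy-on-write view of the parent's memory.
    # This write lands in the child's private copy of the page...
    pool["connections"].append("conn-1")
    os._exit(0)

os.waitpid(pid, 0)
# ...so the parent (and any sibling worker) still sees the original state.
print(len(pool["connections"]))  # 1
```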
<p>I update the cookbook and re-deploy. Everything works great again!</p>
<h2 id="final-touches">Final Touches</h2>
<p>We can now configure and deploy an EC2 instance to serve Thena at the touch of a button. But to touch that button, we still have to log in to OpsWorks. What if we could deploy the app every time we made a commit? Well, we can!</p>
<p>GitHub has an <a href="https://github.com/integrations">integrations</a> feature which makes this easy. We go to the <a href="https://github.com/kronosapiens/thena">Thena repository</a>, and under settings, we set up an integration with OpsWorks. It’s pretty simple: you plug in the OpsWorks “stack” and “app” IDs, which you can find in the descriptions of the stack and app, respectively. Then you pass in an AWS access key ID and secret access key (which you can generate via <a href="https://console.aws.amazon.com/iam/home?#security_credential">AWS identity management</a>). That’s it! Every time you push to the repository, GitHub will ping OpsWorks and start a deploy.</p>
<p>And with that, we have built <strong>a full automatic deployment pipeline</strong>, using a minimal Chef cookbook that we wrote ourselves, in which we understand the purpose of every setting. This is the level of control and reliability that can serve as a solid and extensible foundation for whatever comes next. We can now devote 100% of our attention to the development of Thena, confident in the knowledge that the site will keep itself up and running.</p>
<p>What an adventure this has been!</p>
<p>Open questions:</p>
<ol>
<li>
<p>Why was OpsWorks unable to fetch the repo from GitHub using SSH? The HTTPS workaround will be sufficient as long as we are fetching public repositories. If we ever have to serve a private repo, we’ll need SSH access.</p>
</li>
<li>
<p>Why was nginx returning a 502 error when uWSGI was returning a 500? Seems peculiar.</p>
</li>
<li>
<p>Is it possible to set up <code class="highlighter-rouge">chef-repo</code> to have a per-cookbook Berksfile, rather than a single shared Berksfile?</p>
</li>
</ol>
<p><strong>You can see the full cookbook in all its glory <a href="https://github.com/kronosapiens/chef-repo/tree/master/thena-infra">here</a>.</strong></p>
Tue, 10 Nov 2015 00:00:00 +0000
http://kronosapiens.github.io/blog/2015/11/10/opsworks-flask-and-chef.html
opsworks, flask, chef, infrastructure, blog

On Learning Some Math<p>In late May, I left my startup job to study math.</p>
<p>I had been working at the startup for the past eight months, having started in mid-September of the year before. I was one of their two backend engineers, teaming up with a remote developer in China to build out the backend for the company’s product, an iPhone game. They hired me because I aced their coding test, and I ended up being directly responsible for the majority of the company’s client-serving backend. It was demanding, it pushed me, I liked it.</p>
<p>I was also starting out as a part-time student in Columbia’s QMSS program, in their data science concentration. It was supposed to be a challenging track – I was warned that most students drop out. It sounded exciting. I knew my math background was relatively weak – one semester of calculus, plus discrete math. I had been fully out of school for two years at that point, so I thought I’d take it slowly and sign up for two classes. Probability and Statistics. Social Science Theory and Methods. Plus the new startup job, and the non-profit I help run. Nothing could go wrong with this plan.</p>
<p>I am occasionally naive.</p>
<p>The year was rough. 80-hour-weeks-every-week rough. I drop into a calculus-based statistics course without really knowing that much calculus. I’m pretty sure that the first time I had to actually solve an integral, it was a double. Things are coming at us pretty fast. The midterm average is less than 40%. I contemplate the aesthetics of the letter “F”. I hustle and struggle and put in the time. Random variables, probability distributions. Independence, expectation. Samples, confidence, hypotheses. I get good at breaking big problems into smaller ones. Things come together. Everything starts to make sense. I get an A. It was encouraging.</p>
<p>Thanks, Professor Cunningham.</p>
<p>I didn’t like the Theories and Methods course. Mostly because the professor thought he was too important to bother preparing his lectures. I match his effort. Get an A-. By a point. I decide that’s fair. Social science is stepping into the background for me anyway.</p>
<p>I sign up for Data Visualization in the Spring. Only one class, plus a casual once-weekly seminar. I am hoping for a slightly more relaxed semester. I went in to Data Visualization the first day and immediately realized the class was not worth the money. I only had a few semesters of grad school, and I wasn’t going to waste time. I transfer into Machine Learning. It seems the semester will be slightly less relaxed.</p>
<p>My company is getting tired of my academic activities. I didn’t particularly think it was their business what I did after hours. I was exchanging ~50 hours a week of diligent labor for an 80k salary, which I thought was actually pretty fair. I am pulling a lot of weight. They do not invest in my professional development. They are not interested in my thoughts on the product. They suggest that I start coming in on weekends. I know I have the smallest equity stake out of anyone there. There is no opportunity for growth. They are not making a strong case. I am wondering what lessons I should be taking away from this. I contemplate the <a href="http://www.strategy-business.com/article/00366">increasing presence of Millennials</a> in the workforce.</p>
<p>I missed the first ML lecture, due to aforementioned schedule switch. I walk in fifteen minutes late to the second lecture, having gotten lost trying to find the room. I take a seat near the back, finding a spot next to a rebellious-looking kid straight out of hipster Brooklyn. I get out my notebook (an extra-large unlined Moleskine, my signature), and a pencil.</p>
<p>I look at the board and the first thing I see is some sort of upside-down triangle. The professor is talking about gradients and projections. I have no idea what is going on. For a few minutes I strongly consider walking out.</p>
<p>I stick it out. I basically understand what is going on. The professor is very good. I can visualize shapes in three-dimensional space. I know how to square things. I calm down. The kid next to me is funny and interesting.</p>
<p>I end up really liking Machine Learning. It is like being in wizard school. There are many interesting ways to turn data into different data that is somewhat more actionable. I use special symbols to make knowledge appear out of thin air. Hyperplanes. Prediction. Kernels. Boosting. I am very pleased with the whole thing.</p>
<p>I get another A. I am heartened. Thanks, Paisley.</p>
<p>I am starting to think that this math and computer thing is important and that I should start doing more of it. My company is reaching the same conclusion. We part ways in May, a calm and appropriate separation. I had managed to cover most of the year’s tuition as well as replenish most of my savings, so I was feeling comfortable financially. I easily had five months rent just lying around. I feel free.</p>
<p>I decide it’s time to learn math properly. Hustling through two semesters of grad school was fine, but I want to actually be good at this. I needed to know what things were, and how they worked. I wanted to learn the math that I could have learned in college, if I had been a bit more mature. I thought back to the Berkeley undergraduate math sequence, specifically the lower-division requirements. 1A: Differential Calculus. 1B: Integral Calculus. 53: Multivariable Calculus. 54: Linear Algebra and Differential Equations, 55: Discrete Math.</p>
<p>I decide that I want all of it. I had taken 1A as a freshman, driven by a certain exploratory impulse that was later attenuated by a pre-law, GPA-protecting pragmatism. I took Math 55 as a senior, as one of the “hard classes” for my Cognitive Science major. Up until this year, I had had a fear of math as something which I would not easily be good at. I had been lazy about math in high school; I hadn’t yet seen the point.</p>
<p>I get to it. I plan to go to school full-time in September. It is late May. I have about three months. I need to build a foundation. I am not interested in paying for undergraduate classes. A formal course would also be far too slow. I’ve heard of the internet. I’m motivated.</p>
<p>I start at <a href="https://www.khanacademy.org/">Khan Academy</a>. Their automated assessment politely suggests I review some precalculus. I am disheartened but resolved. I had copied my friend Stephen’s math homework all throughout 11th grade. Khan Academy is correct.</p>
<p>Salman Khan walks me from the unit circle all the way through Integral Calculus. Triangles, sinusoids, Taylor series, implicit differentiation, integration by parts, complex numbers, the works. I am embarrassed at first – I am learning material being taught to high schoolers ten years my junior. But this is the path. I take every test, solve every problem that comes my way. (I admit to skipping the test for L’Hôpital’s rule.)</p>
<p>Thanks, Sal.</p>
<p><img src="https://s3.amazonaws.com/kronosapiens.github.io/images/khan.png" alt="Khan Academy" /></p>
<p>The whole thing takes about a month. I finish everything through single-variable calculus in late June. I decide to start multivariable. I begin looking at Khan Academy’s offerings, but find them underdeveloped. It is time for something more traditional. Enter <a href="http://ocw.mit.edu/index.htm">MIT OpenCourseWare</a>.</p>
<p>MIT foresaw the MOOC revolution at least ten years early, having begun putting lectures and course materials online in 1999. Browsing the web for multivariable calculus courses, I find videos of the entire lecture series for MIT 18.02, Multivariable Calculus, taught in Fall 2007 by Denis Auroux. This is right.</p>
<p>I am able to work through 2-3 lectures per day. A 50-minute lecture takes me about 2-3 hours to digest. I do not skip days, although on weekends I go a bit easy. I am very happy about multivariable calculus. Vectors make sense, as do their gradients. Level sets. Divergence. Flux. Curl. I get used to drawing in three dimensions. I do a lot of integrals. All of a sudden there are theorems. I draw a lot of shapes. I like the way the pencil taps against the notebook when I am solving equations quickly. I develop certain writing flourishes. Denis Auroux is a funny lecturer. I like theorems.</p>
<p>I continue to take all the tests. I take my time with them. I enjoy them. I do pretty well.</p>
<p>Thanks, Denis.</p>
<p><img src="https://s3.amazonaws.com/kronosapiens.github.io/images/auroux.png" alt="Denis Auroux" /></p>
<p>I finish in about three weeks. It’s time for Linear Algebra. I look online again and end up back at MIT OCW, this time at Gilbert Strang’s 18.06, recorded Fall 1999.</p>
<p>I love linear algebra. I love Gilbert Strang. I feel transported to that classroom in 1999, learning about vector spaces and orthogonality and rank. The relationships between the various mathematical objects are rich and profound. Machine Learning is making more sense. I am intrigued by the properties of determinants.</p>
<p>I continue to process 2-3 lectures per day. In early August, I fly back to California for five weeks: three in Santa Monica, two in Oakland, one in Black Rock City. Not a bad itinerary. My time in Santa Monica is spent riding my bike, hanging out with my parents, and finding eigenvectors. It is exceedingly pleasant. Gilbert Strang is an amazing lecturer. I multiply a lot of matrices. “Again,” I tell myself, starting on a practice problem. “Again. Again.” I develop intuitions.</p>
<p>By the time I get to Oakland, I have only three lectures and the final left to take. I find it markedly more difficult to focus once I arrive – the excitement, the reunions with friends, the adventure of being back in the Bay all prove distracting. I struggle, but eventually power through and finish. I spend six hours on the final, sitting in a coffee shop on Telegraph Avenue. Again, I do well.</p>
<p>Thanks, Gilbert.</p>
<p><img src="https://s3.amazonaws.com/kronosapiens.github.io/images/strang.png" alt="Gilbert Strang" /></p>
<p>I want to take a moment and acknowledge the debt owed to these three men, and the teams which organized and published their content. The education I received would not have been possible ten years ago; someone in my position would have faced significantly greater obstacles. I feel a closeness to them, as though I had truly been their student. Thanks.</p>
<p>I get back to Brooklyn just in time for the second day of school. My first class is <a href="http://www.cs.columbia.edu/~djhsu/coms4772-f15/">Advanced Machine Learning</a>. The first lecture: a “calibration quiz” meant to prune the class down to size. I am anxious. This class is important to me. I get in.</p>
<p>The next chapter begins.</p>
Sun, 25 Oct 2015 00:00:00 +0000
http://kronosapiens.github.io/blog/2015/10/25/on-learning-some-math.html