Abacus
I'm Daniel Kronovet, a data scientist living in New York City.
http://kronosapiens.github.io/
Tue, 19 Sep 2017 04:10:00 +0000

Time and Authority

<p>I.</p>
<p>Living together, humans exchange ideas.</p>
<p>Some of those ideas have staying power. Views concerning
good and evil and right conduct have been around for millennia.
Notions of individual freedom and equal rights have been around for centuries.
Concepts of gender and racial equality, decades.</p>
<p>Ideas come in and out of fashion; some last longer than
others. Not all of them are good, and sometimes we have to let them go.</p>
<p>Still, we expect people to stick to their beliefs: especially our leaders and
people we depend on. It is destructive when leaders renege on
important issues to suit their immediate needs.</p>
<p>At the same time, we don’t want people to feel like they must perform
their beliefs under external pressure. As we change, our attitudes change,
and our expression of those attitudes should be free to change with them.</p>
<p>We want freedom. We want stability. How do we balance self and society?
If something is good, will it last? If something lasts, is it good?</p>
<p>II.</p>
<p>Since the end of World War II and the ushering in of the postmodern age,
it has become the norm to challenge and disassemble the authorities of
yesteryear. An idea which has been passed on for hundreds of years is
of the same value as one freshly conceived that morning; it is our
intellect, and nothing else, that arbitrates between them.</p>
<p>The deconstruction was a cultural breakthrough, but it has
left us more sensitive to dialectical tensions yet <a href="https://www.theatlantic.com/politics/archive/2016/05/the-peril-of-writing-a-provocative-email-at-yale/484418/">poorly equipped</a>
to resolve them.</p>
<p>Ultimately, the postmodern vision has been a gift and a curse. We cannot hold
ourselves right <em>a priori</em>, and we must ultimately find a way to balance
flexibility of thought with a valuing of tradition. The ideal balance is one
that allows an individual to change their mind, while at the same time
creating some incentive to stick to one’s beliefs. We must find a way to
embrace change without fearing destruction.</p>
<p>It is easy to speak in generalities; it is hard to put things into action.
In an attempt at the latter, we will bring this balance into practice,
as a demonstration and extension of <a href="http://nbviewer.jupyter.org/github/kronosapiens/thesis/blob/master/tex/thesis.pdf">this theory</a> of preference graphs.</p>
<p>III.</p>
<p>To review the language of preference graphs, we have an individual <script type="math/tex">e</script>, who has
preferences written as <script type="math/tex">(b,a)</script> when <script type="math/tex">e</script> prefers <script type="math/tex">a</script> over <script type="math/tex">b</script>. We can
imagine <script type="math/tex">(b,a)</script> as a preference, or arrow, from <script type="math/tex">b</script> to <script type="math/tex">a</script>.</p>
<p>Thus far when aggregating preferences, all preferences have been given a weight of 1.
Now we introduce a new dimension to preferences: the <em>authority</em> of a
preference, a variable weight defined as some function of the time <script type="math/tex">t</script> since
<script type="math/tex">e</script> first expressed the preference <script type="math/tex">p = (b,a)</script>. If <script type="math/tex">p</script> is an arbitrary
preference and <script type="math/tex">t_p</script> is the time since that preference was first expressed,
then the authority of <script type="math/tex">p</script> can be defined as:</p>
<script type="math/tex; mode=display">auth(p) \triangleq f(t_p)</script>
<p>The authority function is intentionally general; any function will do, and
the choice of function will shape our interpretation of the “authority.”
Using a monotonically-increasing function, like the logarithm, creates an
authority curve which is intuitive and useful.</p>
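To make this concrete, here is a minimal sketch in Python; the function name and the particular log(1 + t) choice are illustrative, not part of the theory itself:

```python
import math

def authority(t_p, f=lambda t: math.log(1 + t)):
    """Authority of a preference first expressed t_p time units ago.

    Any function will do; log(1 + t) is one monotonically-increasing
    choice, starting at zero and growing ever more slowly.
    """
    return f(t_p)

assert authority(0) == 0.0             # a fresh preference has no authority
assert authority(10) < authority(100)  # authority accumulates with time
```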
<p>IV.</p>
<p>What happens when we incorporate time into applications of preference graphs?
Assuming a monotonically-increasing authority function and rational,
self-interested participants, we might expect the following.</p>
<ol>
<li>
<p>Individuals are incentivized to register their preferences as early as
possible. Assume that individuals would like their views to have the
maximum impact on the group. In the context of an online application or
service, this creates a valuable incentive to adopt the product as
early as possible.</p>
</li>
<li>
<p>Individuals will change their preferences less frequently. If an individual
changes their preference, the authority of that preference resets. If an
individual then decides that their original preference was the right one,
the preference resets again, and the accumulated authority of their initial
preference is lost. There is an incentive to get it right the first time.</p>
</li>
<li>
<p>Individuals will change their preference when their views truly change.
There is no benefit to holding on to views one no longer agrees with:
if an individual truly feels differently about an issue, then updating
their preference will achieve the desired directional effect.</p>
</li>
</ol>
<p>In summary, the addition of a time dimension to a preference-aggregation
platform creates powerful incentives to both adopt the platform and to
behave responsibly once on the platform. It is especially worth noting that
the additional computational complexity associated with incorporating the
time dimension is small: <script type="math/tex">O(n)</script>. That so many positive effects emerge
from a simple computation is highly suggestive.</p>
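As a sketch of why the cost stays at <script type="math/tex">O(n)</script>: the time-weighted aggregation is a single pass over the preferences. The names and the log(1 + age) authority function here are illustrative assumptions:

```python
import math

def aggregate(preferences, now):
    """Authority-weighted aggregation in a single O(n) pass.

    preferences: iterable of ((b, a), t_expressed) pairs, where the
    edge (b, a) means "a is preferred over b". The log(1 + age)
    weighting is one illustrative choice of authority function.
    """
    weights = {}
    for edge, t_expressed in preferences:
        age = now - t_expressed
        weights[edge] = weights.get(edge, 0.0) + math.log(1 + age)
    return weights

prefs = [(("b", "a"), 0), (("b", "a"), 90), (("c", "a"), 90)]
w = aggregate(prefs, now=100)
# The long-standing (b, a) preference carries more total authority
# than the recently-expressed (c, a) preference.
assert w[("b", "a")] > w[("c", "a")]
```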
Thu, 14 Sep 2017 00:00:00 +0000
http://kronosapiens.github.io/blog/2017/09/14/time-and-authority.html
Tags: voting, consensus, blockchain, blog

On Meaning in Games

<h1 id="i">I.</h1>
<p>Many years ago I read an interview of one of the game designers of World of Warcraft, a very popular massively multiplayer online role-playing game. It was a great interview; I was most struck by the way the designer thought about the construction of the game world. I’ve tried to find the interview, to no avail. Here is my recollection of this response:</p>
<blockquote>
<p>When you’re designing a game like World of Warcraft, you need to realize that you’re not designing a single game, but rather many games, one nested inside the other like a set of Russian dolls. At the top level, you have the main narrative: Arthas the Lich King is bent on destroying Azeroth, and you as the player must become strong to defeat him.</p>
</blockquote>
<blockquote>
<p>Beneath that, there is the game of becoming a strong hero. You go on adventures and fight monsters and advance your skills and acquire powerful, enchanted items.</p>
</blockquote>
<blockquote>
<p>Beneath that, there is the actual dungeon raid: you must fight through the dungeon (or mountain, or fortress) and defeat the dragon (or necromancer, or baron) who guards the treasure.</p>
</blockquote>
<blockquote>
<p>Beneath that, there is combat with a single monster: casting spells, swinging swords, pressing “heal” every sixty seconds.</p>
</blockquote>
<blockquote>
<p>Each one of these levels is a game, and each one of these games has to be fun. Character development must be fun. Dungeon crawls must be fun. Combat must be fun. If the game world is compelling and richly realized, but combat mechanics are tedious, then the game will not be fun to play. Conversely, if combat mechanics are well-executed, but the game world is flat and boring, then the game will lose its appeal. For a game to have staying power, each of the multiple layers of the game needs to be fun.</p>
</blockquote>
<p>I loved this description, and what it revealed about the nature of games, experience, and meaning. The top-level game (defeat Arthas!) provides the main structure of the story: a single loop, after which the whole game ends. This top-level loop provides meaning to the levels of gameplay which support it: you are developing your hero, fighting monsters, etc <em>in order to</em> defeat Arthas. Without Arthas, the other games would have no meaning.</p>
<p>Looking at it from the other direction, we see how the top-level game (defeat Arthas!) is realized, or implemented, by the levels beneath it. To defeat Arthas, we must strengthen our character, confront dangers, and defeat monsters. The top-level game is realized via the lower-level games, which provide the substance of the game experience. If these lower level games are poorly realized, then the process of defeating Arthas will be unrewarding.</p>
<p>Notice also how the top-level games consist of longer narrative loops which cycle less frequently, while the low-level games are short cycles which repeat frequently.</p>
<p>These relationships exist in miniature between any subset of these games. The desire to “clear” a dungeon provides meaning to individual fights with monsters; the individual fights with monsters constitute the experience of clearing the dungeon.</p>
<p>By the time we get to the simplest of games, that of “pressing the heal button every sixty seconds,” that simple action has been imbued with meaning from the multiple games stacked upon it.</p>
<h1 id="ii">II.</h1>
<p>More than a handful of <a href="https://en.wikipedia.org/wiki/Nir_Eyal">careers</a> have been made in Silicon Valley selling foolproof formulas for creating “addictive” products, doling out “dopamine hits” at irregular intervals.</p>
<p>As an exercise, let’s analyze a selection of products from the perspective of hierarchical levels of gameplay and attempt to understand their appeal, with a mind to meaning coming top-down and substance coming bottom-up.</p>
<p>Along the way, we will discuss the various forms which these games can take. For example, we will see how meaning can come from real-world social relationships, as much as from in-game narratives.</p>
<h3 id="pokemon-go">Pokemon Go</h3>
<p>Pokemon Go made waves when it was released in Summer 2016, notable for the way it incorporated real-world location into gameplay. Players would walk around the real world and encounter Pokemon via a Google Maps-style game screen. The game soared to the top of the charts, and that summer the streets were full of people playing the game.</p>
<p>As summer turned to fall, however, players started to drop off. The core game mechanics had lost their lustre, and with no support for duels and trading, players found little reason to keep catching pokemon.</p>
<p>Here, we see how a well-executed (in fact, groundbreaking) low-level game mechanic (walk around and catch pokemon) became unappealing after time, as there was no higher-level narrative to provide meaning to the game. The original Game Boy games pioneered the walk-and-catch mechanic (albeit without the real-world link), and made that mechanic the foundation for a larger narrative involving personal rivalry, evil organizations, and a league of fellow trainers to fight. Pokemon Go upgraded the core mechanic but failed to include any higher-level structure, such as a narrative or any type of peer-to-peer gameplay.</p>
<p>What is remarkable is how Niantic, the maker of Pokemon Go, failed to implement any of these levels of gameplay in the year following the game’s release. They spent the months following the release working to make the hugely popular game more stable (the weeks following the release were plagued with crashes), which was the right thing to do. After that, though, they chose to add additional nuances to the core catch mechanic (which was fine as-is) rather than implement any higher-level mechanics. As the game is now, it is incomplete.</p>
<p>The huge phenomenon the game became in the initial months suggests that there is an appetite for an augmented-reality Pokemon game; if Niantic were ever to complete the game, it is likely that users would return.</p>
<h3 id="facebook">Facebook</h3>
<p>At first glance, Facebook might not appear to be a game. But as an application which millions of people interact with regularly during their leisure time, it is worth considering it as such.</p>
<p>With Facebook, the low-level loop comes via participation in the history feed. We post updates and share media, and our contributions are acknowledged via likes and comments. The “game” is that of contributing content which is the most appealing to our network.</p>
<p>One might observe that this “low-level” game loop is the entirety of the platform; where then are the higher-level game loops seemingly necessary for long-term appeal? The answer is that for Facebook, the higher-level game loop is our social life itself, which exists outside of the network, and is augmented by it. The genius of Facebook is that it embeds itself within an existing narrative structure, and makes itself indispensable to it.</p>
<h3 id="donkey-kong">Donkey Kong</h3>
<p>A classic arcade game, in which the iconic characters of Mario and Donkey Kong make their debut. This game consists of maneuvering Mario up a series of platforms and ladders, while avoiding the rolling barrels that Donkey Kong continually throws down. Donkey Kong has kidnapped a princess, and Mario is trying to reach her to rescue her.</p>
<p>The game features a handful of levels, each level being completed when Mario has reached the top of the screen. The game ends when Mario reaches the top of the last level and rescues the princess.</p>
<p>Although the game is simple, it has stood the test of time, even becoming the subject of a <a href="https://en.wikipedia.org/wiki/The_King_of_Kong">documentary</a>. As with the example of Facebook, one might wonder as to the source of the game’s long-term appeal. The low-level loops are evident: jump over barrels and climb up ladders. What is it about this game which brings players back, year after year?</p>
<p>The answer, as with Facebook, is the social embeddedness, in the form of the high score board. While an individual playthrough of Donkey Kong takes place in isolation and with minimal narrative, the opportunity to earn a high score and add one’s initials to the public board (marking one’s territory as a reigning champion) gives meaning to what would otherwise become a fairly rote game.</p>
<h3 id="swarm">Swarm</h3>
<p>Swarm is a social life-logging app developed by my company (Foursquare). In it, users “check in” to places they go in the real world, unlocking badges, earning coins, and competing for mayorships. Checking in to venues comprises the game’s low-level loop. The game’s initial design tried to create three higher-level loops: one for single players (badges), multiple players (mayorships), and social networks (leaderboards).</p>
<p>This design was effective for a period, and the app (then known as Foursquare) grew quickly in popularity. Over time, however, adoption leveled off and the app never achieved the popularity of Facebook. With fewer users came weaker network effects, and the leaderboards were not compelling for many users. The single-player badge loop and multi-player mayorship loop are engaging, and serve well as mid-level game loops, but do not go far enough in constructing a top-level narrative to provide meaning to the basic check-in.</p>
<p>One idea for Swarm would be to extend the idea of badges and create a hierarchy of “Explorer’s Clubs,” encompassing increasingly large geographies. As a resident of Bushwick, a neighborhood in Brooklyn, I can join the Bushwick Explorer’s Club by checking in at least 50 times, to 10 different venues, in 5 categories. I can join the Crown Heights Explorer’s Club by meeting the same criteria in that neighborhood. The Brooklyn Explorer’s Club (corresponding to a larger geography) can be entered by joining the Explorer’s Clubs of at least four constituent neighborhoods. The New York City club is joined analogously, after gaining entry to the clubs of at least three boroughs. We can envision state and national level clubs following the same structure.</p>
<p>The Explorer’s Club mechanic provides a top-level narrative by furnishing the implicit goal of gaining entry to the elite “World Explorer’s Club,” at which point the user has “won” Swarm. In the same stroke, it creates additional mid-level loops via the more accessible neighborhood and city clubs. It creates an incentive for users to check in to a variety of places as they visit new cities, as joining local clubs helps the user gain access to regional and national clubs.</p>
<p>Foursquare could enhance this mechanic further by providing in-app (or even real-world) benefits to the members of these myriad clubs (who by definition are power users of the app). These benefits, if implemented well, would add to the meaning of these Explorer’s Clubs, and consequently imbue the core check-in mechanic with even more meaning.</p>
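The neighborhood-level eligibility check could be sketched as follows. This is a hypothetical design, not Foursquare's actual API; the function name is invented, and the thresholds are the ones proposed above (50 check-ins, 10 venues, 5 categories):

```python
def qualifies_for_club(checkins, min_checkins=50, min_venues=10, min_categories=5):
    """Eligibility for a neighborhood Explorer's Club (hypothetical).

    checkins: list of (venue_id, category) pairs within one neighborhood.
    Thresholds default to the proposed 50 check-ins / 10 venues / 5 categories.
    """
    venues = {venue for venue, _ in checkins}
    categories = {category for _, category in checkins}
    return (len(checkins) >= min_checkins
            and len(venues) >= min_venues
            and len(categories) >= min_categories)

# Fifty check-ins at a single cafe would not qualify: not enough variety.
assert not qualifies_for_club([("cafe-1", "coffee")] * 50)
```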
Mon, 29 May 2017 00:00:00 +0000
http://kronosapiens.github.io/blog/2017/05/29/on-meaning-in-games.html
Tags: games, design, philosophy, blog

Objective Functions in Machine Learning

<p>Machine learning can be described in many ways. Perhaps the most useful is as a type of optimization. Optimization problems, as the name implies, deal with finding the best, or “optimal” (hence the name), solution to some type of problem, generally mathematical.</p>
<p>In order to find the optimal solution, we need some way of measuring the quality of any solution. This is done via what is known as an <strong>objective function</strong>, with “objective” used in the sense of a goal. This function, taking data and model parameters as arguments, can be evaluated to return a number. Any given problem contains some parameters which can be changed; our goal is to find values for these parameters which either maximize or minimize this number.</p>
<p>The objective function is one of the most fundamental components of a machine learning problem, in that it provides the basic, formal specification of the problem. For some objectives, the optimal parameters can be found exactly (known as the analytic solution). For others, the optimal parameters cannot be found exactly, but can be approximated using a variety of iterative algorithms.</p>
<p>Put metaphorically, we can think of the model parameters as a ship in the sea. The goal of the algorithm designer is to navigate the space of possible values as efficiently as possible to guide the model to the optimal location.</p>
<p>For some models, the navigation is very precise. We can imagine this as a boat on a clear night, navigating by the stars. For others, the ship is stuck in a fog, able to make only small jumps without reference to a greater plan.</p>
<p>Let us consider a concrete example: finding an average. Our goal is to find a value, <script type="math/tex">\mu</script>, which is the best representation of the “center” of some set of n numbers. To find this value, we define an objective: the sum of the squared differences, between this value and our data:</p>
<script type="math/tex; mode=display">\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^n (x_i - \mu)^2</script>
<p>This is our objective function, and it provides the formal definition of the problem: to <strong>minimize an error</strong>. We can analyze and solve the problem using calculus. In this case, we rely on the foundational result that the minimum of a function is reliably located at the point where the derivative of the function takes on a zero value. To solve the function, we take the derivative, set it to 0, and solve for <script type="math/tex">\mu</script>:</p>
<script type="math/tex; mode=display">\frac{d}{d\mu} \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n -2(x_i - \mu) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n (x_i - \mu) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n x_i = n\mu</script>
<script type="math/tex; mode=display">\frac{\sum x_i}{n} = \mu</script>
<p>And so. We see that the value which minimizes the squared error is, in fact, the mean. This elementary example may seem trite, but it is important to see how something as simple as an average can be interpreted as a problem of optimization. Note how the value of the average changes with the objective function: the mean is the value which minimizes the sum of squared error, but it is the median which minimizes the sum of <em>absolute error</em>.</p>
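Both claims are easy to verify numerically. A small sketch, comparing a grid search over candidate centers against the analytic answers:

```python
def sse(mu, xs):
    """Sum of squared errors around a candidate center mu."""
    return sum((x - mu) ** 2 for x in xs)

def sae(mu, xs):
    """Sum of absolute errors around a candidate center mu."""
    return sum(abs(x - mu) for x in xs)

xs = [1.0, 2.0, 2.0, 3.0, 10.0]
mean = sum(xs) / len(xs)            # 3.6
median = sorted(xs)[len(xs) // 2]   # 2.0

# Grid-search each objective for its minimizer.
grid = [i / 100 for i in range(0, 1101)]
best_sse = min(grid, key=lambda m: sse(m, xs))
best_sae = min(grid, key=lambda m: sae(m, xs))

assert abs(best_sse - mean) < 1e-9    # mean minimizes squared error
assert abs(best_sae - median) < 1e-9  # median minimizes absolute error
```

Note how the outlier 10.0 pulls the mean toward itself but leaves the median untouched: a direct consequence of the two objectives.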
<p>In this example, the problem could be solved analytically: we were able to find the exact answer, and calculate it in linear time. For other problems, the objective function does not permit an analytic or linear-time solution. Consider the logistic regression, a classification algorithm whose simplicity, flexibility, and robustness has made it a workhorse of data teams. This algorithm iterates over many possible classification boundaries, each iteration yielding a more discriminant classifier. Yet, the true optimum is never found: the algorithm simply terminates once the solution has reached relative stability.</p>
<p>There are other types of objective functions that we might consider. In particular, we can conceive of the <em>maximizing of a probability</em>.</p>
<p>Part of the power of probability theory is the way in which it allows one to reason formally (with mathematics) about that which is fundamentally uncertain (the world). The rules of probability are simple: events are assigned a probability, and the probabilities must all add to one, because <em>something</em> has to happen. The way we represent these probabilities, however, is somewhat arbitrary – a list of real numbers summing to 1 will do. In many cases, we use functions.</p>
<p>Consider flipping a coin. There are two possible outcomes: heads and tails. The probability of heads and the probability of tails must add to 1, because one of them must come up. We can represent this situation with the following <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">equation</a>:</p>
<script type="math/tex; mode=display">p^x(1-p)^{1-x}</script>
<p>Here <script type="math/tex">x</script> represents the flip: <script type="math/tex">x = 1</script> means heads and <script type="math/tex">x = 0</script> means tails, and <script type="math/tex">p</script> is the probability of heads. We see that if the coin is heads, the value is <script type="math/tex">p</script>, the chance of heads. If the coin is tails, the value is <script type="math/tex">1-p</script>, which by necessity is the chance of tails. We call this equation <script type="math/tex">P(x)</script>, and it is a probability distribution, telling us the probability of each outcome.</p>
<p>Now, not all coins are fair (meaning that <script type="math/tex">p = 1-p = 0.5</script>). Some may be unfair – with heads, perhaps, coming up more often. Say we flipped a coin a few times, and we were curious as to whether the coin was biased. How might we discover this? Via the likelihood equation. Intuitively, we seek a value of p which gives the <em>maximum likelihood</em> to the coin flips we saw.</p>
<p>The word maximum should evoke our earlier discussion: we are again in the realm of optimization. We have a function and are looking for an optimal value: except now instead of minimizing an error, we want to <strong>maximize a likelihood</strong>. Calculus helped us once before – perhaps it may again?</p>
<p>Here is the <em>joint likelihood distribution</em> of our series <script type="math/tex">x</script> of n coin flips (now <script type="math/tex">x</script> represents many flips, each individual flip subscripted <script type="math/tex">x_1</script>, etc):</p>
<script type="math/tex; mode=display">P(x) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}</script>
<p>The thing to note here is that the probability of two <em>independent</em> events (i.e. one does not give us knowledge about the other) occurring together is the product of the probabilities of the events separately. In this case, the coin flips are <em>conditionally independent</em> given heads probability p. One consequence is that <script type="math/tex">P(x) \in (0, 1)</script>, and generally much closer to 0 than 1.</p>
<p>The <em>logarithm</em> is a remarkable function. When introduced in high school, the logarithm is often presented as “the function which tells you the power to which you would need to raise a number to get back the original argument”, or put more succinctly, the degree to which you would need to exponentiate a base. This exposition obscures the key applications of the logarithm:</p>
<ol>
<li>It makes small numbers big, and big numbers small.</li>
<li>It turns multiplication into addition.</li>
<li>It increases monotonically (if <script type="math/tex">x</script> gets bigger, <script type="math/tex">log(x)</script> gets bigger).</li>
</ol>
<p>The first point helps motivate the use of “log scales” when presenting data of many types. Humans (and computers) are comfortable reasoning about magnitudes along certain types of scales; others, such as exponential scales, are less intuitive. The logarithm allows us to interpret events of incredible magnitude in a more familiar way. This property, conveniently, also comes in handy when working with very small numbers – such as those involved in joint probability calculations, in which the probability of any particular complex event is nearly 0. The logarithm takes very small positive numbers and converts them to more comfortable, albeit negative, numbers – much easier to think about (and, perhaps more importantly, compute with).</p>
<p>The second point comes in handy when we attempt the actual calculus. By turning multiplication into addition, the function is more easily differentiated, without resorting to cumbersome applications of the product rule.</p>
<p>The third point provides the essential guarantee that the optimal solution for the log function will be identical with the optimal solution for the original function. This means that we can optimize the log function and get the right answer for the original.</p>
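All three properties can be checked numerically; a minimal sketch:

```python
import math

tiny = 1e-300  # a joint-probability-sized number

# 1. Tiny positive numbers become manageable (if negative) magnitudes.
assert math.log(tiny) < -690

# 2. Multiplication becomes addition.
a, b = 0.03, 0.007
assert abs(math.log(a * b) - (math.log(a) + math.log(b))) < 1e-12

# 3. Monotonicity: log preserves ordering, so it preserves the optimum.
assert (math.log(a) > math.log(b)) == (a > b)
```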
<p>Taking the logarithm of the joint likelihood function, we get the <strong>log likelihood</strong>:</p>
<script type="math/tex; mode=display">log(P(x)) = \sum_{i=1}^n x_ilog(p) + (1-x_i)log(1-p)</script>
<p>What can we do with this? In this problem, we can use it to find the optimal value for p. Taking the derivative of this function with respect to p (recall that the derivative of <script type="math/tex">log(x)</script> is <script type="math/tex">1/x</script>), and setting to 0, we have:</p>
<script type="math/tex; mode=display">\frac{d}{dp}log(P(x)) = \sum_{i=1}^n \frac{x_i}{p} - \frac{1-x_i}{1-p} = 0</script>
<p>We can solve for p:</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \frac{x_i(1-p)}{p} - (1-x_i) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n (\frac{x_i}{p}-x_i) - (1-x_i) = 0</script>
<script type="math/tex; mode=display">\sum_{i=1}^n \frac{x_i}{p} - 1 = 0</script>
<script type="math/tex; mode=display">\frac{\sum x_i}{p} = n</script>
<script type="math/tex; mode=display">\frac{\sum x_i}{n} = p</script>
<p>And so again, the optimal value for the probability p of heads is, for this particular definition of optimal, the ratio of observed heads to total observations. We see how our intuition (“the average!”) is made rigorous by the formalism.</p>
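We can confirm the derivation numerically: the ratio of heads maximizes the log likelihood over a grid of candidate values for p. This is a sketch for illustration, not part of the original derivation:

```python
import math

def log_likelihood(p, flips):
    """Log likelihood of i.i.d. Bernoulli flips under heads-probability p."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in flips)

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 7 heads in 10 flips
p_hat = sum(flips) / len(flips)           # the analytic optimum: 0.7

grid = [i / 1000 for i in range(1, 1000)]  # open interval avoids log(0)
best = max(grid, key=lambda p: log_likelihood(p, flips))

assert abs(best - p_hat) < 1e-9  # the grid search agrees with the calculus
```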
<p>This example is a model of a simple object. More advanced objects (such as a constellation of interdependent events) require more advanced models (such as a Hidden Markov Model), for which the optimal solution involves many variables and as a consequence more elaborate calculations. In some cases, as with the logistic regression, the exact answer cannot ever be known, only <a href="http://kronosapiens.github.io/blog/2015/11/22/understanding-variational-inference.html">iteratively approached</a>.</p>
<p>In all of these cases, however, the log of the likelihood function remains an essential tool for the analysis. We can use it to calculate a measure of quality for an arbitrary combination of parameters, as well as use it (in a variety of ways) to attempt to find optimal parameters in a computationally efficient way. Further, while the examples given above are possibly the two simplest non-trivial examples of these concepts, they capture patterns of derivation which recur in more complex models.</p>
Tue, 28 Mar 2017 00:00:00 +0000
http://kronosapiens.github.io/blog/2017/03/28/objective-functions-in-machine-learning.html
Tags: machine learning, optimization, math, blog

Master's Thesis

<p>I am very happy to announce that I have completed a master’s thesis. Titled “An Analysis of Pairwise Preference”, it describes a theory of decision-making which allows for the formal analysis of subjective preferences. As an academic work, it is an exercise in developing a theory from first principles, demonstrating a grasp of method and rigor. The experience of writing it was excellent.</p>
<p><strong>Abstract:</strong></p>
<blockquote>
<p>Human coordination can be thought of as a problem in information processing. The subjective interiors of the entities involved create challenges for formal analysis of preference. This work develops theoretical foundations for analysis of subjective preference, and develops and evaluates a number of algorithms for undertaking such analyses.</p>
</blockquote>
<p>The thesis consists of <a href="https://github.com/kronosapiens/thesis/tree/master/data">data</a>, <a href="https://github.com/kronosapiens/thesis/tree/master/code">code</a>, and <a href="http://nbviewer.jupyter.org/github/kronosapiens/thesis/blob/master/tex/thesis.pdf">tex</a>. Feedback welcome.</p>
Mon, 06 Feb 2017 00:00:00 +0000
http://kronosapiens.github.io/blog/2017/02/06/thesis.html
Tags: math, computer science, economics, blog

A Basic Computing Curriculum

<p>Over the summer, I developed a <a href="https://github.com/kronosapiens/computing-basics"><strong>basic curriculum</strong></a> for those looking to become more proficient with computers. This project emerged from work done as a TA for QMSS G4063: Data Visualization. While assisting that course, it became clear that students were struggling with basic computing tasks only tangentially related to the material of the course. As a result, much time was spent teaching students basic computing and programming concepts: the ambient knowledge that those comfortable with software take for granted.</p>
<p>I received positive reviews from students and faculty, and was hired to develop a simple, introductory curriculum for incoming students. The curriculum covers the following topics:</p>
<ol>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/0-overview">Overview of computer architecture and interfaces</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/1-development">Software development concepts</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/2-python">Introduction to Python</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/3-r">Introduction to R</a></li>
<li><a href="https://github.com/kronosapiens/computing-basics/tree/master/4-web">Networking and the web</a></li>
</ol>
<p>The goal of this content is not to be comprehensive, but rather to be succinct, accessible, and intuitive, without sacrificing precision and accuracy. The curriculum is designed to be interacted with: the hosting on GitHub and the included code templates are intended to provide entry points that encourage students to interact with files as developers, and to gain comfort with software development interfaces.</p>
<p>This is a work in progress. Suggestions and feedback would be welcomed :)</p>
Tue, 04 Oct 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/10/04/a-basic-computing-curriculum.html
http://kronosapiens.github.io/blog/2016/10/04/a-basic-computing-curriculum.htmlprogrammingeducationblogThe Problem of Information II<h4 id="1">1</h4>
<p>In Part I, we established the data processing inequality and used it to conclude that no analysis of data can increase the amount of information we have about the world, beyond the information provided by the data itself:</p>
<script type="math/tex; mode=display">X \rightarrow Y \rightarrow \hat{X}</script>
<script type="math/tex; mode=display">\Rightarrow</script>
<script type="math/tex; mode=display">I(\hat{X};X) \leq I(Y;X)</script>
<p>We’re not quite done. The fundamental problem is not simply <em>learning about the world</em>, but rather <em>human learning about the world</em>. The full model might look something like this:</p>
<script type="math/tex; mode=display">\text{world} \rightarrow \text{measurements} \rightarrow \text{analysis} \rightarrow [perception] \rightarrow \text{understanding}</script>
<p>Incorporating the human element requires a larger model and additional tools.</p>
<h4 id="2">2</h4>
<p>A <strong>channel</strong> is the medium by which information travels from one point to another:</p>
<script type="math/tex; mode=display">X \rightarrow [channel] \rightarrow Y</script>
<p>At one end, we have information, encoded as some sort of representation. We send this representation through the channel. A receiver at the other end picks up a signal, which it reconstructs into a representation of its own. Hopefully, this reconstruction is close to the original.</p>
<p>No (known) channel is perfect. There is too much uncertainty in the underlying physics and mechanics of a channel’s actual construction. Mistakes are made. Bits are flipped. The amount of information that a channel can reliably transmit is called that channel’s <strong>capacity</strong>. For a given channel, capacity is denoted <script type="math/tex">C</script>, and for input variable <script type="math/tex">X</script> and output variable <script type="math/tex">Y</script> it is defined like this (<script type="math/tex">\triangleq</script> means “defined as”):</p>
<script type="math/tex; mode=display">C_{channel} \triangleq \max_{P(X)} I(X; Y)</script>
<p>This means that the capacity is equal to the maximum mutual information between <script type="math/tex">X</script> and <script type="math/tex">Y</script>, over all distributions on <script type="math/tex">X</script>. Using a well-known identity, we can rewrite this equation as follows:</p>
<script type="math/tex; mode=display">I(X; Y) = H(Y) - H(Y|X)</script>
<p>This shows us that capacity is a function of both the entropy of <script type="math/tex">Y</script> and the conditional entropy of <script type="math/tex">Y</script> given <script type="math/tex">X</script>. The conditional entropy represents the uncertainty in <script type="math/tex">Y</script> given <script type="math/tex">X</script> – in other words, the quality of the channel (for a perfect channel, this value would be <script type="math/tex">0</script>). <script type="math/tex">H(Y)</script> is determined by the input distribution <script type="math/tex">P(X)</script> and the channel, and is what we try to maximize when determining capacity.</p>
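<p>As a sanity check, this identity can be verified numerically. The following sketch (in Python, with a made-up joint distribution) computes <script type="math/tex">I(X;Y)</script> as <script type="math/tex">H(Y) - H(Y|X)</script>:</p>

```python
import math

def H(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A toy joint distribution P(X, Y) over two binary variables (invented for illustration).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yi), p in joint.items() if yi == y) for y in (0, 1)}

# H(Y|X) = sum over x of P(x) * H(Y | X = x)
H_Y_given_X = sum(
    p_x[x] * H([joint[(x, y)] / p_x[x] for y in (0, 1)])
    for x in (0, 1)
)

I_XY = H(p_y.values()) - H_Y_given_X  # about 0.278 bits for this distribution
```

For this distribution, knowing <script type="math/tex">X</script> removes a bit more than a quarter of a bit of uncertainty about <script type="math/tex">Y</script>.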
<p>Observe that capacity <script type="math/tex">C</script> is a function of both the channel and the randomness of the input. For a fixed channel, capacity is a function of the input. For a fixed input, capacity is a function of the channel (here it is known as “distortion”).</p>
<p>Below is an example of what is known as a “<a href="https://en.wikipedia.org/wiki/Binary_symmetric_channel">Binary Symmetric Channel</a>” – a channel with two inputs and outputs (hence binary), and a symmetric probability of error <script type="math/tex">p</script>. This diagram should be interpretable given what we’ve discussed above.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Binary_symmetric_channel_%28en%29.svg/800px-Binary_symmetric_channel_%28en%29.svg.png" alt="binary symmetric channel" /></p>
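<p>For the binary symmetric channel, the maximization works out in closed form: the capacity is <script type="math/tex">C = 1 - H(p)</script>, where <script type="math/tex">H(p)</script> is the binary entropy function. A minimal sketch in Python:</p>

```python
import math

def h2(p):
    """Binary entropy function H(p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - h2(p)

# A perfect channel (p = 0) carries a full bit per use;
# a coin-flip channel (p = 0.5) carries nothing.
```

Note that a crossover probability of about 0.11 already cuts the capacity in half.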
<p>The major result in information theory concerning channel capacity goes like this:</p>
<script type="math/tex; mode=display">P_e^{(n)} \rightarrow 0 \Rightarrow C \geq H(X)</script>
<p>What this says is that for any transmission scheme where the probability of error (<script type="math/tex">P_e^{(n)}</script>) goes to zero (as block length <script type="math/tex">n</script> increases), the capacity of the channel is greater than or equal to the entropy of the input. This is true even for perfect channels (with no error) – meaning that <script type="math/tex">H(X)</script>, the uncertainty inherent in the source, is a <strong>fundamental limit</strong> in communication.</p>
<p>More plainly, we observe that successful transmission of information requires a channel that is less uncertain than the source you’re trying to transmit. This should be intuitively satisfying. If the channel is more chaotic than what you’re trying to communicate, the output will be more a result of that randomness than whatever message you wanted to send. The flip interpretation, which is less intuitive, is that the more random the source, the more tolerant you can be of noisy channels.</p>
<p>Finally, the converse tells us that any attempt to send a high-entropy source through a low-capacity channel is guaranteed to result in high error.</p>
<h4 id="3">3</h4>
<p>With that established, we can now consider the question of <strong>human</strong> communication:</p>
<script type="math/tex; mode=display">\text{stimulus} \rightarrow [perception] \rightarrow \text{impression}</script>
<p>Let’s consider the metaphor and see if it holds. We want to say that the process of communication is exposing those around us to stimulus (ourselves, media, etc), having that stimulus transmitted through the channels of perception, and ultimately represented in the mind as some sort of impression (such as an understanding or feeling). On a first impression, this seems reasonable and general.</p>
<p>What is <em>not</em> present here is the concept of <strong>intention</strong>. In our communication, we may at various points be trying to teach, persuade, seduce, amuse, mislead, or learn. What is also absent is the concept of “creativity”, or receiving an impression somehow greater than the stimulus. We will return to these questions later and see if we can address them.</p>
<p>Let’s consider a simple case: the teacher trying to teach. We can assume good intention and an emphasis on the transfer of information. We model as follows:</p>
<script type="math/tex; mode=display">\text{teaching} \rightarrow [perception] \rightarrow \text{learning}</script>
<p>The “capacity” of human perception is then:</p>
<script type="math/tex; mode=display">C_{perception} = \max_{P(\text{teaching})} I(\text{teaching}; \text{learning})</script>
<script type="math/tex; mode=display">I(\text{teaching}; \text{learning}) = H(\text{learning}) - H(\text{learning}|\text{teaching})</script>
<p>This allows us to consider both the randomness of the source (teaching), and the uncertainty in the transmission (perception). We seem justified in proposing the following:</p>
<ol>
<li>The challenge of teaching is in maximizing the information the student has about the subject.</li>
<li>A subject is “harder” if there is more complexity in the subject matter.</li>
<li>A subject is also “harder” if it is difficult to convey the material in an understandable way.</li>
<li>A “good” teacher is one who can present the material in a way that is appropriate for the students.</li>
<li>A “good” student is one who can make the most sense of the material that was presented.</li>
</ol>
<p>Let’s begin with (4), the idea of material being tailored to the student, or the input being tailored to the channel. Intuitively, we would like to say that a good teacher can change the teaching (the stimulus) they present to the student in order to maximize the student’s learning.</p>
<p>First, consider that material may be too advanced for some students. We would like to say then that the capacity of that student was insufficient for the complexity of the material. To say this, we must first consider the relationship between randomness and complexity.</p>
<h4 id="4">4</h4>
<p>The language of information theory is the language of randomness and uncertainty. In teaching, it is more comfortable to speak in the language of complexity, difficulty, or challenge. Can these be equivalent?</p>
<p>Entropy is a measure of randomness, and entropy is a function of both 1) the number of possible outcomes of a random process, and 2) the likelihood of the various outcomes. A 100-sided fair die is more random than a 10-sided fair die, while a 1000-sided die that always comes up 7 is not really random at all.</p>
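<p>The die comparison can be checked directly (a small sketch; the “loaded” distribution below is one arbitrary way of modeling a die that almost always comes up 7):</p>

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_10 = [1 / 10] * 10                        # 10-sided fair die: ~3.32 bits
fair_100 = [1 / 100] * 100                     # 100-sided fair die: ~6.64 bits
loaded_1000 = [0.999] + [0.001 / 999] * 999    # 1000 faces, almost always the same one
```

Despite having far more faces, the loaded die carries only a tiny fraction of a bit of uncertainty.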
<p>Complexity, on the other hand, can be understood as the number and nature of the relationships among various parts of a system. We can perhaps formalize this as the number of pathways by which a change in one part of the system can affect the overall state of the system.</p>
<p>To argue equivalence, we assert that there is always some degree of uncertainty in any system, or in any field of study. In math, these are formalized as variables. In history, these can be the motivations of various actors. The more complex a system, the larger the number of possible outcomes and the denser the relationships between its components. In the language of probability, we say there are more possible outcomes, and that, due to the complex relationships between parts, many different outcomes carry significant odds.</p>
<p>Consider the example of teaching math. Arithmetic is simpler than geometry, in that the expression</p>
<script type="math/tex; mode=display">2 + 2</script>
<p>contains fewer conceptual “moving pieces” than the expression</p>
<script type="math/tex; mode=display">\sin(45°)</script>
<p>Understanding arithmetic requires the student to keep track of the concept of “magnitude” and be able to relate magnitudes via relations of joining (addition and subtraction) and scaling (multiplication and division). It requires the abstract concept of negative numbers.</p>
<p>Understanding geometry requires more tools. It requires students to be able to deal with points in space, and understand how to use the Cartesian plane to represent the relationship between points and numbers. It introduces the idea of “angle” as a new kind of relationship, on top of arithmetic’s “bigger” and “smaller”.</p>
<p>Put another way, arithmetic requires only a line, while geometry requires a plane. More concepts means more possible relationships between objects, which means more possible dimensions of uncertainty, which means more complexity.</p>
<p>We conclude at least a rough equivalence between complexity and uncertainty.</p>
<h4 id="5">5</h4>
<p>Returning to the teaching example, we can now speak in terms of complexity of the material instead of randomness of the source. If material is too complex for the student (<script type="math/tex">% <![CDATA[
C < H(X) %]]></script>), then the material cannot be taught to that student (yet).</p>
<p>Observe that the channel (the student) is not fixed, but is able to handle increasingly complex subjects over time.</p>
<p>… to be continued?</p>
Thu, 19 May 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/05/19/the-problem-of-information2.html
http://kronosapiens.github.io/blog/2016/05/19/the-problem-of-information2.htmlinformation-theorymachine-learningblogThe Problem of Information<h4 id="1">1</h4>
<p>The Data Processing Inequality is one of the first results in information theory.</p>
<p>It can be stated as follows:</p>
<p><em>No transformation of measurements of the world can increase the amount of information available about that world.</em></p>
<p>In formal language, it goes like this:</p>
<p>Given a first-order Markov chain</p>
<script type="math/tex; mode=display">X \rightarrow Y \rightarrow \hat{X}</script>
<p>such that <script type="math/tex">\hat{X}</script> depends only on <script type="math/tex">Y</script>, which depends only on <script type="math/tex">X</script>, then</p>
<script type="math/tex; mode=display">I(\hat{X};X) \leq I(Y;X)</script>
<p>The measure <script type="math/tex">I(A;B)</script> is known as the <a href="https://en.wikipedia.org/wiki/Mutual_information">mutual information</a>, a measure of how much information one variable gives us about another.</p>
<p>What this says is that the information <script type="math/tex">\hat{X}</script> tells us about <script type="math/tex">X</script> cannot be more than the information we already had from <script type="math/tex">Y</script>. In other words, that <strong>processing</strong> data adds no new information.</p>
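<p>This can be illustrated numerically. In the sketch below (Python; the crossover probabilities are arbitrary), a fair coin <script type="math/tex">X</script> passes through one noisy stage to give <script type="math/tex">Y</script>, and <script type="math/tex">Y</script> through a second stage to give <script type="math/tex">\hat{X}</script>; the second stage of processing can only lose information:</p>

```python
import math

def mutual_info(joint):
    """I(A;B) in bits, from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items() if p > 0
    )

# Stage 1: X ~ Bernoulli(0.5); Y flips X with probability 0.1.
joint_xy = {}
for x in (0, 1):
    joint_xy[(x, x)] = 0.5 * 0.9
    joint_xy[(x, 1 - x)] = 0.5 * 0.1

# Stage 2: Xhat flips Y with probability 0.2; accumulate the (X, Xhat) joint.
joint_x_xhat = {}
for (x, y), p in joint_xy.items():
    for xhat, q in ((y, 0.8), (1 - y, 0.2)):
        joint_x_xhat[(x, xhat)] = joint_x_xhat.get((x, xhat), 0.0) + p * q

I_xy = mutual_info(joint_xy)            # ~0.531 bits
I_x_xhat = mutual_info(joint_x_xhat)    # ~0.173 bits: processing lost information
```

No choice of second stage, deterministic or random, could have made the second number exceed the first.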
<h4 id="2">2</h4>
<p>Let’s consider the problem of learning from data. Let’s put it in the framework:</p>
<script type="math/tex; mode=display">\text{the world} \rightarrow \text{some measurements} \rightarrow \hat{\text{your analysis}}</script>
<p>which implies</p>
<script type="math/tex; mode=display">I(\hat{\text{your analysis}};\text{the world}) \leq I(\text{some measurements};\text{the world})</script>
<p>In other words, analysis doesn’t tell you anything new. What it <strong>does</strong> do, though, is make the information you already have more easily digestible. It puts it in forms you can work with. Think averages and odds. Think dashboards. Less information, but more actionable.</p>
<h4 id="3">3</h4>
<p>Let’s take this a bit further. Think of your analysis as a function, <script type="math/tex">G</script>, of your data. This gives us:</p>
<script type="math/tex; mode=display">X \rightarrow Y \rightarrow G(Y)</script>
<p>We can then formulate the learning problem as a search over the space of possible functions <script type="math/tex">G</script>. In order to assess the quality of one <script type="math/tex">G</script> over another, we must use some sort of measure of “expressiveness”. Call this <script type="math/tex">E</script>, such that <script type="math/tex">E[G(Y)]</script> is some measurement of the expressiveness of the analysis <script type="math/tex">G(Y)</script>.</p>
<p>Our goal becomes finding an optimal function <script type="math/tex">G^*</script> such that:</p>
<script type="math/tex; mode=display">E[G^*(Y)] \geq E[G(Y)], \forall G</script>
<p>In other words, that <script type="math/tex">G^*</script> maximizes the expressive power of the data <script type="math/tex">Y</script>. Our choice of <script type="math/tex">E</script> drives the exploration of the space of possible <script type="math/tex">G</script>.</p>
<p>This is the general formulation. To see how this general formulation maps to practice, let’s take <script type="math/tex">G</script> to be some sort of classification or regression model and <script type="math/tex">E</script> to be the log likelihood or squared error. Note how we have described the typical machine learning setting. To see how this formulation helps frame different problems, let’s take <script type="math/tex">G</script> to be a causal graph – what then should <script type="math/tex">E</script> be? How could one select an <script type="math/tex">E</script> to drive exploration of the space of causal graphs?</p>
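<p>To make the search concrete, here is a hypothetical miniature version (all names and numbers invented for illustration): <script type="math/tex">G</script> is a one-parameter family of linear analyses, <script type="math/tex">E</script> is negative mean squared error, and we scan a parameter grid for <script type="math/tex">G^*</script>:</p>

```python
# Measurements Y, and the quantities we want the analysis to express (made-up data).
data_y = [1.0, 2.0, 3.0, 4.0]
targets_x = [2.1, 3.9, 6.2, 7.8]  # roughly x = 2 * y, plus noise

def expressiveness(a):
    """E[G(Y)] for the analysis g_a(y) = a * y: negative mean squared error."""
    return -sum((a * y - x) ** 2 for y, x in zip(data_y, targets_x)) / len(data_y)

# Search the (tiny) space of candidate functions G for the maximizer G*.
candidates = [a / 10 for a in range(0, 51)]
best_a = max(candidates, key=expressiveness)
```

Swapping in a different <script type="math/tex">E</script> – say, a log likelihood under some noise model – would drive the search toward a different <script type="math/tex">G^*</script>.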
<h4 id="4">4</h4>
<p>If our goal is to understand the world, then it would seem as though we have two opportunities for growth.</p>
<p>First, in our measurements. The world is of infinite dimension, and any measurement is a finite reflection. Measurements are choices, and the dimensions along which we choose to measure will place the upper bound on our usable knowledge.</p>
<p>Second, in our analysis. Given a finite set of measurements, <script type="math/tex">Y</script>, our goal is to transform this into a different representation that expresses the information necessary to a given task, with “expressiveness” itself given by some measure. If that task is prediction or classification (core learning problems), then expressiveness will almost certainly be measured either via the likelihood of the analysis or the smallness of the error. But there can be other tasks and other measures of expression.</p>
<p>Which, at this time, is our limiting factor? Are we limited by our analysis, unable to make sense of what we know? Or are we limited by our measurements, trying to navigate with skewed vision?</p>
<p>Do you know?</p>
<h4 id="5">5</h4>
<p><strong>Proof:</strong></p>
<script type="math/tex; mode=display">I(\hat{X};X)</script>
<p>Definition of mutual information:</p>
<script type="math/tex; mode=display">= H(X) - H(X|\hat{X})</script>
<p>Conditioning reduces entropy:</p>
<script type="math/tex; mode=display">\leq H(X) - H(X|\hat{X}, Y)</script>
<p>By Markov property:</p>
<script type="math/tex; mode=display">= H(X) - H(X|Y)</script>
<p>Voila:</p>
<script type="math/tex; mode=display">= I(X;Y)</script>
<p><strong>Note:</strong> <script type="math/tex">H(A)</script> denotes the <strong><a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a></strong> of the random variable <script type="math/tex">A</script>, a measure of uncertainty in <script type="math/tex">A</script>. Given that more information can’t hurt, the following is always true:</p>
<script type="math/tex; mode=display">H(A|B) \leq H(A)</script>
Sat, 16 Apr 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/04/16/the-problem-of-information.html
http://kronosapiens.github.io/blog/2016/04/16/the-problem-of-information.htmlinformation-theorymachine-learningblogElements of Modern Computing<p>The goal of this guide is to explain in a high-level but useful way the core concepts of modern computing. This guide is aimed at those who have never interacted with software as more than an end-user of graphical applications, but who for whatever reason have a desire for more flexible and precise control over their computer and its software.</p>
<h2 id="understanding-the-filesystem">Understanding the Filesystem</h2>
<p>The most important thing to remember when doing any sort of programming is that every command is run in the context of some <strong>location</strong>. Your Desktop is a location. Your Documents folder is a location. Everything on your computer has a location, and everything exists in relation to everything else. The whole thing is called a <strong>filesystem</strong>. Here is an illustration of the typical Mac OSX filesystem (Windows filesystems are fairly similar). Indentation implies nesting:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/
Applications/
Chess.app
Rstudio.app
iTunes.app
...
System/
Library/
Users/
Guest/
Shared/
<username>/
Desktop/
file.txt
Documents/
Downloads/
...
bin/
pwd
ls
chmod
...
var/
log/
tmp/
...
etc/
...
</code></pre>
</div>
<p>The key takeaway here is that every file in your computer has a location, and this location can be described by the full, or “absolute” path. For example, the <code class="highlighter-rouge">file.txt</code> file on the desktop can be described in the following way:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/Users/<username>/Desktop/file.txt
</code></pre>
</div>
<p>No matter where you are in your computer, this path will always reference the same file. However, it would be tedious to have to type this verbose path every time you needed to reference a file.</p>
<p>Fortunately, there are shortcuts. One is that the tilde (<code class="highlighter-rouge">~</code>) character stands for <code class="highlighter-rouge">/Users/<username>/</code>, where <code class="highlighter-rouge"><username></code> is the current logged-in user (i.e. you). Using the tilde, you can reference the same <code class="highlighter-rouge">file.txt</code> as:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>~/Desktop/file.txt
</code></pre>
</div>
<p>Since many of the files you’ll be working with are stored inside your user directory, this shortcut is often helpful.</p>
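<p>The same expansion is available programmatically. For example, Python’s standard library will expand the tilde for you (a small sketch):</p>

```python
import os.path

# "~" expands to the home directory of the current logged-in user.
path = os.path.expanduser("~/Desktop/file.txt")

assert os.path.isabs(path)                   # the result is an absolute path
assert path.endswith("Desktop/file.txt")     # the relative portion is preserved
```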
<p>Another major shortcut is to use something called a “relative” path. To learn more, read on.</p>
<h2 id="understanding-the-command-line">Understanding the Command Line</h2>
<p><strong>Really, watch <a href="https://www.youtube.com/watch?v=tc4ROCJYbm0">the video</a>.</strong></p>
<p>To understand the command line, it’s valuable to first understand the various “layers” that make up a computer.</p>
<p>At the very bottom, there’s the <strong>hardware</strong>: chips, memory, and electricity. These are the fruits of electrical engineering and can do simple things very, very quickly. Programming these directly is very tedious. As a result, we wrap the hardware in a core piece of software, known as a <strong>kernel</strong>. The kernel is software that controls the hardware and the basic resources (CPU power, memory) of the computer. To get the computer to do things, we talk to the kernel. Note how, by “abstracting” away the computer internals, the problem of managing complexity just got a little bit easier.</p>
<p>For many people, interaction with a computer takes place via a graphical user interface, otherwise known as a “GUI”. Icons on your desktop, double-clicks, drag-and-drop – all of these are GUI operations. The GUI is a program, like any other, which puts things on the screen and interprets keystrokes and trackpad activity. The GUI talks to the kernel and turns your clicks and keystrokes into actions.</p>
<p>A GUI is a very sophisticated program, and GUI-based computers have been around only for the last twenty or so years. Before graphical interfaces were popular (or even possible), computing took place via a much simpler interface. That interface was known as the <strong>command line</strong>, otherwise known as the <strong>shell</strong>.</p>
<p>Why shell? Because the shell was a program that <em>wrapped around</em> (get it?) the kernel and provided a convenient way to run commands. A shell is also a program, much simpler than a GUI, which provides a text-based user interface.</p>
<p>Why would someone use a shell over a more user-friendly GUI? Principally, for control. The primary drawback of a GUI is that it can only do what it was programmed to do. It is very hard to program a GUI, and the interfaces popular on modern computers are virtually impossible to modify. A shell, on the other hand, is a simple program that can do almost anything. If a GUI is intuitive but inflexible, a command line is less intuitive (at first), but extremely powerful and flexible. For programmers, data scientists, and others for whom work involves the organization and manipulation of information, this power and flexibility is crucial. This is why people use the command line.</p>
<h2 id="working-with-the-shell">Working with the shell</h2>
<p>A shell is just a program. There are many kinds of shell. On Mac OSX, the default is the <a href="https://en.wikipedia.org/wiki/Bash_(Unix_shell)">bash shell</a>. On Windows, there is <a href="https://en.wikipedia.org/wiki/Windows_PowerShell">PowerShell</a>. On OSX, you can open a bash shell by opening the “Terminal” application. On Windows, there is a PowerShell application.</p>
<p>Firing up the shell will bring you to a boring-looking screen which looks something like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Hi! Welcome to the shell!
$ _
</code></pre>
</div>
<p>Not much to look at. You type some things and hit enter. Something happens. Rinse and repeat until your computer crashes or you’re a billionaire. Things get more interesting when you realize that every command you type into the shell looks like the following:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>
$ <program> <arguments>
</code></pre>
</div>
<p>There is always one program, and an arbitrary number of arguments. This is how every command works. Your computer comes with several dozen built-in programs, which do very simple but useful things. We will review them shortly. First, however, we must discuss the concept of a “working directory”.</p>
<p>Recall that any file can be fully described by its absolute path. When interacting with files (a common activity), you could in theory specify the full path of every file. This would be extremely tedious, especially given that, for any given task, related files are generally close together. In OSX and Windows, the Finder and Explorer programs let you browse through folders; these GUIs accomplish the same goal.</p>
<p>The <strong>working directory</strong> is your shell’s current “location” inside the filesystem. You can “navigate” the filesystem by running commands which cause the shell to move up or down directory hierarchies. This is analogous to double-clicking on a folder on your desktop.</p>
<p>When you are in a directory, every file argument is evaluated as though the current working directory were prepended to the argument. Let’s see an example, taking place in the context of the following filesystem:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/
dir1/
file1.py
</code></pre>
</div>
<p>Here’s the example:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/$ python dir1/file1.py
'Hello world'
/$ cd dir1
/dir1$ python file1.py
'Hello world'
/dir1$ cd ..
/$ python file1.py
/usr/bin/python: can't open file 'file1.py': [Errno 2] No such file or directory
</code></pre>
</div>
<p>Here, we saw two programs run across five commands.</p>
<ol>
<li>First, we ran the <code class="highlighter-rouge">python</code> program on argument <code class="highlighter-rouge">dir1/file1.py</code>, a python file.</li>
<li>Then, we ran the <code class="highlighter-rouge">cd</code> (change directory) program on argument <code class="highlighter-rouge">dir1</code>, also a file (in Unix, <a href="https://en.wikipedia.org/wiki/Everything_is_a_file">everything is a file</a>, even a directory.) Note how the command prompt changes to reflect the fact that we moved through the filesystem.</li>
<li>We ran the <code class="highlighter-rouge">python</code> program on argument <code class="highlighter-rouge">file1.py</code>, the same python file.</li>
<li>We ran the <code class="highlighter-rouge">cd</code> program on <code class="highlighter-rouge">..</code>, representing the <em>parent directory</em> of the current directory.</li>
<li>We ran the <code class="highlighter-rouge">python</code> program on argument <code class="highlighter-rouge">file1.py</code>, but received an error, because there is no such file in our current location.</li>
</ol>
<p>Hopefully this example illustrates how a shell is used to navigate a filesystem and execute commands within it.</p>
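<p>The same working-directory mechanics are exposed to programs. Mirroring the shell session above, Python’s standard library can print, change, and resolve paths against the working directory (a small sketch):</p>

```python
import os

start = os.getcwd()    # the equivalent of pwd
os.chdir("/")          # the equivalent of `cd /`

# Relative paths are resolved against the current working directory:
assert os.path.abspath("file1.py") == "/file1.py"

os.chdir(start)        # return to where we started
```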
<h2 id="command-reference">Command Reference</h2>
<p><em>For Linux-style command line interfaces</em></p>
<p>Note that in this program reference, filenames and directories can be given as either absolute or relative paths.</p>
<p><code class="highlighter-rouge">pwd</code> – print working directory</p>
<p><code class="highlighter-rouge">ls</code> – list files in current directory</p>
<p><code class="highlighter-rouge">touch <filename></code> – makes a new file</p>
<p><code class="highlighter-rouge">rm <filename></code> – delete a file</p>
<p><code class="highlighter-rouge">cd <directory></code> – change current directory to <code class="highlighter-rouge"><directory></code></p>
<p><code class="highlighter-rouge">python <filename>.py</code> – run a Python file</p>
<p><code class="highlighter-rouge">mkdir <directory></code> – create a new directory</p>
<p><code class="highlighter-rouge">rmdir <directory></code> – delete a directory</p>
<p><code class="highlighter-rouge">mv <filename1> <filename2></code> – move <code class="highlighter-rouge"><filename1></code> to <code class="highlighter-rouge"><filename2></code></p>
<p><code class="highlighter-rouge">cp <filename1> <filename2></code> – copy <code class="highlighter-rouge"><filename1></code> to <code class="highlighter-rouge"><filename2></code></p>
<p><code class="highlighter-rouge">cat <filename></code> – print the entire file to the screen</p>
<p><code class="highlighter-rouge">head <filename></code> – print the first few lines of a file to the screen</p>
<p><code class="highlighter-rouge">tail <filename></code> – print the last few lines of a file to the screen</p>
<p><code class="highlighter-rouge">man <program></code> – show additional information about the program</p>
<p><code class="highlighter-rouge">echo <string></code> – print <code class="highlighter-rouge"><string></code> to the screen (as in, a string of letters)</p>
<p><code class="highlighter-rouge">ps</code> – print information about currently-running processes (instances of programs)</p>
<p>Note that programs often accept additional optional arguments. Consider:</p>
<p><code class="highlighter-rouge">tail -n 20 <filename></code> – print the last 20 lines of a file to the screen</p>
<p><code class="highlighter-rouge">ls -lah</code> – list files in current directory in a super easy-to-read format</p>
<p><code class="highlighter-rouge">ps -aux</code> – print a lot of information about currently-running processes</p>
<p><code class="highlighter-rouge">kill <integer></code> – kill the process with process id <code class="highlighter-rouge"><integer></code></p>
<h2 id="stdin-stdout-processes-piping">Stdin, Stdout, Processes, Piping</h2>
<p>By default, every program takes input from one place, “Standard Input” (<code class="highlighter-rouge">stdin</code>), and sends output to one place, “Standard Output” (<code class="highlighter-rouge">stdout</code>). In general, <code class="highlighter-rouge">stdin</code> is the keyboard/trackpad. <code class="highlighter-rouge">stdout</code> is the screen. A surprisingly large amount of programming boils down to routing the output of one program into the input of another (either via some common data store like a file or database, or directly via a <strong>pipe</strong>).</p>
<p>When you execute a command in the shell, the program “takes control” of the terminal while it is running. When it finishes, it returns control to the shell. While the program is running, it may request input from stdin (such as asking for a password). It may also send output to stdout (for example, updating you on the program’s progress).</p>
<p>A <strong>process</strong> is an instance of a running program. To think of it another way, a program is just a bunch of zeroes and ones sitting in memory; a process is that program being executed, step by step, on the computer’s CPU. The same program can be run as many processes. The fundamental rule about computers is that processes aren’t allowed to mess with each other’s memory. The kernel makes sure of this (remember the kernel?).</p>
<p>It is possible to “background” a process, which simply means that you don’t let that process take control of your shell. This lets you type in more commands while the process is running. At times this behavior may be desirable. On Unix-like systems (Linux, OSX), you can background a process using the ampersand, like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/$ <program> <arg1> <arg2> &
</code></pre>
</div>
<p>Also, processes can (and often do) create (“spawn”) other processes. The Chrome process spawns Tab processes, and so on. Processes can control other processes. There are basically no limits, except that processes can only interact via a specific interface. Processes can change their working directory, or spawn subprocesses in other directories.</p>
<p>Finally, <strong>piping</strong> is a technique for routing the output of one program into the input of another:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/$ <program1> <args> | <program2>
</code></pre>
</div>
<p>Here, the flow of information would go as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>stdin -> program1 -> program2 -> stdout
</code></pre>
</div>
<p>This is useful as it allows you to combine simple programs to create more complicated programs, which <a href="https://en.wikipedia.org/wiki/Unix_philosophy">some people think is a good idea</a>.</p>
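<p>The same plumbing is available from within programs. As a sketch (assuming a Unix-like system where <code class="highlighter-rouge">echo</code> and <code class="highlighter-rouge">wc</code> are available), here is <code class="highlighter-rouge">echo hello world | wc -w</code> reconstructed with Python’s <code class="highlighter-rouge">subprocess</code> module:</p>

```python
import subprocess

# Start `echo`, with its stdout routed into a pipe rather than the screen.
echo = subprocess.Popen(["echo", "hello", "world"], stdout=subprocess.PIPE)

# Start `wc -w`, with its stdin connected to `echo`'s stdout.
wc = subprocess.Popen(["wc", "-w"], stdin=echo.stdout, stdout=subprocess.PIPE)
echo.stdout.close()  # let wc see end-of-file when echo finishes

word_count = int(wc.communicate()[0])  # "hello world" contains 2 words
```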
<h2 id="debugging-error-messages">Debugging Error Messages</h2>
<p>Debugging is a large topic. Here we will discuss the most important principle for debugging anything. That principle is: <strong>simplify and isolate the problem.</strong></p>
<p>Software systems are complex, with many moving parts. Managing this complexity is a big part of software engineering. When trying to fix something, the best way to do it is to isolate the problem.</p>
<p>Imagine you are building an app that streams data from the internet, parses it, and loads it into a custom GUI. The app is currently broken. At this point in time, the problem could be with the internet, the parser, or the GUI. Your first job is to figure out where the problem is. This means isolation. Some things to consider:</p>
<ol>
<li>Replacing streaming data with static dummy data</li>
<li>Feeding the GUI static dummy data</li>
<li>Testing the parser on dummy data</li>
</ol>
<p>By testing each of these pieces in a controlled environment, it becomes possible to find and fix problems. There are many tools to help in this process (debuggers being a major one). If these tools are not available to you, then a sure-fire approach is to cut away and simplify your program as much as possible until it works, and then carefully rebuild from there.</p>
<p>Please read your error messages. They are in words and they mean things. If you read them they will usually tell you what the problem is so you can fix it. This is true a surprising amount of the time.</p>
<p>Reading error messages is intimidating at first, but will become easier over time as you develop a better sense of <em>what</em> kinds of things tend to go wrong, as well as what information error messages are able to convey.</p>
<p>If you see an error message that you don’t understand, do the following:</p>
<ol>
<li>Copy the error</li>
<li>Paste it into a Google searchbar</li>
<li>Look over the top couple of answers</li>
</ol>
<p>This will work extremely frequently. <a href="http://stackoverflow.com/">StackOverflow</a> deserves much thanks for this.</p>
<h2 id="a-kitchen-metaphor">A Kitchen Metaphor</h2>
<p>As a former student of the controversial linguist <a href="https://en.wikipedia.org/wiki/George_Lakoff">George Lakoff</a>, I would be remiss not to at least attempt one grand metaphor.</p>
<p>Think of your computer as a professional kitchen. Imagine shelves of recipe books, each containing many instructions for how to make certain dishes. Imagine teams of chefs and sous-chefs, working away at various dishes. Raw ingredients are transformed into delicious meals. The resources consumed – gas for the oven, water for the sink – are accessed via oven ranges and sink spouts.</p>
<p>The recipes are programs – they sit idle until some chefs are asked to prepare them. The actual cooking of a dish is a process – in which the kitchen’s resources are organized and devoted to preparing a dish. Ingredients and dishes are files – the objects on which we operate. The ovens and sinks are the kernel – the interface abstracting away the underlying resources, making them easier to work with. One recipe can contain many components (sub-recipes), which need to be spawned off and prepared separately.</p>
<p>You can have a million recipes, but only cook two dishes. You can have five recipes, but make each one a thousand times. You can cook one dish at a time, and leave most of your resources unused, or turn on every single oven.</p>
<p>The owner of the restaurant is <a href="https://en.wikipedia.org/wiki/Sudo">sudo</a> – able to overrule even the head chef, and set whatever rules she wants.</p>
<h2 id="a-practical-example">A Practical Example</h2>
<p>Imagine you want to do some R programming using RStudio. You begin by double-clicking the RStudio icon on your dock, which brings up the RStudio GUI. You write some R code inside of the GUI, and run it. Let’s think about what happened:</p>
<ol>
<li>Double-clicking the RStudio icon read the RStudio program and created an RStudio process. Most (but not all) of this process is the graphical user interface.</li>
<li>While starting up, RStudio checked its settings file to see what its current working directory should be (<code class="highlighter-rouge">~</code> by default).</li>
<li>When you ran your code in the GUI, RStudio spawned a new process in the context of that working directory, which executed your code and returned the result. This is known as a REPL (a “Read-Evaluate-Print Loop”).</li>
</ol>
<p>Now, imagine your code references a data file – say, a JSON file containing a bunch of tweets. This file is stored in some location. When you’re writing your R code, the location you give <em>must be relative to the file in which it is called.</em></p>
<p>For example, if your working directory is <code class="highlighter-rouge">~</code>, and your file, <code class="highlighter-rouge">code/script.R</code>, looks like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># script.R
tweets = parseTweets('tweets.json')
</code></pre>
</div>
<p>Then RStudio will run this command on the file located at <code class="highlighter-rouge">~/code/tweets.json</code>. If the file is elsewhere, you’ll get an error.</p>
<h2 id="git-and-github">Git and GitHub</h2>
<p>The last topic in this guide is Git. Git is not related to the command line per se; rather, it is an important tool for the development of software. In many fields, but most of all in software, version matters. Software products are in constant states of change. Software projects almost always require the collaboration of multiple people. Managing all of this activity can be challenging. Enter Git.</p>
<p>Git is a tool for recording change over time, making it easier for people working on software not only to make sure they have the correct and up-to-date version of their code, but also to go back and see where and why changes were made. Known as source control, this bookkeeping is now fundamental in modern software development.</p>
<p>The core unit of Git is the <strong>repository</strong>. A repository can be thought of as the unit being remembered. Work on a repository is saved in chunks known as <strong>commits</strong>. A commit can be thought of as a unit of memory. Developing software is then a process of committing changes to a repository representing the software project.</p>
<p>Some benefits of using Git:</p>
<ol>
<li>Easy to restore files if they have been lost or damaged</li>
<li>Easy to get changes to files or folders without having to re-download the entire file or folder</li>
</ol>
<p>There are more, but these two are so useful that rather than enumerate them it may be best to pause and reflect on these instead. Git, like many things, is a program.</p>
<p>GitHub, on the other hand, is a website which makes it easy to collaborate with others via the internet. GitHub uses Git as a foundation, and builds on it. This distinction is subtle but worth knowing.</p>
<p>Using Git involves understanding a handful of commands. Here are most of them:</p>
<p><code class="highlighter-rouge">git clone <url to repo></code> – clone a repository from GitHub to your computer</p>
<p><code class="highlighter-rouge">git add .</code> – stage all changed files in the current directory (and below) for the next commit</p>
<p><code class="highlighter-rouge">git commit -am "<commit message>"</code> – make a commit, with the message <code class="highlighter-rouge"><commit message></code></p>
<p><code class="highlighter-rouge">git push</code> – “push” new changes up to GitHub</p>
<p><code class="highlighter-rouge">git pull</code> – “pull” new changes from GitHub</p>
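<p>To see how these commands fit together, here is a minimal local walkthrough. It substitutes <code>git init</code> in a throwaway directory for <code>git clone</code>, so it runs without a GitHub account or network access; the file name and commit message are arbitrary:</p>

```shell
set -e
repo=$(mktemp -d)                    # a throwaway directory for the repository
cd "$repo"
git init -q                          # create a new, empty repository
git config user.email "you@example.com"   # an identity is required to commit
git config user.name "Your Name"

echo "first draft" > notes.txt       # do some work
git add .                            # stage the changes
git commit -q -m "Add notes"         # record them as a commit
git log --oneline                    # one line per commit: "<hash> Add notes"
```
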
<p>To get started:</p>
<ol>
<li><a href="https://github.com/">Make a GitHub Account</a>. Everyone who writes programs for anything has one. It’s like having an email address. If you don’t have one people will think you’re weird and won’t hire you.</li>
<li><a href="https://git-scm.com/book/en/v2/Getting-Started-Installing-Git">Install Git</a>. You can install just the command-line program or the fancy GUI.</li>
</ol>
<p>If you are taking a class taught via GitHub, one fairly effective workflow is the following:</p>
<ol>
<li>“Fork” the course repository. What this basically means is that you’re going to copy the course repository to your own account.</li>
<li>Clone the forked repository to your computer.</li>
<li>Create an “upstream remote” pointing to the original course repository.</li>
<li>Point RStudio’s default working directory to the course repository.</li>
<li>Avoid many problems.</li>
</ol>
<p>Whenever you want to sync your repository with the course repository, run <code class="highlighter-rouge">git pull upstream master</code> to pull changes from the upstream repository into your fork. What this will allow you to do is to build on top of the course materials, while making it easy to synchronize with updated material as necessary.</p>
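<p>The fork-and-sync flow above can be sketched end to end. Since the real flow needs a GitHub account and a network, this sketch stands a local directory in for the course repository; the names <code>course</code>, <code>mine</code>, and <code>syllabus.txt</code> are invented for the example:</p>

```shell
set -e
work=$(mktemp -d)

# 1. The instructor's repository (standing in for the one on GitHub)
git init -q "$work/course"
cd "$work/course"
git config user.email "teacher@example.com"
git config user.name "Teacher"
echo "week 1" > syllabus.txt
git add . && git commit -q -m "Week 1 materials"
branch=$(git symbolic-ref --short HEAD)    # "master" or "main", per your git

# 2. "Fork" and clone it, then add an upstream remote
git clone -q "$work/course" "$work/mine"
cd "$work/mine"
git remote add upstream "$work/course"

# 3. New material lands upstream...
(cd "$work/course" && echo "week 2" >> syllabus.txt \
  && git commit -q -a -m "Week 2 materials")

# 4. ...and syncing your copy is a single pull
git pull -q upstream "$branch"
cat syllabus.txt    # now contains both weeks
```
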
Thu, 03 Mar 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/03/03/elements-of-modern-computing.html
Blockchain as Talmud<h1 id="the-talmud">The Talmud</h1>
<p><a href="http://kronosapiens.com/2013/01/25/the-religious-mindset/">The Jewish Talmud is a remarkable object</a>. It is the product of hundreds of years of intense, rigorous, and highly formal debate and scholarship. It has served as the backbone of the Jewish people. The trunk of the tree.</p>
<p>The Talmud has a very interesting property, the inspection of which will prove illuminating.</p>
<p>First, one of the fundamental rules of Talmudic scholarship is that a recent scholar cannot contradict, reject, or overrule an older scholar. If a recent scholar takes issue with an historical analysis, the only avenue available to them is to reinterpret the intention of the older scholar. This chain of interpretation stretches back, unbroken, to the first commentaries and ultimately to the Old Testament and the Ten Commandments.</p>
<p>As such, one can in theory always trace a current value, decision, or opinion back through history. Further, a concept or idea cannot be introduced arbitrarily, but must be rooted and stem from an existing concept or idea. Similarly, once an idea has been accepted, it can never be fully rejected, only reinterpreted.</p>
<p>The quality of the interpretations, as well as the intention of the interpreters, is a subject of ongoing debate and, well, interpretation. But this property in general holds.</p>
<p>This has several important implications.</p>
<p><strong>First</strong>, there is no possibility of complete revolution. A “revolution” in Judaism, defined as a complete rejection of what has come before and the attempt to institute a new faith on entirely new foundations, could never occur. If such a thing were attempted, those individuals would be seen as a new sect, ultimately disconnected from primary Judaism. The core of Judaism, defined as those who adhere to the teachings of the Talmud and associated texts, is fundamentally connected to this canon. <em>Jews hold the Talmud as the primary authority</em>. No one is forcing the Jews to respect the Talmud; it is simply that the study of the Talmud and its instruction is the common denominator for the Jewish identity. Any individual Jew is, at any time, free to completely reject the entirety of the Talmud. Such a person, however, would cease to be accepted by their community. In this way, the Talmud coordinates the self-identifying community of Jews.</p>
<p><strong>Second</strong>, the faith is capable of substantial dynamism. Interpretations can be fanciful and radical. Although new work must be based on historical scholarship, it is often the case that scholars will not agree with their contemporaries. In this way, the faith subdivides into movements, each respecting a particular strand of interpretation.</p>
<p><strong>Third</strong>, there is no need for a central authority. The extent to which any individual person holds themselves accountable to these laws and interpretations is, of course, a personal decision. Orthodox Jews take these laws very literally, and yet various schools of the orthodoxy have varying interpretations of some of the more ambiguous implementations of the faith. The Conservative, Reform, Reconstructionist, and other more progressive flavors permit more liberal general interpretations – interpretations which are, of course, rejected by the orthodoxy. The key here is that this is a single canon, to which all Jews can be seen as being in relation. Particularly relevant is that while inter-movement relationships are undefined and may be nonexistent or even hostile, the overall coordination of the movements is implicitly achieved.</p>
<p><strong>Fourth</strong>, the faith as a whole can never be destroyed, as the Talmud functions also as a memory. Regardless of what may occur to some or even a majority of the adherents of the faith, the survivors will be able to rebuild the community to within an arbitrary precision. This implication is well-documented by historical experience.</p>
<p><strong>Fifth</strong>, the Talmud is “bigger” and “wiser” than any individual. Considering Plato’s “Philosopher King”, we observe that individual humans are insufficient for the task. A shared, dynamic history of thought, however, might be. As the product of a history of reason and debate, the Talmud represents a cultural history orders of magnitude larger than any individual person. It has a fundamentally different, unique, and very functionally-relevant ontology.</p>
<p><strong>Sixth</strong> and relatedly, the Talmud then exhibits “maximal intelligence”, in the sense that the unbroken chain of interpretation represents more overall experience than any record which allowed for erasure and editing.</p>
<h1 id="democracy">Democracy</h1>
<p>Let us briefly consider the similarities and differences between Talmud-based governance and the kind of Democratic governance exemplified by the United States.</p>
<p>In both cases, we have the principle that decisions must occur within set boundaries. In the case of the Talmud, those boundaries are historical scholarship and foundational texts. In the case of the United States, those boundaries are the Constitution and Bill of Rights.</p>
<p>In both cases, there are established mechanisms for making changes. In the case of the Talmud, changes are made via extrapolation and interpretation of past work. In the case of the United States, changes are made via a legislative process.</p>
<p>A salient difference is that in the case of the United States, future changes are disconnected from past changes; if the basic boundaries are respected, then anything goes. It is technically possible to “reinterpret” the basic boundaries by amending the constitution, but this seems highly unlikely.</p>
<p>As such, we see the Talmudic process as having more gradual changes, while the US process sways more easily in changing political winds.</p>
<p>Of course, much of this difference can be rooted in the fact that the United States must secure territory, and relies on the use of force to enforce rules. As a faith, the Jews permit less well-defined borders. The United States and the Jewish people, like all states and religious communities, are entities of a fundamentally different nature. As such, they seem to necessitate fundamentally different approaches to change and control. Religion is currently (although not historically) an opt-in experience; citizenship typically is not.</p>
<p>Yet, we see some shared principles in both forms of governance. The differences appear to be in large part necessary differences coming from the fundamentally different natures of these entities, in particular with regards to group membership. As such, we should not necessarily feel obligated to reconcile them.</p>
<h1 id="the-blockchain">The Blockchain</h1>
<p>The Talmud, its properties, and role in Jewish life provides a crucial case study for those interested in effective means of coordinating large groups of people absent a central authority. The fundamental mechanic is the strict requirement that future change emerges from past work, prohibiting both the introduction of the completely novel and the rejection of any history. The historical experience of the Jews has shown that, if put in motion upon adequate foundations, such a mechanic is sufficient for the decentralized coordination of the activity of millions of people across time and space.</p>
<p>Recently, we have seen the emergence of technology which shares this property. The Blockchain, first described in the 2008 paper “<a href="https://bitcoin.org/bitcoin.pdf">Bitcoin: A Peer-to-Peer Electronic Cash System</a>”, is in essence a decentralized public ledger, in which anything can be recorded and made publicly available. The principal mechanic of the Blockchain is that future entries must build upon past entries, and that any entry in the chain, once accepted, can never be deleted. In the words of the author, Satoshi Nakamoto:</p>
<blockquote>
<p>The only way to confirm the absence of a transaction <a href="http://www.econlib.org/library/Essays/hykKnw1.html">is to be aware of all transactions</a>. In the mint based model, the mint was aware of all transactions and decided which arrived first. To accomplish this without a trusted party, transactions must be publicly announced, and we need a system for participants to agree on a single history of the order in which they were received.</p>
</blockquote>
<p>The Blockchain is thus a decentralized authority in which all new changes must be a continuation of past work. As such, the example of the Talmud suggests that such a tool could be used for the effective decentralized coordination of large groups of people across time and space, without the need for a central authority or any force.</p>
<p>In order for the Blockchain to be used in this way, it will be necessary for a large group of people to regard the Blockchain as an authority on a wide variety of issues. As in the case of Jews and the Talmud, the Talmud is an authority because Jews <em>see it as an authority</em>. In an important sense, this is arbitrary. Fortunately, this sense suggests that there is no fundamental shortcoming which prevents a Blockchain from serving a similar purpose.</p>
<p>This leads us to some interesting questions:</p>
<ul>
<li>
<p>What kind of information should this Blockchain contain?</p>
</li>
<li>
<p>How should this Blockchain be updated?</p>
</li>
<li>
<p>What content, if any, should be placed at the base of the Blockchain?</p>
</li>
</ul>
<p>As a first pass, it would seem as though this Blockchain should function as a repository of social values. As the community defined by the Blockchain adapts, and external circumstances change, these values would be updated. At any point, any member of the community could inspect the history of these values and the reasoning for their changes. As time passed, this record would become a deep and strong foundation for that community, and a trusted authority on the values and purpose of that community. Coordination without an authority. Updates to this Blockchain would occur on a rough consensus basis, allowing for the possibility of a split of the Blockchain at some point if there emerged a major disagreement over the direction of the community.</p>
<p>Given the earlier discussion of Democracy, it does not seem at this time that the Blockchain can be used effectively to govern a state; issues of control and security seem to preclude the rational, gradual, consensus-based change process we are discussing. However, it does seem as though the Blockchain could serve as an effective authority and memory for a self-selected community with shared values and without borders.</p>
<p>One could point to other self-selecting communities with rough consensus-based decision-making processes, and observe that they succeed while employing very different decision-making systems. The Python community, with a BDFL and a PEP-based system of improvement, has been pretty effective. Yet a series of disconnected proposals is ontologically dissimilar to an unbroken chain of decisions. The importance of this dissimilarity is to be determined.</p>
<p>It would be very exciting to be a part of such a community.</p>
<h1 id="update-feb-1">Update (Feb 1)</h1>
<p>I shared this post with a few friends of mine with relevant domain knowledge. They gave valuable feedback and raised additional questions. Their responses are replicated below.</p>
<p><strong>From a Professor of Political Economy:</strong></p>
<blockquote>
<p>Interesting read–but I wonder about practicality. The beauty of many societies is being able to effect rapid change. The way you lay this out suggests that this may be a lot more difficult. It may be a way to insure calmer legislation, etc. But it doesn’t really seem to me to lead to faster movement of anything.</p>
</blockquote>
<p><strong>From a Rabbi:</strong></p>
<blockquote>
<p>This was fascinating. Thanks for sharing.
You know, I think the question I’m sitting with is, given decentralized systems of law or commerce, what role does the organizer or convener have, and how much authority is in that role. The Talmud, for example, is a collection of many voices, but someone did the collecting. The work of that someone, the editor (or likely, editors) is what academics are particularly fascinated with these days. So, too, with any wiki. There is someone who hosts it. How much control do they have? And I imagine that there are also decisionmakers with a Blockchain. What does it mean, then, to be at the center of a decentralized system?</p>
</blockquote>
Mon, 11 Jan 2016 00:00:00 +0000
http://kronosapiens.github.io/blog/2016/01/11/blockchain-as-talmud.html
Understanding Variational Inference<p>In one of my courses this semester, <a href="http://www.columbia.edu/~jwp2128/Teaching/E6892/E6892Fall2015.html">Bayesian Models for Machine Learning</a>, we’ve been spending quite a bit of time on a technique called “<a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">Variational Inference</a>”. I’ve spent the last few days working on an assignment using this technique, so I thought this would be a good occasion to test my knowledge by attempting to describe the method. Much credit to <a href="http://www.columbia.edu/~jwp2128/">John Paisley</a> for teaching me all this in the first place. The Columbia faculty are really top-notch.</p>
<h2 id="high-level-overview">High-level Overview</h2>
<p>First, a brief overview of Bayesian statistics. We begin with a distribution on our data <script type="math/tex">x</script>, parameterized by <script type="math/tex">\theta</script>, along with a prior on <script type="math/tex">\theta</script>:</p>
<script type="math/tex; mode=display">p(x | \theta)p(\theta)</script>
<p>Using basic rules of joint and conditional probability, we derive the theorem:</p>
<script type="math/tex; mode=display">p(\theta, x) = p(x, \theta)</script>
<script type="math/tex; mode=display">p(\theta | x)p(x) = p(x | \theta)p(\theta)</script>
<script type="math/tex; mode=display">\underbrace{
p(\theta | x) = \frac{p(x | \theta)p(\theta)}{p(x)}
}_{\text{Bayes Theorem}}</script>
<p>For those unfamiliar with Bayes Theorem, our goal is to use the data <script type="math/tex">x</script> to learn a <em>better</em> distribution on <script type="math/tex">\theta</script>. In other words, Bayes Theorem gives us a formal way to update our predictions, given our experience of the world. In particular,</p>
<script type="math/tex; mode=display">p(\theta | x)</script>
<p>is known as the <em>posterior</em> distribution of <script type="math/tex">\theta</script>. This distribution is our goal.</p>
<p>Now, the formulas given above are in terms of probability distributions. If we actually look under the hood at what these probabilities actually look like… well, it looks like a lot of calculus. Using Bayes Theorem involves working with those integrals. Fortunately, over the years statisticians have developed a pretty sophisticated body of knowledge around manipulating these probability distributions, so that (if you make smart choices about which distributions you pick) you can skip basically all of the calculus. This is convenient.</p>
<p>However, for more complicated models, things aren’t always guaranteed to work out so nicely. Sometimes, when we try to model something, we find that calculating the posteriors directly is impossible. What can we do?</p>
<p>Fortunately, statisticians have developed techniques for handling this. Variational Inference is one of those techniques.</p>
<p>Now, we will derive the VI master equation.</p>
<p>Recall some basic rules of probability:</p>
<script type="math/tex; mode=display">p(\theta | x)p(x) = p(x, \theta)</script>
<script type="math/tex; mode=display">p(x) = \frac{p(x, \theta)}{p(\theta | x)}</script>
<script type="math/tex; mode=display">lnp(x) = lnp(x, \theta) - lnp(\theta | x)</script>
<p>Now, we introduce an entirely new distribution, <script type="math/tex">q(\theta)</script>, and take the expectation with regards to this distribution:</p>
<script type="math/tex; mode=display">E_{q(\theta)}[lnp(x)] = E_{q(\theta)}[lnp(x, \theta)] - E_{q(\theta)}[lnp(\theta | x)]</script>
<script type="math/tex; mode=display">\int q(\theta) lnp(x) d\theta = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta</script>
<p>Observing that the left-hand term is constant with respect to <script type="math/tex">\theta</script>:</p>
<script type="math/tex; mode=display">lnp(x) \int q(\theta) d\theta = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta</script>
<script type="math/tex; mode=display">lnp(x) = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta</script>
<p>We then add and subtract the <a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution">entropy</a> of <script type="math/tex">q(\theta)</script>:</p>
<script type="math/tex; mode=display">lnp(x) = \int q(\theta) lnp(x, \theta) d\theta - \int q(\theta) lnp(\theta | x) d\theta
+ \int q(\theta) lnq(\theta) d\theta - \int q(\theta) lnq(\theta) d\theta</script>
<p>And reorganize:</p>
<script type="math/tex; mode=display">lnp(x) = \int q(\theta) (lnp(x, \theta) - lnq(\theta))d\theta - \int q(\theta) (lnp(\theta | x) - lnq(\theta)) d\theta</script>
<script type="math/tex; mode=display">lnp(x) =
\underbrace{
\int q(\theta) ln\frac{p(x, \theta)}{q(\theta)} d\theta
}_{L}
+ \underbrace{
\int q(\theta) ln\frac{q(\theta)}{p(\theta | x)} d\theta
}_{KL(q||p)}</script>
<p>Let’s take a moment to understand what was just derived. We have shown that the log probability of the random variable <script type="math/tex">x</script> is equal to the involved-looking equation on the right-hand side. This right-hand term is the sum of two terms. The first, which we call <script type="math/tex">L</script>, we will refer to as the “Variational Objective Function”. The second is the equation of something known as the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback–Leibler</a> divergence, or KL divergence for short.</p>
<p>Recall that our goal is to learn</p>
<script type="math/tex; mode=display">p(\theta | x)</script>
<p>We see that this term appears in the KL divergence, next to this new distribution we are calling <script type="math/tex">q(\theta)</script>. Conveniently, the KL divergence is a measure of the difference between two distributions. When the distributions are equal, the KL divergence equals 0. The more the distributions differ, the larger the term becomes. We’re not sure what this <script type="math/tex">q</script> distribution is, but let’s assume that we can control it. The closer it comes to approximating the posterior, the smaller the KL divergence will become.</p>
<p>We see also that the left-hand term, <script type="math/tex">lnp(x)</script>, is a constant (just the probability of the data). So we have a constant equal to an equation plus a term we want to minimize. Therefore, if we can find a way to <strong>maximize</strong> the <script type="math/tex">L</script> term, we will necessarily <strong>minimize</strong> the KL divergence. Therefore, the problem becomes one of finding a <script type="math/tex">q(\theta)</script> distribution which maximizes <script type="math/tex">L</script>!</p>
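<p>It is worth making one more observation explicit: since the KL divergence is always non-negative, dropping it from the equation turns the equality into a bound:</p>
<script type="math/tex; mode=display">lnp(x) \geq L</script>
<p>That is, <script type="math/tex">L</script> is a lower bound on the log evidence <script type="math/tex">lnp(x)</script>, which is why it is commonly called the “evidence lower bound” (ELBO). Maximizing <script type="math/tex">L</script> tightens this bound.</p>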
<h2 id="in-context">In Context</h2>
<p>Let’s consider the model in the context of a very interesting problem. Say we have a non-linear function (such as <script type="math/tex">sin(x), sinc(x)</script> or similar) and we would like to approximate this function via Bayesian Linear Regression. Approximating a non-linear function via a naive linear regression is not really feasible. However, if we expand the data into higher dimensions, it may be possible to learn a linear regression in the higher-dimensional space that corresponds to a non-linear function in the original dimension. This is what we will attempt in this problem.</p>
<p>We will project <script type="math/tex">x_1, \dots, x_n \in R^{2}</script> into <script type="math/tex">R^{n}</script>, by projecting each <script type="math/tex">x_i</script> into a vector of distances defined by the Gaussian kernel. In other words, every <script type="math/tex">x_i</script> will become a vector representing that point’s distance from every other point in the set – and <script type="math/tex">X</script>, the data, becomes an <script type="math/tex">n \times n</script> matrix of distances, with <script type="math/tex">X_{ii} = 1</script> and <script type="math/tex">X_{ij} \leq 1</script> for all <script type="math/tex">j \neq i</script>. Adding a column of ones to represent any intercept term, we can now interpret the problem as a regression. Our goal is then to find a coefficient vector <script type="math/tex">w</script> which, given the vector of distances, maps the point to its correct location in the original space, <script type="math/tex">R^{2}</script>.</p>
<p>This transformation is quite profound. We have many tools for working with linear functions (i.e. linear algebra), but fewer for working with non-linear functions. We found ourselves facing a problem, and rather than attempt to solve the problem using a limited toolset, we simply transformed the problem into one that we can approach skillfully. For any Ender’s Game fans out there, this is an “enemy’s gate is down” kind of moment. Anyway.</p>
<p>Complicating the problem further, we would like to encourage sparsity in <script type="math/tex">w</script> – in other words, we would like to identify a small subset of <script type="math/tex">X</script> which are sufficiently discriminative to allow us to correctly place the other points. We can think of these points as “vantage points”, and they are especially good at creating distance between points and illuminating their differences.</p>
<p>With that setup, we turn to the actual model, which looks like this:</p>
<script type="math/tex; mode=display">y_i \sim N(x_i^Tw, \lambda^{-1})</script>
<script type="math/tex; mode=display">w_i \sim N(0, diag(\alpha_1, ..., \alpha_d)^{-1})</script>
<script type="math/tex; mode=display">\lambda \sim Gamma(e_0, f_0)</script>
<script type="math/tex; mode=display">\alpha_k \sim Gamma(a_0, b_0)</script>
<p>The joint probability distribution is as follows:</p>
<script type="math/tex; mode=display">p(x,y,w, \lambda, \alpha)
= \prod_{i=1}^n p(y_i | x_i, w, \lambda)p(\lambda | e_0, f_0)p(w | \alpha)\prod_{k=1}^dp(\alpha_k| a_0, b_0)</script>
<p>And the log joint probability is as follows:</p>
<script type="math/tex; mode=display">lnp(x,y,w, \lambda, \alpha)
= \sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(\lambda | e_0, f_0) + lnp(w | \alpha) + \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)</script>
<p>We model the <script type="math/tex">q</script> distribution as a joint probability of independent distributions, one per variable:</p>
<script type="math/tex; mode=display">q(w, \lambda, \alpha) = q(w)q(\lambda)q(\alpha)</script>
<p>Now, with the model defined, we plug these values into the VI master equation we derived earlier:</p>
<script type="math/tex; mode=display">lnp(y, x) =
\int q(w)q(\lambda)q(\alpha) ln \frac{p(x,y,w, \lambda, \alpha)}{q(w)q(\lambda)q(\alpha)} dw d\lambda d\alpha
+ \int q(w)q(\lambda)q(\alpha) ln \frac{q(w)q(\lambda)q(\alpha)}{p(w, \lambda, \alpha | x, y)} dw d\lambda d\alpha</script>
<h2 id="learning-q">Learning q</h2>
<p>Recall, our goal is to maximize:</p>
<script type="math/tex; mode=display">L = \int q(w)q(\lambda)q(\alpha) ln \frac{p(x,y,w, \lambda, \alpha)}{q(w)q(\lambda)q(\alpha)} dw d\lambda d\alpha</script>
<p>We will do this by finding better values for <script type="math/tex">q(w)q(\lambda)q(\alpha)</script>. To show how this is done, let’s first consider <script type="math/tex">q(w)</script> in isolation (the process will be the same for each variable). First, we reorganize the equation to separate out all terms which are constant with respect to <script type="math/tex">w</script> (in other words, terms which won’t change regardless of <script type="math/tex">w</script>, and so don’t matter when maximizing the equation with respect to <script type="math/tex">w</script>):</p>
<script type="math/tex; mode=display">L = \int q(w) q(\lambda)q(\alpha) ln p(x,y,w, \lambda, \alpha) d\lambda d\alpha dw
- \int q(w) ln q(w) dw
- \text{const w.r.t } w</script>
<p>Observe next that we can interpret the first integral as an expectation, where <script type="math/tex">E_{-q(w)} = E_{q(\lambda)q(\alpha)}</script>, the expectation over all other variables.</p>
<script type="math/tex; mode=display">L = \int q(w) E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)] dw
- \int q(w) ln q(w) dw
- \text{const w.r.t } w</script>
<p>We will now pull off some slick math. First, observe that <script type="math/tex">ln e^x = x</script>. Now:</p>
<script type="math/tex; mode=display">L = \int q(w) ln \frac{e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}}{q(w)} dw
- \text{const w.r.t } w</script>
<p>This is looking an awful lot like our friend the KL divergence. If only the numerator were a probability distribution! Fortunately we can make it one, by introducing a new term <script type="math/tex">Z</script>:</p>
<script type="math/tex; mode=display">Z = \int e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]} dw</script>
<p>Here, Z can be interpreted as the normalizing constant for the distribution <script type="math/tex">e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}</script>. By adding and subtracting <script type="math/tex">lnZ</script>, which is constant with respect to <script type="math/tex">w</script>, we witness some more slick math:</p>
<script type="math/tex; mode=display">L = \int q(w) ln \frac{e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}}{q(w)} dw
- \text{const w.r.t } w + lnZ - lnZ</script>
<script type="math/tex; mode=display">L = \int q(w) ln \frac{\frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}}{q(w)} dw
- \text{const w.r.t } w</script>
<p>Where are we now? We have successfully transformed the integral into a KL divergence between <script type="math/tex">q(w)</script>, our distribution of interest, and <script type="math/tex">\frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}</script>, which is an expression involving terms we know. Specifically, we have:</p>
<script type="math/tex; mode=display">-KL(q(w)\|\frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]})</script>
<p>We want to maximize this expression. Since it is the <em>negative</em> of a KL divergence, maximizing it is equivalent to <em>minimizing</em> the KL divergence, which reaches its minimum of zero when the two distributions are equal. Therefore, we know that:</p>
<script type="math/tex; mode=display">q(w) = \frac{1}{Z}e^{E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]}</script>
<p>Sweet! Now we just need to evaluate the right-hand side. It’s worth pausing and noting how much fancy math it took to get us here. We relied on properties of logarithms, expectations, KL divergence, and the mechanics of probability distributions to derive this expression.</p>
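<p>The same derivation goes through for <script type="math/tex">q(\lambda)</script> and <script type="math/tex">q(\alpha)</script>. This is in fact an instance of the general mean-field result (a standard identity, noted here for context rather than derived in the post): for each factor <script type="math/tex">q_j \in \{q(w), q(\lambda), q(\alpha)\}</script>, the optimal choice satisfies</p>

```latex
q^*_j \propto e^{E_{-q_j}[lnp(x,y,w,\lambda,\alpha)]}
```

<p>with the expectation taken over all factors other than <script type="math/tex">q_j</script>.</p>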
<p>To actually evaluate this expression and figure out what <script type="math/tex">q(w)</script> should be, we’ll rewrite things to remove terms not involving <script type="math/tex">w</script>, by absorbing them in the normalizing constant. To see why this is the case, let’s first rewrite the expectation:</p>
<script type="math/tex; mode=display">E_{-q(w)}[ln p(x,y,w, \lambda, \alpha)]</script>
<p>Recalling the log joint probability we derived earlier:</p>
<script type="math/tex; mode=display">E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda)
+ lnp(\lambda | e_0, f_0)
+ lnp(w | \alpha)
+ \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)]</script>
<script type="math/tex; mode=display">E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] + \sum_{k=1}^d E_{q(\alpha_k)}[lnp(\alpha_k| a_0, b_0)]
+ E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]</script>
<p>Since expectation is linear, the terms not involving <script type="math/tex">w</script> separate out, and their expectations are constants with respect to <script type="math/tex">w</script>. Putting this back into context, we can rewrite the distribution:</p>
<script type="math/tex; mode=display">\frac{
e^{E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] + \sum_{k=1}^d E_{q(\alpha_k)}[lnp(\alpha_k| a_0, b_0)]}
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}
}{
\int e^{E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] + \sum_{k=1}^d E_{q(\alpha_k)}[lnp(\alpha_k| a_0, b_0)]}
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]} dw
}</script>
<p>Bringing outside of the integral all terms constant with respect to <script type="math/tex">w</script>:</p>
<script type="math/tex; mode=display">\frac{
e^{E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] + \sum_{k=1}^d E_{q(\alpha_k)}[lnp(\alpha_k| a_0, b_0)]}
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}
}{
e^{E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] + \sum_{k=1}^d E_{q(\alpha_k)}[lnp(\alpha_k| a_0, b_0)]}
\int e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]} dw
}</script>
<p>And then cancelling:</p>
<script type="math/tex; mode=display">\frac{
e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}
}{
\int e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]} dw
}</script>
<p>All that is left to do is to evaluate the expression</p>
<script type="math/tex; mode=display">e^{E_{-q(w)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(w | \alpha)]}</script>
<p>to learn the distribution. We will not go through the specific derivation here, which involves evaluating the expectation of the log of the distributions on <script type="math/tex">y_i</script> and <script type="math/tex">w</script>; instead we will skip to the final result and claim that:</p>
<script type="math/tex; mode=display">q(w) \sim N(\mu, \Sigma)</script>
<p>With:</p>
<script type="math/tex; mode=display">\Sigma = (E_{q(\alpha)}[diag(\alpha)] + E_{q(\lambda)}[\lambda] \sum_{i=1}^n x_i x_i^T)^{-1}</script>
<script type="math/tex; mode=display">\mu = \Sigma(E_{q(\lambda)}[\lambda] \sum_{i=1}^n y_i x_i)</script>
<p>The <strong>key</strong> observation to make here is that <script type="math/tex">q(w)</script> involves the expected values of the <em>other</em> model variables. This will be true for the other variables as well. To show this, here are <script type="math/tex">q(\lambda)</script> and <script type="math/tex">q(\alpha)</script>:</p>
<script type="math/tex; mode=display">q(\lambda) \sim Gamma(e, f)</script>
<script type="math/tex; mode=display">e = e_0 + \frac{n}{2}</script>
<script type="math/tex; mode=display">f = f_0 + \frac{1}{2} \sum_{i=1}^n [(y_i - E_{q(w)}[w]^T x_i)^2 + x_i^T Var_{q(w)}[w] x_i]</script>
<script type="math/tex; mode=display">q(\alpha_k) \sim Gamma(a, b_k)</script>
<script type="math/tex; mode=display">a = a_0 + \frac{1}{2}</script>
<script type="math/tex; mode=display">b_k = b_0 + \frac{1}{2} E_{q(w)}[ww^T]_{kk}</script>
<p>Finally, we give the expectations:</p>
<script type="math/tex; mode=display">E_{q(w)}[w] = \mu</script>
<script type="math/tex; mode=display">Var_{q(w)}[w] = \Sigma</script>
<script type="math/tex; mode=display">E_{q(w)}[ww^T] = \Sigma + \mu\mu^T</script>
<script type="math/tex; mode=display">E_{q(\lambda)}[\lambda] = \frac{e}{f}</script>
<script type="math/tex; mode=display">E_{q(\alpha_k)}[\alpha_k] = \frac{a}{b_k}</script>
<p>Now we are prepared to discuss the value of this technique. Note how each <script type="math/tex">q()</script> distribution is a function of the expectations of the other random variables, <em>with respect to their <script type="math/tex">q()</script> distributions</em>. This means that as one distribution changes, the others change… causing the first to change, causing the others to change, over and over again in a loop. The insight is that each change brings the <script type="math/tex">q()</script> distributions closer to the true posterior that we are trying to approximate. In other words, each iteration through this update loop gives us a better set of <script type="math/tex">q()</script>, as improved values for one give improved values for the others. Also note that since we have solved for the various <script type="math/tex">q()</script> distributions solely in terms of the data <script type="math/tex">y_i, x_i</script> and the expectations <script type="math/tex">E[w], E[\lambda], E[\alpha]</script>, we can implement the algorithm efficiently using only basic arithmetic operations, without having to do any calculus or derive anything!</p>
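<p>The full update loop can be sketched in a few lines of numpy (a minimal illustration, not the post’s actual code; the function name, the fixed iteration count, and the default hyperparameters – including tiny <script type="math/tex">a_0, b_0</script> priors, which encourage sparsity – are my own choices):</p>

```python
import numpy as np

def vi_updates(X, y, a0=1e-16, b0=1e-16, e0=1.0, f0=1.0, iters=100):
    """Coordinate-ascent VI updates for the sparse Bayesian regression model."""
    n, d = X.shape
    E_alpha = np.full(d, a0 / b0)                 # E[alpha_k] = a / b_k
    E_lam = e0 / f0                               # E[lambda] = e / f
    for _ in range(iters):
        # q(w) = N(mu, Sigma)
        Sigma = np.linalg.inv(np.diag(E_alpha) + E_lam * X.T @ X)
        mu = Sigma @ (E_lam * X.T @ y)
        # q(lambda) = Gamma(e, f)
        e = e0 + n / 2.0
        resid = y - X @ mu
        f = f0 + 0.5 * (resid @ resid + np.sum((X @ Sigma) * X))
        E_lam = e / f
        # q(alpha_k) = Gamma(a, b_k), using E[w w^T]_kk = Sigma_kk + mu_k^2
        a = a0 + 0.5
        b = b0 + 0.5 * (np.diag(Sigma) + mu ** 2)
        E_alpha = a / b
    return mu, Sigma, (e, f), (a, b)
```

<p>Each pass recomputes every <script type="math/tex">q()</script> from the current expectations of the others – exactly the mutual-update loop described above.</p>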
<h2 id="assessing-convergence">Assessing Convergence</h2>
<p>The last thing we will need to do is discuss the process for assessing convergence – calculating how much and how quickly our <script type="math/tex">q</script> distributions are closing in on the true posteriors. To do this, we will have to evaluate the entire <script type="math/tex">L</script> equation using the new <script type="math/tex">q</script> distributions. Recall the equation:</p>
<script type="math/tex; mode=display">L = \int q(w)q(\lambda)q(\alpha) ln \frac{p(x,y,w, \lambda, \alpha)}{q(w)q(\lambda)q(\alpha)} dw d\lambda d\alpha</script>
<p>Which can be written as follows:</p>
<script type="math/tex; mode=display">L = \int q(w)q(\lambda)q(\alpha) ln p(x,y,w, \lambda, \alpha) dw d\lambda d\alpha
- \int q(w) ln q(w) dw
- \int q(\lambda) ln q(\lambda) d\lambda
- \int q(\alpha) ln q(\alpha) d\alpha</script>
<p>And interpreted as a sum of expectations:</p>
<script type="math/tex; mode=display">L = E_{q(w, \lambda, \alpha)}[ln p(x,y,w, \lambda, \alpha)]
- E_{q(w)}[ln q(w)]
- E_{q(\lambda)}[ln q(\lambda)]
- E_{q(\alpha)}[ln q(\alpha)]</script>
<script type="math/tex; mode=display">L = E_{q(w, \lambda, \alpha)}[\sum_{i=1}^n lnp(y_i | x_i, w, \lambda) + lnp(\lambda | e_0, f_0) + lnp(w | \alpha) + \sum_{k=1}^d lnp(\alpha_k| a_0, b_0)]
- E_{q(w)}[ln q(w)]
- E_{q(\lambda)}[ln q(\lambda)]
- E_{q(\alpha)}[ln q(\alpha)]</script>
<p>Which breaks down as follows:</p>
<script type="math/tex; mode=display">L =
\sum_{i=1}^n E_{q(w, \lambda)}[lnp(y_i | x_i, w, \lambda)]
+ E_{q(\lambda)}[lnp(\lambda | e_0, f_0)]
+ E_{q(w, \alpha)}[lnp(w | \alpha)]
+ \sum_{k=1}^d E_{q(\alpha)}[lnp(\alpha_k| a_0, b_0)]
- E_{q(w)}[ln q(w)]
- E_{q(\lambda)}[ln q(\lambda)]
- E_{q(\alpha)}[ln q(\alpha)]</script>
<p>There is an important subtlety in evaluating these expectations. To understand it, let’s look at two of the terms:</p>
<script type="math/tex; mode=display">E_{q(\lambda)}[lnp(\lambda | e_0, f_0)] - E_{q(\lambda)}[ln q(\lambda)]</script>
<p>Observe how both are expectations over <script type="math/tex">q(\lambda)</script>; however, the probability distributions inside them are not the same. To see how this works out, let’s write out the actual log probabilities (both distributions are Gamma).</p>
<script type="math/tex; mode=display">E_{q(\lambda)}[lnp(\lambda | e_0, f_0)]
= E_{q(\lambda)}[(e_0lnf_0 - ln\Gamma(e_0)) + (e_0 - 1)ln\lambda - f_0\lambda]</script>
<script type="math/tex; mode=display">E_{q(\lambda)}[ln q(\lambda)]
= E_{q(\lambda)}[(e lnf - ln\Gamma(e)) + (e - 1)ln\lambda - f\lambda]</script>
<p>Now, passing the expectations through:</p>
<script type="math/tex; mode=display">(e_0lnf_0 - ln\Gamma(e_0)) + (e_0 - 1)E_{q(\lambda)}[ln\lambda] - f_0E_{q(\lambda)}[\lambda]</script>
<script type="math/tex; mode=display">(e lnf - ln\Gamma(e))+ (e - 1)E_{q(\lambda)}[ln\lambda] - fE_{q(\lambda)}[\lambda]</script>
<p>Writing in terms of the difference (note that the sign changes in the second expectation):</p>
<script type="math/tex; mode=display">(e_0lnf_0 - ln\Gamma(e_0)) + (e_0 - 1)E_{q(\lambda)}[ln\lambda] - f_0E_{q(\lambda)}[\lambda]
- (e lnf - ln\Gamma(e)) - (e - 1)E_{q(\lambda)}[ln\lambda] + fE_{q(\lambda)}[\lambda]</script>
<p>And combining terms:</p>
<script type="math/tex; mode=display">(e_0lnf_0 - ln\Gamma(e_0)) - (e lnf - ln\Gamma(e))
+ (e_0 - e)E_{q(\lambda)}[ln\lambda] - (f_0 - f)E_{q(\lambda)}[\lambda]</script>
<p>Notice how in both lines, <script type="math/tex">E_{q(\lambda)}[\lambda]</script> is the same quantity. This is because <script type="math/tex">E_{q(\lambda)}[\lambda]</script> is a function of the distribution with respect to which we are taking the expectation, <script type="math/tex">q(\lambda)</script>. Therefore, even though the parameters of <script type="math/tex">p(\lambda)</script> don’t change (they remain <script type="math/tex">e_0, f_0</script> always), <script type="math/tex">p(\lambda)</script> evaluates to a different result as <script type="math/tex">q(\lambda)</script> changes. For <script type="math/tex">q(\lambda)</script>, on the other hand, we always use the latest values of <script type="math/tex">e, f</script>.</p>
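<p>For reference, the two Gamma expectations involved are standard results: for <script type="math/tex">q(\lambda) = Gamma(e, f)</script> in the shape–rate parameterization, <script type="math/tex">E_{q(\lambda)}[\lambda] = e/f</script> (as given above) and <script type="math/tex">E_{q(\lambda)}[ln\lambda] = \psi(e) - lnf</script>, with <script type="math/tex">\psi</script> the digamma function. A sketch of the combined term (using scipy; the function name is my own):</p>

```python
import numpy as np
from scipy.special import digamma, gammaln

def gamma_elbo_term(e0, f0, e, f):
    """E_q[ln p(lambda | e0, f0)] - E_q[ln q(lambda)], where q = Gamma(e, f)."""
    E_lam = e / f                        # E_q[lambda]
    E_ln_lam = digamma(e) - np.log(f)    # E_q[ln lambda]
    return ((e0 * np.log(f0) - gammaln(e0)) - (e * np.log(f) - gammaln(e))
            + (e0 - e) * E_ln_lam - (f0 - f) * E_lam)
```

<p>This quantity is exactly <script type="math/tex">-KL(q(\lambda)\|p(\lambda))</script>: it is never positive, and equals zero only when <script type="math/tex">e = e_0, f = f_0</script>.</p>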
<p>At first I found it counterintuitive that <script type="math/tex">e_0, f_0</script> should be constant through every iteration – the Bayesian insight is that priors are constantly being updated as information comes in. The reason why, in this case, <script type="math/tex">e_0, f_0</script> are constant (and this is true for the priors on the other distributions as well) is that the entire Variational Inference algorithm is meant to approximate a <strong>single</strong> Bayesian update. Thus, it is wrong to interpret the <script type="math/tex">q()</script> distribution learned from iteration <script type="math/tex">t</script> as the new prior on the model variables for iteration <script type="math/tex">t+1</script>. In the context of the single update (regardless of current iteration), <script type="math/tex">p()</script> is always the initial prior, and <script type="math/tex">q()</script> is the best posterior-so-far. The VI concept is that we can iteratively improve the posteriors <script type="math/tex">q()</script>, but always in the context of a <em>single</em> Bayesian update.</p>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>We have presented Variational Inference, in a hopefully accessible manner. It is a very slick technique that I am excited to continue to gain skill in applying. VI is the inference technique which underlies <a href="http://www.columbia.edu/~jwp2128/Teaching/E6892/papers/LDA.pdf">Latent Dirichlet Allocation</a>, a very popular learning algorithm developed by David Blei (now at Columbia!), Andrew Ng, and Michael Jordan, all Machine Learning heavyweights, while the former was at Cal (Go Bears)!</p>
<p>Earlier, we mentioned that we wanted to encourage sparsity in <script type="math/tex">w</script>. This can be accomplished (so I am told) by setting the priors on <script type="math/tex">\alpha_k</script> to <script type="math/tex">a_0, b_{0k} = 10^{-16}</script>. Tiny priors here will limit the dimensions of <script type="math/tex">w</script> which are significantly non-zero. I’m not entirely sure why (something I still need to look into), but I can assure you that my model was super sparse :).</p>
<p>Variational Inference is a fairly sophisticated technique (the most complex algorithm I have encountered, but that might not count for much), and allows for the formal definition and learning of complex posteriors otherwise intractable using normal Bayesian methods.</p>
Sun, 22 Nov 2015 00:00:00 +0000
http://kronosapiens.github.io/blog/2015/11/22/understanding-variational-inference.html