{"id":264,"date":"2013-06-11T11:43:11","date_gmt":"2013-06-11T19:43:11","guid":{"rendered":"http:\/\/blog.richardkiss.com\/?p=264"},"modified":"2021-02-05T17:30:13","modified_gmt":"2021-02-06T01:30:13","slug":"how-to-be-minimally-redundant-or-a-splitting-headache","status":"publish","type":"post","link":"https:\/\/blog.richardkiss.com\/?p=264","title":{"rendered":"How to be Minimally Redundant (or &#8220;A Splitting Headache&#8221;)"},"content":{"rendered":"<p>I&#8217;ve recently been investigating the command-line utility <a href=\"https:\/\/github.com\/richardkiss\/py-zfec\">zfec<\/a>, which is a lot like the UNIX &#8220;split&#8221; utility, except, using a technique called <a href=\"http:\/\/en.wikipedia.org\/wiki\/Erasure_code\">erasure coding<\/a>, you can choose to split your file into M pieces, any K of which are enough to recreate the file using the complimentary [sic] zunfec command.<\/p>\n<p>Concrete example: a 100K file can be split into 10 (or more) pieces, each just over 25K long, and zunfec can recreate the original from any 4 of them.<\/p>\n<p><strong>Any<\/strong> 4. You might expect this to incur a lot of overhead in each piece, but it turns out it&#8217;s just a header of four bytes or less. Pretty much the same as cutting the file into pieces.<\/p>\n<p>This amazed me. How could this possible work? How can you split data into M pieces so that any K of them is enough to reconstruct them? Linear algebra to the rescue!<\/p>\n<p>Suppose we want to encode a twelve character text string into a bunch of arrays, each with four values, any three of which are sufficient to reconstruct the original. Let&#8217;s use the string &#8220;Lovefromdawn&#8221;. Here&#8217;s what you do:<\/p>\n<p><a href=\"http:\/\/blog.richardkiss.com\/wp-content\/uploads\/2013\/06\/fig1.gif\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/blog.richardkiss.com\/wp-content\/uploads\/2013\/06\/fig1.gif\" alt=\"Figure 1\" width=\"661\" height=\"131\" class=\"alignnone size-full wp-image-302\" \/><\/a><\/p>\n<p>The matrix furthest to the left is a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Vandermonde_matrix\">Vandermonde matrix<\/a>, which is a matrix where each row forms a geometric series starting with 1. A square Vandermonde matrix has the special property that as long as the second column has no repeated elements, the matrix is invertible (and in fact, there is a special algorithm to invert it quickly).<\/p>\n<p>The split pieces correspond to rows of the matrix on the right. Choose any three of them; let&#8217;s say row 1, 2 and 4. Once we know what row numbers they are (and this explains the tiny header), we get the following system of equations:<\/p>\n<p><a href=\"http:\/\/blog.richardkiss.com\/wp-content\/uploads\/2013\/06\/fig2.gif\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/blog.richardkiss.com\/wp-content\/uploads\/2013\/06\/fig2.gif\" alt=\"Figure 2\" width=\"560\" height=\"61\" class=\"alignnone size-full wp-image-303\" \/><\/a><\/p>\n<p>We know the leftmost matrix is invertible, so multiply both sides by the inverse, and we solve for X, recovering &#8220;Lovefromdawn&#8221;. Wowza!<\/p>\n<p>But wait! One problem remains: if we use standard integer arithmetic, the matrix on the right ends up with a lot of values larger than 255, and we can&#8217;t store it in a byte array. That ain&#8217;t no good.<\/p>\n<p>Luckily, it turns out that most theorems dealing with matrices and invertibility only assume we are operating over an arbitrary <a href=\"http:\/\/en.wikipedia.org\/wiki\/Field_(mathematics)\">field<\/a>, a mathematical structure that has addition, a 0, multiplication, a 1, and a reciprocal for every non-0 element.<\/p>\n<p>Abstract algebra to the rescue! (Math seems to rescue computer science a lot.) It turns out that there is a field with 256 elements known as <a href=\"http:\/\/en.wikipedia.org\/wiki\/Finite_field_arithmetic\">GF(2**8)<\/a>. In this field, 0 is the 0, 1 is the 1 (surprise!), addition is bitwise exclusive-or (so every number is its own additive inverse!), and multiplication is a weird, and seemingly unpredictable beast, that is best calculated by pre-populating a table of logarithms and exponents, then creating a 256&#215;256 array containing the multiplication table (since each number fits in a byte, the whole table only takes 64K).<\/p>\n<p>[There are actually lots of ways to define multiplication that still meets the requirements of a field, but we have to pick one by defining what&#8217;s called the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Irreducible_polynomial\">irreducible polynomial<\/a> for the field.]<\/p>\n<p>Once we operate over this field, all matrix values are guaranteed to be in the range [0, 255], and our method for bytewise encoding and decoding becomes easy.<\/p>\n<p>One more aside: zfec actually piles the 3&#215;3 (well, KxK) identity matrix on top of the Vandermonde matrix, so the first K pieces are exactly what you&#8217;d get by cutting the file into K pieces in the obvious way and putting the tiny header in front. Even though our matrix is no longer a Vandermonde matrix, submatrices are still guaranteed to be invertible and it still works.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve recently been investigating the command-line utility zfec, which is a lot like the UNIX &#8220;split&#8221; utility, except, using a technique called erasure coding, you can choose to split your file into M pieces, any K of which are enough to recreate the file using the complimentary [sic] zunfec command. Concrete example: a 100K file &hellip; <a href=\"https:\/\/blog.richardkiss.com\/?p=264\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">How to be Minimally Redundant (or &#8220;A Splitting Headache&#8221;)<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-264","post","type-post","status-publish","format-standard","hentry","category-computers"],"_links":{"self":[{"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=\/wp\/v2\/posts\/264","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=264"}],"version-history":[{"count":44,"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=\/wp\/v2\/posts\/264\/revisions"}],"predecessor-version":[{"id":307,"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=\/wp\/v2\/posts\/264\/revisions\/307"}],"wp:attachment":[{"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=264"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=264"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.richardkiss.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=264"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}