Problem
Given K bytes (message), compute N bytes (encoded message or codeword) such that the original K bytes can be recovered from any K subset of the N bytes. Assume we are only dealing with missing data, not corrupted data.
Motivation
Why is this useful? There are many practical applications of for this. It can be used to build a data storage system that tolerates N-K storage node failures with only N/K times the space. Over unreliable messaging systems, a message can be sent in N parts and then can be recovered by the receiver if any K out of N parts are received. The new Signal protocol proposes to do that in parts of the protocol.
How?
Conceptually, we can fit a curve through K points with a unique polynomial of degree less than K. Pick another N-K points along the curve. Together, we have N points on the same curve and any K points can be used to recalculate the same curve and recover any of the missing original K points.
Let's pick the output of the polynomial at x = 1 to K to be the K bytes in the message. The encoded message is the output of the polynomial at x = 1 to N. The first K bytes of the encoded message will be identical to the message (systematic). This is a useful property. When bytes 1 to K are not missing, no work needs to be done. If any of these bytes are missing, recompute only those bytes.
One way of doing this does not require explicit calculation of the coefficients of the polynomial directly. With K points (x and y values) on the curve, use Lagrange Interpolation to calculate the value of the polynomial at any x value.
Let's turn that into working code in plain Python:
def lagrange_interpolate(xs, ys):
    def f(x):
        n = len(xs)
        y = 0.0
        for i in range(n):
            t = ys[i]
            for j in range(n):
                if i != j:
                    t *= (x - xs[j]) / (xs[i] - xs[j])
            y += t
        return y
    assert len(xs) == len(ys)
    return f
K = 4
N = 6
a = list(range(1, K+1))
b = [82, 233, 217, 241] # message
poly = lagrange_interpolate(a, b) # polynomial
c = list(range(1, N+1))
d = [poly(x) for x in c] # encoded message
assert b == d[:K] # first K bytes are identical to message
e = c.copy()
f = d.copy()
for i in [3, 1]: # drop some data
    e.pop(i)
    f.pop(i)
poly2 = lagrange_interpolate(e[:K], f[:K]) # recalculate polynomial
g = list(range(1, K+1))
h = [round(poly2(x)) for x in g] # recovered original message
assert b == h, "Recovered message is not equal the original message"
if 1: # plot the data               
    import matplotlib.pyplot as plt
    cc = [x/10 for x in range(0, 100*N)]
    dd = [poly(x) for x in cc] # for plotting                                                                                                                                                                   gg = [x/10 for x in range(0, 100*N+1)]
    hh = [poly2(x) for x in gg] # for plotting                                                                                                                                                                   fig, axs = plt.subplots(4)
    axs[0].scatter(a, b, color="blue")
    for i in range(len(a)):
        axs[0].text(a[i], b[i], b[i])
    axs[1].scatter(cc, dd, color="pink")
    axs[1].scatter(a, b, color="blue")
    axs[1].scatter(c, d, color="red")
    for i in range(len(c)):
        axs[1].text(c[i], d[i], round(d[i]))
    axs[2].scatter(cc, dd, color="pink")
    axs[2].scatter(e, f, color="green")
    for i in range(len(e)):
        axs[2].text(e[i], f[i], round(f[i]))
    axs[3].scatter(gg, hh, color="yellow")
    axs[3].scatter(g, h, color="brown")
    for i in range(len(g)):
        axs[3].text(g[i], h[i], h[i])
    axs[0].set_title("Message")
    axs[1].set_title("Encoded message")
    axs[2].set_title("Encoded message with dropped data")
    axs[3].set_title("Recovered message")
    for ax in axs:
        ax.set_ylim([-500, 1500])
        ax.set_xlim([ 0, N+1])
        if ax != axs[-1]:
            ax.get_xaxis().set_visible(False)
    plt.subplots_adjust(hspace=0.5)
    plt.show()import random
# GF(2**8)
log_table = [0]*256 # zeroth entry is invalid
antilog_table = [0]*255 # yes, only 255 entries
def gf256_add(x, y):
    return x^y
gf256_sub = gf256_add
def gf256_smul(x, y): # long multiplication
    r = 0
    while y>0:
        if y&1:
            r ^= x # add
        if x&0x80:
            x = (x<<1)^0x11b # reduce by polynomial                                                                       
        else:
            x <<= 1
        y >>= 1
    return r
def init_tables():
    z = 1
    for x in range(255):
        antilog_table[x] = z
        log_table[z] = x
        z =  gf256_smul(z, 3)
def gf256_div(x, y):
    assert y != 0
    if x == 0:
        return 0
    return antilog_table[(log_table[x] - log_table[y])]
def gf256_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return antilog_table[(log_table[x] + log_table[y]) % 255]
init_tables()
def lagrange_interpolate(xs, ys):
    def f(x):
        n = len(xs)
        y = 0
        for i in range(n):
            t = ys[i]
            for j in range(n):
                if i != j:
                    t = gf256_mul(t,
                                  gf256_div(gf256_sub(x, xs[j]),
                                            gf256_sub(xs[i], xs[j])))
            y = gf256_add(y, t)
        return y
    assert len(xs) == len(ys)
    return f
K = 20 # message length
N = 30 # encoded message length (message + redundancy) (<=255)
random.seed(324)
# a: 1..K. Fixed.
# b: message (K bytes)
a = list(range(1, K+1))
b = [random.randint(0, 255) for x in range(K)]
# c: 1..N. Fixed.
# d: encoded message (N bytes)
poly = lagrange_interpolate(a, b)
c = list(range(1, N+1))
d = [poly(x) for x in c]
assert b == d[:K] # first K bytes are identical to message
# e, f: simulate data loss by dropping random parts of c & d                                                      
e = c.copy()
f = d.copy()
while len(e) > K:
    i = random.randint(0, len(e)-1)
    e.pop(i)
    f.pop(i)
# g: 1..K. Fixed.                                                                 
# h: recovered message (K bytes)                                                               
assert(len(e) >= K)
poly2 = lagrange_interpolate(e[:K], f[:K])
g = list(range(1, K+1))
h = [poly2(x) for x in g]
assert b == h, "Recovered message is not equal the original message"
if 1:
    import matplotlib.pyplot as plt
    fig, axs = plt.subplots(4)
    axs[0].scatter(a, b, color="blue")
    axs[1].scatter(c, d, color="red")
    axs[2].scatter(e, f, color="green")
    axs[3].scatter(g, h, color="brown")
    axs[0].set_title("Message")
    axs[1].set_title("Encoded message")
    axs[2].set_title("Encoded message with dropped data")
    axs[3].set_title("Recovered message")
    for ax in axs:
        ax.set_ylim([0, 255])
        ax.set_xlim([ 0, N+1])
        if ax != axs[-1]:
            ax.get_xaxis().set_visible(False)
    plt.subplots_adjust(hspace=0.5)
    plt.show()If you want to know how Galois field (2**8) math works, find an Abstract Algebra textbook or read section 2 of the Rijndael AES paper.
