__CTF Solution__

__Some approximate solutions__

Most of the challengers treated logreg solvers (e.g. scikit-learn's logreg) as a blackbox oracle and used a stochastic approach to guess the missing dataset row, trying to minimize the distance

_‖ 𝚕𝚘𝚐𝚛𝚎𝚐(guessed dataset) − provided model ‖_

After trying many small modifications and iterating, they ended up with a candidate vector whose retrained model was within $10^{-5}$ of the provided one:

`xn = [3, 14, 4, -3, -13, 11, -5, -16, 16, 16, -12, 0, 9, 1, -11, -14]`, `yn = 0`

But then, how to verify that this empirical result was really the golden solution? Enumerating the space of $32^{16}$ possibilities would be far too slow. Also, the distance ($\leq 0.5 \cdot 10^{-5}$) was small, but not zero: was there a slim chance that some other logreg solver would yield an even closer solution?

I bet some of the challengers got a real headache over all these questions. Some may have even thought that the whole problem was ill-defined! Or that the only way of verifying the solution was to ask the CTF organizers :)

It is now time to straighten out the whole story!

__The real solution__

![formula](images/image2.png)

![formula](images/image3.png)

```python
import numpy as np
import pandas as pd

k = 16  # number of features V1..V16

# Load the trained model from csv
trained_model_data = pd.read_csv("trained_LR_model.csv", index_col=False)
theta_best_bar = trained_model_data[[f"theta_{i}" for i in range(k+1)]].to_numpy().reshape(-1)
print("theta_best_bar: %s" % theta_best_bar)

# Load the attacker's knowledge of the N-1 points
df = pd.read_csv("attacker_knowledge.csv", index_col=False)
X = df[[f"V{i}" for i in range(1, k+1)]].to_numpy()
y = df["target"].to_numpy()
print("dimensions of X: %s x %s" % X.shape)
print("dimension of y: %s" % y.shape)

# Helper function
sigmoid = lambda z: 1/(1+np.exp(-z))

# Gradient of logreg over the (partial) dataset X, y at theta_bar
def gradient_log_reg(X, y, theta_bar):
    """Gradient of the logistic regression loss function (regularization=1).

    Args:
        X: Features.
        y: Labels.
        theta_bar: Weights (with the intercept coefficient prepended).

    Returns:
        The gradient evaluated at the given values.
    """
    theta = theta_bar[1:]
    X_bar = np.hstack((np.ones((len(X), 1)), X))
    y_hat = sigmoid(X_bar @ theta_bar)
    return (y_hat - y) @ X_bar + np.hstack(([0], theta))

# The known vector of the awesome equality of the CTF
# is the opposite of the partial gradient.
known_vector = -gradient_log_reg(X, y, theta_best_bar)
print('This is the known vector: %s' % known_vector)
```

Let's recall the awesome equality (in red the unknowns, on the right what we just computed):

![formula](images/image4.png)

The first coordinate reveals the value of 𝛼, and the other 16 coordinates (divided by 𝛼) reveal the missing datapoint 𝑥𝑁!

```python
alpha = known_vector[0]
recovered_xn = [known_vector[i+1]/alpha for i in range(k)]
print("alpha: %s" % alpha)
print("recovered x_500: %s" % recovered_xn)
```

![formula](images/image5.png)

```python
final_xn = np.round(recovered_xn)
print("final_xn: %s" % final_xn)
```

final_xn: [ 3. 14. 4. -3. -13. 11. -5. -16. 16. 16. -12. 0. 9. 1. -11. -14.]

We can now safely rest assured that there is no other solution in the whole search space!

Finally, since 𝛼 in the awesome equality is equal to sigmoid(something) − 𝑦𝑁, and sigmoid yields values in the open interval (0, 1), we deduce that 𝛼 is positive if 𝑦𝑁 is 0, and negative if 𝑦𝑁 is 1. So the sign of 𝛼 discloses the class of the missing sample.

```python
final_yn = 1 if alpha < 0 else 0
print("final yn: %s" % final_yn)
```

And this concludes this CTF exercise!
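If you want an extra empirical confirmation on top of the exact argument, one option is to retrain on the completed dataset and compare the coefficients with the provided ones. The sketch below reuses `X`, `y`, `final_xn`, `final_yn` and `theta_best_bar` from the snippets above, and assumes the target model came from scikit-learn's `LogisticRegression` with an L2 penalty and `C=1` (consistent with the regularization=1 gradient); a different training setup would need a different fit call.

```python
# Sanity check (sketch): retrain on the completed dataset and compare the
# coefficients with the provided model. Assumes an L2-penalized scikit-learn
# LogisticRegression with C=1; adapt if the model was trained differently.
from sklearn.linear_model import LogisticRegression

X_full = np.vstack([X, final_xn])   # append the recovered row
y_full = np.append(y, final_yn)     # and its recovered label

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=10_000).fit(X_full, y_full)
theta_bar_check = np.hstack([clf.intercept_, clf.coef_.ravel()])

print("max coefficient gap: %.2e" % np.max(np.abs(theta_bar_check - theta_best_bar)))
```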
On a side note, you also understand why it was important to provide 𝜃 with at least 5 decimal digits of precision. If we had decided to add some differential privacy noise (say 2% noise on the coordinates of 𝜃), the same problem would have had a ton of potential solutions, because the error per recovered coordinate would have been much larger than 16 instead of < 0.5. In other words, adding even a small amount of differential privacy noise would have made this reconstruction attack fail, a general principle that is good to keep in mind!
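To see this concretely, here is a small sketch that perturbs 𝜃 and re-runs the exact same reconstruction. It reuses `theta_best_bar`, `X`, `y` and `gradient_log_reg` from the snippets above; the 2% multiplicative Gaussian noise is just one simple way to model the perturbation.

```python
# Illustration (sketch): perturb theta by roughly 2% per coordinate and re-run
# the reconstruction. The recovered coordinates drift away from integers, so
# rounding no longer pins down a unique dataset row.
rng = np.random.default_rng(0)
noisy_theta_bar = theta_best_bar * (1 + 0.02 * rng.standard_normal(theta_best_bar.shape))

noisy_known_vector = -gradient_log_reg(X, y, noisy_theta_bar)
noisy_alpha = noisy_known_vector[0]
noisy_xn = noisy_known_vector[1:] / noisy_alpha

print("noisy recovered x_500: %s" % noisy_xn)
print("max distance to the nearest integer: %.3f"
      % np.max(np.abs(noisy_xn - np.round(noisy_xn))))
```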