First things first - naming. It turns out it's a bit... problematic. The regression model I'm going to implement using a neural network could be named: a **Multilayer Perceptron (MLP)**, an **Artificial Neural Network (ANN)**, or a **Deep Neural Network (DNN)**.

In terms of naming, it's pretty much all the same - each name refers to a neural network with many layers of neurons, and that's what makes it "deep". Not sure why "deep learning" is so fancy and sounds better than "shallow learning"... But, hey! That's IT - everything is hype-driven! 😉 A few years ago we had a new BEST JS framework every week; now (2024) deep learning/AI and layoffs are trendy!

Okay, enough digressions. Let's briefly take a look at what the neural network looks like.

Usually, neural network graphs are simplified to circles and lines... but I wanted to show you the anatomy of a neuron, so I created my own graph. So, starting from the left: we have **input** with three features: x₁, x₂, and x₃.

All features are passed to **neurons** (purple circles) in *Layer 0*. The layers between input and output are called **hidden layers** because they're not "visible" to the user. Typically, a neural network is built of many hidden layers. *Layer 0* contains two neurons.

Each neuron takes the features and multiplies them by **weights** (blue circles), which are summed (yellow square) with a **bias** (red circle). Then, the result is passed to the **activation function** (green box). The output of the neuron is passed to the next neuron - in this case, to the neuron in *Layer 1*.

*Layer 1* in the example above is an **output layer**, since its output is the end result. The neuron in *Layer 1* takes outputs from the neurons from *Layer 0*, multiplies them by weights, sums with bias, passes the result to the activation function, and... we have it!

Each neuron has its own weights, bias, and activation function. Weights and bias are usually floats. The activation function is a math function. Its purpose is to introduce non-linearity, so the model can solve complex problems.

There are many activation functions in the Nx library. The most common are *relu*, *sigmoid*, *linear*, and *softmax*. Linear and sigmoid sound familiar, don't they? Interestingly, if you apply linear as the activation function for all layers, you end up with... a linear function - a bad idea, as it defeats the purpose of stacking layers. The most commonly used activation function for hidden layers is relu (Rectified Linear Unit).
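The shapes of these functions are easy to sketch in plain Elixir with scalar toy versions (the `Activations` module name is mine, for illustration only - in practice you'd use the tensor versions such as `Nx.sigmoid/1`):

```elixir
# Scalar sketches of common activation functions (illustration only;
# real models apply the Nx tensor versions).
defmodule Activations do
  # relu: pass positives through, clamp negatives to 0
  def relu(x), do: max(0.0, x)

  # sigmoid: squeeze any number into the (0, 1) range
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # linear: identity - stacking only linear layers stays linear
  def linear(x), do: x
end

Activations.relu(-2.5)   # => 0.0
Activations.sigmoid(0.0) # => 0.5
```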

The choice of activation function is important, especially for the output layer. Here are my recommendations for some common ML problems.

| ML Problem | Output Activation Function |
| --- | --- |
| Binary Classification | sigmoid |
| Multiclass Classification | softmax |
| Regression | linear |

Neural Networks are **supervised** machine learning methods, which means they learn in pretty much the same way as linear or logistic regression. During the training process, the model adjusts the weights and biases of each neuron to minimize the cost function.

I think we've covered the theory and have a solid enough background to get our hands dirty and implement the MLP model using Axon.

The plan is to create a regression neural network model predicting miles per gallon (MPG) for given car features. It's exactly the same task as in the Linear Regression with Elixir and Nx article. But this time, instead of implementing everything from scratch, I'll use a dedicated library - Axon.

I'll reuse the data loading and processing function from the linear regression post. The dataset looks like this:

```elixir
# {
#   [passedemissions, cylinders, horsepower, displacement, weight, acceleration, modelyear],
#   [mpg]
# }
[
  {[0, 8, 130, 307.0, 1.752, 12.0, 70], [18.0]},
  {[0, 8, 165, 350.0, 1.8465, 11.5, 70], [15.0]},
  {[0, 8, 150, 318.0, 1.718, 11.0, 70], [18.0]},
  {[0, 8, 150, 304.0, 1.7165, 12.0, 70], [16.0]},
  {[0, 8, 140, 302.0, 1.7245, 10.5, 70], [17.0]},
  {[0, 8, 198, 429.0, 2.1705, 10.0, 70], [15.0]},
  ...
]
```

We have a list of tuples consisting of features and labels. Let's split it into training and test sets. As before, I'll use an 80-20 ratio.

```elixir
{train_data, test_data} =
  data
  |> Enum.shuffle()
  |> Enum.split(data |> length() |> Kernel.*(0.8) |> ceil())
```

So far, so familiar... Now, things get different. Axon's training function is Axon.Loop.run/4, which takes an Enum or Stream split into batches as an argument. Given the nature and size of ML data, using Streams is a much better idea. So, let's prepare the data.

```elixir
batch_size = 4

train_stream =
  train_data
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Stream.map(fn chunks ->
    {x_chunk, y_chunk} = Enum.unzip(chunks)
    {Nx.tensor(x_chunk), Nx.tensor(y_chunk)}
  end)

test_stream =
  test_data
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Stream.map(fn chunks ->
    {x_chunk, y_chunk} = Enum.unzip(chunks)
    {Nx.tensor(x_chunk), Nx.tensor(y_chunk)}
  end)
```

`batch_size` determines the size of each batch of examples. It's quite important to get it tuned: increasing `batch_size` speeds up learning, but increases memory consumption and sometimes gives poorer results. Small batches, on the other hand, converge better, giving smoother gradient descent, but are significantly slower. `batch_size` is one of the **hyperparameters** of the model, which means it's something you tune rather than something determined by the training itself.
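Since `Stream.chunk_every/4` is plain Elixir, the batching behavior can be seen in isolation with toy data (numbers standing in for the car examples):

```elixir
# Toy illustration of batching with Stream.chunk_every/4.
# With :discard, a trailing chunk smaller than batch_size is dropped,
# so every batch the model sees has exactly the same shape.
batch_size = 4

batches =
  1..10
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Enum.to_list()

# => [[1, 2, 3, 4], [5, 6, 7, 8]]  (9 and 10 are discarded)
```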

Okay, it's model creation time! Let's start with something "stupid" 😉 - just one neuron in the hidden layer and one in the output layer, two neurons total.

```elixir
model =
  Axon.input("car_features", shape: {nil, 7})
  |> Axon.dense(1, activation: :relu) # hidden layer, just 1 neuron
  |> Axon.dense(1)                    # output layer

# Result
#Axon<
  inputs: %{"car_features" => {nil, 7}}
  outputs: "dense_1"
  nodes: 4
>
```

The model takes input with 7 features, passes it to the hidden layer with one neuron and the relu activation function, then passes the result to the output layer with a linear activation function.

Axon provides nice functions for visualizing the model as a graph or table - let's try out the latter.

```elixir
Axon.Display.as_table(model, Nx.template({1, 7}, :f32)) |> IO.puts()

# Result
+------------------------------------------------------------------------------------------------------+
|                                                Model                                                 |
+===================================+=============+==============+=================+===================+
| Layer                             | Input Shape | Output Shape | Options         | Parameters        |
+===================================+=============+==============+=================+===================+
| car_features ( input )            | []          | {1, 7}       | shape: {nil, 7} |                   |
|                                   |             |              | optional: false |                   |
+-----------------------------------+-------------+--------------+-----------------+-------------------+
| dense_0 ( dense["car_features"] ) | [{1, 7}]    | {1, 1}       |                 | kernel: f32[7][1] |
|                                   |             |              |                 | bias: f32[1]      |
+-----------------------------------+-------------+--------------+-----------------+-------------------+
| relu_0 ( relu["dense_0"] )        | [{1, 1}]    | {1, 1}       |                 |                   |
+-----------------------------------+-------------+--------------+-----------------+-------------------+
| dense_1 ( dense["relu_0"] )       | [{1, 1}]    | {1, 1}       |                 | kernel: f32[1][1] |
|                                   |             |              |                 | bias: f32[1]      |
+-----------------------------------+-------------+--------------+-----------------+-------------------+
Total Parameters: 10
Total Parameters Memory: 40 bytes
```

Alright, so the model has 10 parameters in total - 7 weights and 1 bias for the hidden layer, plus 1 weight and 1 bias for the output layer. The analogous linear regression model has 8 parameters - 7 weights and 1 bias - so it's pretty similar.

The model is ready to go, so now we're going to train it.

```elixir
trained_model_state =
  model
  |> Axon.Loop.trainer(:mean_squared_error, :adam)
  |> Axon.Loop.metric(:mean_absolute_error)
  |> Axon.Loop.run(train_stream, %{}, epochs: 30)

# Console output
Epoch: 0, Batch: 150, loss: 785.9227295 mean_absolute_error: 22.4253941
Epoch: 1, Batch: 143, loss: 462.5274353 mean_absolute_error: 9.1923170
Epoch: 2, Batch: 136, loss: 336.9065247 mean_absolute_error: 7.2870026
...
Epoch: 29, Batch: 111, loss: 76.6344452 mean_absolute_error: 3.7401807
```

I used `:mean_squared_error` as the loss function, `:adam` as the gradient descent optimizer, `:mean_absolute_error` (MAE) as a cost indicator, and set the model to train for 30 epochs.

As you can see in the output, MAE decreased from ~22 to ~4, which looks promising.

Training is done. Now it's time for testing. Let's start with checking the MAE for the test data set.

```elixir
model
|> Axon.Loop.evaluator()
|> Axon.Loop.metric(:mean_absolute_error)
|> Axon.Loop.run(test_stream, trained_model_state)

# Result
Batch: 77, mean_absolute_error: 3.8084450
```

MAE for the test set is similar to the training result, which suggests the model probably doesn't overfit.

Anyway, we need a more meaningful metric like the **R² score** - the same one we used for the linear regression model. This time I'll use the r2_score/3 function from the Scholar library.

```elixir
{x_test, y_test} =
  test_data
  |> Enum.unzip()
  |> then(fn {x, y} -> {Nx.tensor(x), Nx.tensor(y)} end)

{_, predict_fn} = Axon.build(model)
y_pred = predict_fn.(trained_model_state, x_test)

Scholar.Metrics.Regression.r2_score(Nx.flatten(y_test), Nx.flatten(y_pred))
|> Nx.to_number()
|> IO.inspect(label: "Accuracy")

# Result
Accuracy: 0.5886043906211853
```

To make predictions you need to build the prediction function and use the trained model. It looks a bit odd at first, but you can get used to it.

**R² of ~0.59** is not bad, but not spectacular either. Let me remind you that linear regression with feature standardization resulted in an R² of ~0.68.

As usual, I have a few tricks that we can use to quickly improve our regression model.

- **Increase the number of neurons/layers** - our model has only 2 neurons! In terms of capability, it's more stupid than a roundworm - a "little-primitive-disgusting bug" (sorry, biologists!) that proudly carries 302 neurons 😮! Generally speaking, **increasing the complexity of a neural network increases its capability**. Capability **is not equal to performance**, though, and the more complex the model gets, the more expensive it is. So as always - it's a tradeoff.
- **Fix underfitting/overfitting** - simply speaking, **underfitting** usually means you have too little (or bad) training data. **Overfitting** occurs when the model does noticeably worse in tests than in training. I guess that's our case.
- **Tune hyperparameters** - try changing some parameters like the number of epochs or the batch size, change the optimizer, etc.
- **Adjust the architecture** - there are many types of neural network architectures that work better for certain kinds of problems, like CNNs, RNNs, etc. To be honest, I'm not aware of anything specific for regression.

Okay, let's introduce some tweaks and give it a shot!

```elixir
batch_size = 1
# ...
model =
  Axon.input("car_features", shape: {nil, 7})
  |> Axon.dense(32, activation: :relu)
  |> Axon.dense(8, activation: :relu)
  |> Axon.dense(1)

trained_model_state =
  model
  |> Axon.Loop.trainer(:mean_squared_error, :adam)
  |> Axon.Loop.metric(:mean_absolute_error)
  |> Axon.Loop.run(train_stream, %{}, epochs: 40)
# ...

# Result
Batch: 77, mean_absolute_error: 2.3542204
Accuracy: 0.8551386594772339
```

Decreasing the batch size (4 → 1) and increasing the number of neurons (2 → 41) and epochs (30 → 40) improved the **R² score from ~0.59 to ~0.86**. MAE decreased from ~3.8 to ~2.4.

That's impressive for just 41 neurons. I bet the mentioned roundworm (remember, 302 neurons according to scientists) couldn't do this better! 😛

Let's take a look at how changing the two-neuron layer to two layers with 32 and 8 neurons changes the architecture of the model.

```elixir
+-------------------------------------------------------------------------------------------------------+
|                                                 Model                                                 |
+===================================+=============+==============+=================+====================+
| Layer                             | Input Shape | Output Shape | Options         | Parameters         |
+===================================+=============+==============+=================+====================+
| car_features ( input )            | []          | {1, 7}       | shape: {nil, 7} |                    |
|                                   |             |              | optional: false |                    |
+-----------------------------------+-------------+--------------+-----------------+--------------------+
| dense_0 ( dense["car_features"] ) | [{1, 7}]    | {1, 32}      |                 | kernel: f32[7][32] |
|                                   |             |              |                 | bias: f32[32]      |
+-----------------------------------+-------------+--------------+-----------------+--------------------+
| relu_0 ( relu["dense_0"] )        | [{1, 32}]   | {1, 32}      |                 |                    |
+-----------------------------------+-------------+--------------+-----------------+--------------------+
| dense_1 ( dense["relu_0"] )       | [{1, 32}]   | {1, 8}       |                 | kernel: f32[32][8] |
|                                   |             |              |                 | bias: f32[8]       |
+-----------------------------------+-------------+--------------+-----------------+--------------------+
| relu_1 ( relu["dense_1"] )        | [{1, 8}]    | {1, 8}       |                 |                    |
+-----------------------------------+-------------+--------------+-----------------+--------------------+
| dense_2 ( dense["relu_1"] )       | [{1, 8}]    | {1, 1}       |                 | kernel: f32[8][1]  |
|                                   |             |              |                 | bias: f32[1]       |
+-----------------------------------+-------------+--------------+-----------------+--------------------+
Total Parameters: 529
Total Parameters Memory: 2116 bytes
```

Total **number of parameters increased from 10 to 529**. This means more computations, time, and memory are required for the training.

In the end, the linear regression model with feature engineering performed almost the same as the neural network model after tuning, achieving an accuracy of about 85%. Although the results look pretty much the same, they are totally different kinds of beasts.

Take a look at this oversimplified, yet still instructive table.

| | Linear Regression | Neural Network |
| --- | --- | --- |
| ML Problems | Just regressions | Regressions, binary/multi-class classifications |
| Input/output relationship | Just linear (+ simple non-linear with feature engineering) / "simpler" | Linear and highly nonlinear / "complex" |
| Data preparation | Very important | Just helpful |
| Training time | Fast | Slow |
| Resource consumption | Cheap | Expensive |
| Finding input/output correlations | Easy | Practically impossible |
| Achieved accuracy (R²) | 68% (87% with feature engineering) | 85% |
| Hype | 😐 | 🙂 (🤩 when deep learning) |

It's hard to compare linear regression to neural networks, since the latter can solve different kinds of problems, so I'll focus just on regression. In terms of choosing between them, IMO the most important factor is the relationship between input and output - in other words, the **linearity of the data**.

When you deal with a complex problem where the features map to the labels in some crazy pattern, you basically have no choice - a neural network is the only viable option. ANNs have amazing capabilities - they can deal with super complex data without special preparation. It's kind of a silver bullet. BUT...

Neural Networks are greedy. Training and running predictions with ANN are much more expensive in both resources and time. That's where linear regression shines.

Linear regression is a great choice for linear or simple input/output relationships. It requires more work on feature engineering, but once that's set up, it trains and runs super fast.

Oh no, I forgot about the hype... Forget linear regression, there's no such a thing. Go with neur... Deep Learning! 🚀

Again, the Elixir ecosystem proves it can handle machine learning, like neural networks, without any problem. Axon does the job well! TBH I don't have much experience with ML in Python using PyTorch or TensorFlow, but the Nx + Axon duo looks very solid and IMO it's a viable option. Notably, you can **transfer Elixir ML models from/to Python using ONNX** (Open Neural Network Exchange) tools like AxonOnnx.

Elixir is still a bit exotic, I know. It's a big shame that it hasn't gotten the hype it deserves. But I've gotten used to that. Anyway, I'm very grateful to all the contributors for such a great ecosystem 🙏. And there's LiveView Native on the horizon... Can't wait! 💜

First things first - what is this function, and what does it do? In short, Logistic Regression predicts the probability (from 0 to 1) of an event occurring. More practically, it's used for **solving classification problems**.

For example, you can use logistic regression to determine how probable it is that a recent email message you received is spam. In such a case, 0 would mean that the email is NOT spam, whereas 1 means it's definitely spam.

BTW, notice that the name Logistic "Regression" seems misleading, since it's often used as a binary classifier. The naming feels wrong...

Well, the function returns **any value between 0 and 1**, so there are no finite classes of output - by nature, it's a regression function. But often we pipe it to another function that checks if the output is greater than 0.5. Then it's used as a classifier.

Logistic regression is related to linear regression. The biggest difference is that the output values (`y`) are limited to the 0-1 range. You can easily notice the difference on the graph.

The center part of the graph resembles a steep linear function, but as it moves from the center toward the edges, the line becomes squeezed around 0 and 1. The function used by logistic regression is the sigmoid function, which is characterized by its distinct sigmoid curve shape.

The sigmoid function contains a linear function component in the exponent of its denominator.

$$f(x) = \frac{1}{1+e^{-(ax + b)}}$$

When you think "in functions", it's the same old linear regression piped into the **sigmoid** function, which, as you'd expect, is covered by the Nx library - `Nx.sigmoid/1`. So, we have a linear function, weights, and a ready-to-go sigmoid function. It looks quite familiar. The difference is the output, which can be anything between 0 and 1. In Machine Learning and statistics, we call it a **logit**.

Logit describes the probability, so it's kind of a scaled regression. Making a classifier out of it is pretty easy - after applying a sigmoid function, all you need to do is set some threshold like 0.5 and check if the value is greater than it (1 or "true") or lesser (0 or "false"). That's it!
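The whole pipeline - linear score, sigmoid, threshold - can be sketched in a few lines of plain Elixir (a scalar toy version; the `TinyClassifier` module name is mine, and the real model does this on Nx tensors):

```elixir
# Scalar sketch of logistic classification:
# raw score -> sigmoid -> probability -> threshold -> 0/1 class.
defmodule TinyClassifier do
  def sigmoid(z), do: 1.0 / (1.0 + :math.exp(-z))

  # turn the probability into a class with a configurable threshold
  def classify(z, threshold \\ 0.5) do
    if sigmoid(z) > threshold, do: 1, else: 0
  end
end

TinyClassifier.classify(2.0)  # => 1  (sigmoid(2.0) ~ 0.88)
TinyClassifier.classify(-1.0) # => 0  (sigmoid(-1.0) ~ 0.27)
```

Note that raising the threshold (e.g. to 0.9) flips borderline "true" predictions back to 0 - which is exactly the knob we'll turn later to trade recall for precision.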

In terms of the cost function, it turns out it's a totally different story than for linear regression. Recall that for linear regression the Mean Squared Error (MSE) method was the best option, since it always converges to one global minimum (it's a convex function).

For logistic regression, MSE would have many local minima (it's non-convex), so it won't work. Fortunately, Binary Cross-entropy fits perfectly for classifiers like logistic regression. The formula of the loss function depends on whether the actual value is 1 or 0.

I use the following notation: `y` is a prediction, `y'` is an actual value, and `n` is the number of examples.

$$\frac{1}{n} \sum\limits_{i=1}^{n} - log(y_i), \text{ if } y_i'= 1$$

$$\frac{1}{n} \sum\limits_{i=1}^{n} - log(1 - y_i), \text{ if } y_i'= 0$$

Calculating the cost function in two steps, for the actual value of 0 and 1 is not too exciting. The good news is that there's a simplified formula that handles both cases.

$$-\frac{1}{n} \sum\limits_{i=1}^{n} y'_i \log(y_i) + (1 - y'_i)\log(1 - y_i)$$

As you might expect, there's a vectorized version of the equation that you can put easily to Nx, though you have to be careful when typing.

$$-\frac{1}{n}\left(Y'^T \log(Y) + (1-Y')^T \log(1-Y)\right)$$
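As a sanity check on the simplified formula, here's a plain-Elixir list version (a sketch with a hypothetical `BCE` module name; the actual implementation below uses the vectorized Nx form):

```elixir
# Binary cross-entropy over plain lists (illustration only).
defmodule BCE do
  def cost(y_actual, y_pred) do
    n = length(y_actual)

    y_actual
    |> Enum.zip(y_pred)
    |> Enum.map(fn {y_a, y_p} ->
      # the simplified formula: one term is zeroed out depending on y_a
      y_a * :math.log(y_p) + (1 - y_a) * :math.log(1 - y_p)
    end)
    |> Enum.sum()
    |> Kernel./(n)
    |> Kernel.*(-1)
  end
end

# Confident, correct predictions -> low cost
BCE.cost([1, 0], [0.9, 0.1]) # => ~0.105
```

Flipping the predictions (`[0.1, 0.9]`) sends the cost up sharply, which is exactly the behavior gradient descent needs.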

There's one more difference between linear and logistic regression: checking the accuracy. Instead of the Coefficient of Determination, for logistic regression we use a super simple formula called the **Positive Rate**. It checks how many predictions the model got right.

$$\frac{n_{correct}}{n}$$

Simple as that! In this case, the accuracy lies between 0 and 1 - the closer to 1, the better. There's one issue with this metric though: it's not too useful for skewed data or when there are special requirements on the "sensitivity" of the model. That's a bigger topic for another time, but I just wanted to quickly mention that another useful metric for classifiers is the Confusion Matrix.
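The Positive Rate is simple enough to sketch over plain 0/1 lists (a toy illustration; the `Accuracy` module name is mine):

```elixir
# Positive rate: fraction of predictions the model got right.
defmodule Accuracy do
  def positive_rate(predictions, actuals) do
    correct =
      predictions
      |> Enum.zip(actuals)
      |> Enum.count(fn {p, a} -> p == a end)

    correct / length(actuals)
  end
end

Accuracy.positive_rate([1, 0, 1, 1], [1, 0, 0, 1]) # => 0.75
```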

Alright, we've already covered the theory - now it's time for Elixir and Nx in action!

Linear regression and logistic regression have a lot in common, which is why I won't implement everything from scratch. Instead, I'll use the code from the article about Linear Regression with Elixir and make some adjustments so it works as a classifier.

Our job is to predict whether a car from this dataset will pass emissions tests, based on all features... except the useless `carname` 😉. Let's take a look at the data.

| passedemissions | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | carname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FALSE | 18 | 8 | 307 | 130 | 1.752 | 12 | 70 | chevrolet chevelle malibu |
| TRUE | 22 | 4 | 140 | 72 | 1.204 | 19 | 71 | chevrolet vega (sw) |
| FALSE | 18 | 6 | 225 | 105 | 1.8065 | 16.5 | 74 | plymouth satellite sebring |

Okay, so the data is the same as for linear regression, but this time we'll use `passedemissions` as the label (`y'`) and `mpg` as an additional feature.

Let's take a look at the changes in the code.

```elixir
defmodule LogisticRegression do
  alias NimbleCSV.RFC4180, as: CSV
  import Nx.Defn

  # ...

  def load_data(data_path) do
    data_path
    |> File.stream!()
    |> CSV.parse_stream()
    |> Stream.map(fn row ->
      [
        passedemissions,
        mpg,
        cylinders,
        displacement,
        horsepower,
        weight,
        acceleration,
        modelyear,
        _carname
      ] = row

      {[
         parse_float(mpg),
         parse_int(cylinders),
         parse_int(horsepower),
         parse_float(displacement),
         parse_float(weight),
         parse_float(acceleration),
         parse_int(modelyear)
       ], [parse_boolean(passedemissions)]}
    end)
    |> Enum.to_list()
  end

  defn gradient_descent(x, w, y, alpha) do
    y_pred = Nx.dot(x, w) |> Nx.sigmoid()
    diff = Nx.subtract(y_pred, y)

    gradient_descent =
      x
      |> Nx.transpose()
      |> Nx.dot(diff)
      |> Nx.multiply(1 / elem(x.shape, 0))

    gradient_descent
    |> Nx.multiply(alpha)
    |> then(&Nx.subtract(w, &1))
  end

  defn cost(x, w, y) do
    y_pred = x |> Nx.dot(w) |> Nx.sigmoid()

    term_one =
      y
      |> Nx.transpose()
      |> Nx.dot(Nx.log(y_pred))

    term_two =
      y
      |> Nx.multiply(-1)
      |> Nx.add(1)
      |> Nx.transpose()
      |> Nx.dot(
        y_pred
        |> Nx.multiply(-1)
        |> Nx.add(1)
        |> Nx.log()
      )

    term_one
    |> Nx.add(term_two)
    |> Nx.divide(elem(y.shape, 0))
    |> Nx.multiply(-1)
    |> Nx.squeeze()
  end

  def predict(%Model{weights: w}, x) do
    x
    |> prepend_with_1s()
    |> Nx.dot(w)
    |> Nx.sigmoid()
    |> Nx.greater(0.5)
  end

  defn accuracy(w, x, y) do
    incorrect_count =
      x
      |> prepend_with_1s()
      |> Nx.dot(w)
      |> Nx.sigmoid()
      |> Nx.greater(0.5)
      |> Nx.subtract(y)
      |> Nx.abs()
      |> Nx.sum()
      |> Nx.flatten()

    y_count = elem(y.shape, 0)

    y_count
    |> Nx.subtract(incorrect_count)
    |> Nx.divide(y_count)
  end
end
```

First thing to notice: each `y_pred` is piped to `Nx.sigmoid()` and then to `Nx.greater(0.5)` (except when calculating gradient descent). I split the crazy cost function so it's easier to digest. The `accuracy/3` function is completely different than for linear regression. There are also some cosmetic changes, like making names more generic (*mse* -> *cost*, *r2* -> *accuracy*).

```elixir
model = %Model{
  alpha: 0.2,
  epochs: 200
}

trained_model = LogisticRegression.train(model, x_std, y)

LogisticRegression.test_cost(x_test_std, trained_model.weights, y_test)
|> Nx.to_number()
|> IO.inspect(label: "Test Cost")

LogisticRegression.accuracy(trained_model.weights, x_test_std, y_test)
|> Nx.to_number()
|> IO.inspect(label: "Accuracy")

# Result
Test Cost: 0.10394015908241272
Accuracy: 0.9743589758872986
```

**Accuracy of ~0.97?!** Considering this is shallow learning with a dataset of just ~400 examples, it's pretty good!

Let's take a look at some graphs. First, cost over time - or, more precisely, over epochs.

The initial cost is relatively low and goes down. It seems that cross-entropy works well for logistic regression. Remember what I briefly mentioned about other metrics for classifiers, such as the confusion matrix? Let's take a look.

| Predicted | Actual | Name | Count |
| --- | --- | --- | --- |
| 1 | 1 | True Positive | 53 |
| 1 | 0 | False Positive | 2 |
| 0 | 0 | True Negative | 23 |
| 0 | 1 | False Negative | 0 |

The model correctly predicted true 53 times and false 23 times. It went wrong twice, classifying false as true. Let's think about this for a minute. The model made 78 predictions on the test set, achieving a pretty good accuracy of ~97%. But two cars made it through the emissions test (false positives), although they shouldn't have...

Let's imagine that there are huge penalties for allowing cars to pass the emissions test when they shouldn't (remember the "Dieselgate" scandal?). In such a case, an accuracy of 97% is not that impressive. We need to deal with the false positives.

There are two more useful metrics for checking the performance of a classifier model - Precision and Recall. They sit at two ends of a spectrum: one end is marking "true" only when the model is very confident (higher precision, lower recall); the other end is higher sensitivity (lower precision, higher recall).

Let's peek at the formulas, fortunately, they're quite clear.

$$precision = \frac{n_{true\_positives}}{n_{total\_predicted\_positives}}$$

$$recall = \frac{n_{true\_positives}}{n_{total\_actual\_positives}}$$

Both metrics take values from 0 to 1. So, in our case, *precision* = ~0.964 and *recall* = 1. But our business requirement is high precision. Can we change anything in the model to increase precision? Yes, we can - and we will!
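Plugging the confusion matrix counts from the table above (TP = 53, FP = 2, FN = 0) into the formulas gives exactly those numbers (a plain-Elixir sketch; the module name is mine):

```elixir
# Precision and recall from raw confusion-matrix counts.
defmodule PrecisionRecall do
  # precision: of everything predicted positive, how much really was
  def precision(tp, fp), do: tp / (tp + fp)

  # recall: of everything actually positive, how much we caught
  def recall(tp, fn_count), do: tp / (tp + fn_count)
end

PrecisionRecall.precision(53, 2) # => ~0.964
PrecisionRecall.recall(53, 0)    # => 1.0
```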

Let's peek at the `predict` function.

```elixir
def predict(%Model{weights: w}, x) do
  x
  |> prepend_with_1s()
  |> Nx.dot(w)
  |> Nx.sigmoid()    # -> logit, e.g. 0.435, 0.975, etc.
  |> Nx.greater(0.5) # <- threshold
  # -> classification: 1 if greater than the threshold, 0 otherwise
end
```

Aha! We can easily adjust the threshold. `0.5` is quite a universal value and makes sense in many cases. But we'd like to increase precision - to be more confident that true will really be true - so we need to increase the threshold. Let's give it a shot with the value `0.8` and see what happens.

```elixir
# Threshold 0.8

# Result
Training Cost: 0.14220784604549408
Test Cost: 0.2054743468761444
Accuracy: 0.8717948794364929
True Positive: 46
False Positive: 0
True Negative: 22
False Negative: 10
Precision: 1
Recall: 0.6764705882
```

Increasing the threshold from 0.5 to 0.8 caused the following effects:

- The training cost didn't change - gradient descent uses the logit, so **the threshold doesn't affect the training**, just the predictions
- The test cost increased and accuracy dropped to ~0.87
- There are no false positives and precision reached a perfect score
- Recall went down from 1 to ~0.68, and there are now 10 false negatives

Even though the accuracy has decreased, the business goal has been achieved - the model does not approve any cars that should not pass the emission test 👍.

The confusion matrix can be shown as a heatmap. It's quite easy to generate with VegaLite lib.

```elixir
# Confusion Matrix
alias VegaLite, as: Vl

Vl.new(title: "Confusion Matrix", width: 600, height: 600)
|> Vl.data_from_values(%{
  predicted: Nx.to_flat_list(predictions),
  actual: Nx.to_flat_list(actual)
})
|> Vl.mark(:rect)
|> Vl.encode_field(:x, "predicted")
|> Vl.encode_field(:y, "actual")
|> Vl.encode(:color, aggregate: :count)
```

I added the labels, so everything should be crystal clear.

Implementing logistic regression based on the linear regression model wasn't too tough. The trickiest part was to write the cost function without any bugs - it's easy to make a mistake in the formula.

The logistic regression model we discussed did very well on the test data set, achieving an accuracy of 97%. After adjusting the threshold to increase precision and eliminate false positives, the accuracy dropped slightly to 87%. That's impressive, especially considering the small dataset of ~400 examples in total.

Linear/logistic regressions are very important since they are the foundational blocks of neural networks. Don't be misled by the name "shallow learning models" - it doesn't sound fancy, but they're pretty powerful and much "cheaper" than deep learning methods.

| | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Type | Regression | Classifier (technically regression, when not classifying the logit) |
| Output | -∞ to +∞ | 0 or 1 (technically 0 to 1, when not classifying the logit) |
| Cost Function | Mean Squared Error | Binary Cross-entropy |
| Performance Metrics | R² Score | Positive Rate (also Confusion Matrix, Precision, Recall, F1 Score) |
| Achieved accuracy for the dataset | 68% (87% with feature engineering) | 97% |

Getting as much high-quality data as possible is essential in Machine Learning. But the question is: what does "high-quality" mean? In terms of features, it means they should be interpretable by the ML model (numbers) and meaningful for the training process.

"Meaningful?" Yeah, in short, it means that meaningful features should be correlated with labels. Another thing is that features should match the ML model, since e.g. it seems that for linear regression parabolic-curve features are not too useful... Until you enhance your features!

**Feature engineering is a process of transforming and creating features based on your current dataset.** Feature scaling we did in the previous article is one of the most common and effective feature engineering techniques.

Let's analyze the dataset for MPG (Miles Per Gallon) predictions.

| passedemissions | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | carname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FALSE | 18 | 8 | 307 | 130 | 1.752 | 12 | 70 | chevrolet chevelle malibu |
| FALSE | 15 | 8 | 350 | 165 | 1.8465 | 11.5 | 70 | buick skylark 320 |

Here are two examples from the dataset. There's one label (`mpg`) and 8 possible features. Which features seem useful for linear regression? Hard to say... But it's much easier to say which one is irrelevant: `carname`.

There are two problems with `carname`: it's hard to convert to a number, and, even more important, does `carname` affect MPG at all? Will naming a car *"Super-Duper Eco X123"* make it more fuel efficient? For car dealers: definitely 😉 For data engineers: nope. We can skip it altogether.

`carname` was an easy one. The other features are trickier, so let's make the analysis easier and plot some graphs.

Now I'll show you a graph of each feature (`x`) against MPG (`y`).

`passedemissions` values are accumulated at `x=0` and `x=1`. Originally, the values were `FALSE` and `TRUE`, so I mapped them to `0` and `1` respectively. Hmm... there's some correlation, since MPG is lower for `x=0` than for `x=1`, but it doesn't seem useful for the linear regression model.

This one is more meaningful, since it looks like more `cylinders` = lower MPG. BTW, I didn't know about 5-cylinder cars before.

`displacement` was the primary feature I used in the previous article and, as you can see, there's a strong correlation.

`horsepower` is pretty similar to `displacement`, which makes sense.

I was curious about `weight` and, as I suspected, it indeed affects MPG.

`acceleration` doesn't look promising, since the MPG values are spread chaotically all over the graph.

Last but not least, `modelyear`. This one is interesting: MPG varies within a given year, but on the other hand you can see a trend that newer cars are more fuel efficient and have relatively higher MPG values.

It seems that `displacement`, `horsepower`, and `weight` look meaningful, but their shapes don't resemble straight lines. But does it have to be a straight line? It turns out that linear regression **also supports other function shapes**!

The shape of the points for the mentioned features resembles a few functions, like the square root function - we'll give it a shot! But what about `cylinders` and `modelyear`? They look somewhat useful and I can imagine drawing a straight line through them, so the classic linear function will do the trick.

The table below shows the conclusions in a more compact way.

| Feature (x) | Looks Useful? | Function Shape |
| --- | --- | --- |
| passedemissions | No | - |
| cylinders | Somewhat | Straight line (`y = wx + b`) |
| displacement | Yes | Square root (`y = wx + z√x + b`) |
| horsepower | Yes | Square root (`y = wx + z√x + b`) |
| weight | Yes | Square root (`y = wx + z√x + b`) |
| acceleration | No | - |
| modelyear | Somewhat | Straight line (`y = wx + b`) |

Now is the time for feature engineering in practice - I'm going to achieve a more "slide-ish" shape where it makes sense.

Let's start with `displacement`. First, I'm going to rerun the model from the previous article with standardized features and `displacement` as it is.

```elixir
LinearRegression.r2(trained_model.weights, x_test_std, y_test)
|> Nx.to_number()
|> IO.inspect(label: "Accuracy")

# Result
Accuracy: 0.6336648464202881
```

Accuracy is ~0.63. And the shape of the prediction line is as expected - totally straight.

Now we'll **improve it by adding a new feature** - the square root of x.

```elixir
x_w_sqrt = Nx.concatenate([x, Nx.sqrt(x)], axis: 1)
x_test_w_sqrt = Nx.concatenate([x_test, Nx.sqrt(x_test)], axis: 1)

# Result
#Nx.Tensor<
  f32[314][2]
  EXLA.Backend<host:0, 0.1855147666.821166100.63419>
  [
    [53.0, 7.280109882354736],
    [83.0, 9.110433578491211],
    [60.0, 7.745966911315918],
    [90.0, 9.486832618713379],
    ...
  ]
>
```

Simple, isn't it? Just remember that from now on, you need to add the new feature to all feature sets - for training, test, and predictions. And for this data set we'll get...

```elixir
# features - [displacement, sqrt(displacement)]
LinearRegression.r2(trained_model.weights, x_test_w_sqrt_std, y_test)
|> Nx.to_number()
|> IO.inspect(label: "Accuracy")

# Result
Accuracy: 0.7282562255859375
```

The accuracy went up from ~0.63 to ~0.73 using the very same data and some feature engineering! Nice! And how will it impact the prediction function's shape? Let's see...

Looks pretty good! I hope you can feel it. To have a better idea of how feature engineering and feature selection impact the accuracy I did a few additional tests with different feature combinations.

| Features | Accuracy (R²) | MSE |
| --- | --- | --- |
| all | 0.7895334959030151 | 16.324716567993164 |
| all + sqrt() | 0.8667313456535339 | 10.336909294128418 |
| all except `passedemissions` and `acceleration` | 0.7957956194877625 | 15.838998794555664 |
| all + sqrt() except `passedemissions` and `acceleration` | 0.8683594465255737 | 10.210624694824219 |
| `passedemissions` | 0.5237807035446167 | 36.937686920166016 |
| `cylinders` | 0.6247956156730652 | 29.10251808166504 |
| `displacement` | 0.6442053318023682 | 27.597017288208008 |
| `displacement` + sqrt() | 0.7580759525299072 | 18.76470375061035 |
| `horsepower` | 0.6671119928359985 | 25.820274353027344 |
| `horsepower` + sqrt() | 0.738216757774353 | 20.305068969726562 |
| `weight` | 0.6993056535720825 | 23.32318878173828 |
| `weight` + sqrt() | 0.7324317097663879 | 20.7537841796875 |
| `acceleration` | 0.2165735960006714 | 60.76603317260742 |
| `modelyear` | 0.3578674793243408 | 49.8066520690918 |

First note: **feature sets extended with the square root feature perform better** than the original ones. `acceleration` was the most useless feature (accuracy around 0.22, MSE of almost 61!). But performance with and without this feature was pretty much the same. The regression handles such features by setting their weights close to zero, so they're insignificant.

The biggest surprise for me here is that `passedemissions` did better than `modelyear` - accuracy of ~0.52 vs ~0.36.

Feature engineering is about improving feature set by scaling or creating new ones. Common practice is to apply some functions to a feature like raising to a given power (polynomial regression).

Feature engineering is quite fun because it requires both soft skills like creativity and intuition, and hard mathematical skills, like identifying mathematical function shapes and formulas.

This post describes the whole process of creating an ML model step by step, testing it, and making some predictions. I'll cover many useful techniques, such as preparing the data, explaining Gradient Descent, writing a test, etc.

Would "a stupid straight line" be helpful in some serious Machine Learning stuff? Let's figure this out!

Our goal will be to analyze a CSV file with car data and try to predict each car's efficiency, expressed as *Miles Per Gallon* (**MPG**), based on some features using Linear Regression.

You can download the file from here. It contains almost 400 examples. Let's take a look at the data we have.

| passedemissions | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | carname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FALSE | 18 | 8 | 307 | 130 | 1.752 | 12 | 70 | chevrolet chevelle malibu |
| TRUE | 22 | 4 | 140 | 72 | 1.204 | 19 | 71 | chevrolet vega (sw) |
| FALSE | 18 | 6 | 225 | 105 | 1.8065 | 16.5 | 74 | plymouth satellite sebring |

The second column contains our label - we'd like to predict MPG based on the other factors. But... what may be useful for it? Can we use more than one feature for the Linear Regression?

Let's start with the Linear Function formula.

$$y = wx + b$$

`y` is the value we'd like to predict based on the `x` feature, which is multiplied by `w` (**weight**) and increased/decreased by `b` (**bias**). If we consider taking horsepower as the only feature, we can show it in a friendlier form.

$$MPG = w \times horsepower + b$$

Our ultimate goal is to somehow calculate `w` and `b` so we get the smallest possible error for the prediction. It's easier to digest by looking at the graph.

For all the data points we have, we'd like to determine the slope (**w**) and offset (**b**) of the prediction line so that the red dashed lines are as short as possible.

If we measure the red lines (**Residual Errors**), then we can calculate the overall error, called **Mean Squared Error (MSE)**, using the following formula.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i' - y_i)^2$$

Where `y'` is the actual value (the green points), and `y` is the prediction (the line).

MSE ranges from 0 up to arbitrarily large values. "Sky is the limit" - but in this case, we're heading in the other direction, down to 0. Zero means a perfect solution! In practice, though, it's unreachable. The rule of thumb: **the lower the MSE, the better**.
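As a quick sanity check of the formula, here's a tiny worked example with made-up numbers: two examples with actual values $y' = (3, 5)$ and predictions $y = (2, 7)$.

$$MSE = \frac{(3-2)^2 + (5-7)^2}{2} = \frac{1 + 4}{2} = 2.5$$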

Training the ML model is about iterating through all `x` examples to find the `w` and `b` parameters for which the cost function result is as small as possible. Different algorithms have different optimal cost functions. MSE is the optimal choice for linear regression - you'll see why in the Gradient Descent section.

So, we have the data... A functional programmer's intuition says: *"I see! We can iterate through all* `x`*-*`y'` *pairs using a reducer, do the math, and accumulate the result!"*. Yes! That's valid, but... let's not do this. Not in the ML world. Let's make use of our fancy tensors!

Instead of iterating through each `x` and doing `n` calculations as a result, we'll take the vectorized approach. To make this work, we need to put all the data into tensors (particularly matrices) and **do the math in just one shot!**

The vectorized version of linear regression looks like this:

$$Y = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \dot{} \begin{bmatrix} b\\ w\\ \end{bmatrix} = \begin{bmatrix} b + wx_1\\ b + wx_2\\ \vdots \\ b+ wx_n \end{bmatrix}$$

Magic or math - it doesn't matter what you call it 😉. It's **matrix multiplication**. The first matrix contains all `x` values in one column and 1-s in the other. The purpose of the "1-s" column is to get `b` in the result. As you can see, for `x1` you get `wx1 + b`, which is exactly what we want.

For the almost 400 records we have in our car models CSV file, when `w` and `b` are known, we can get all `y` values (MPG) by doing the calculation once. Now you can feel the power of proper machine learning!
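To make the vectorized formula concrete, here's a tiny Nx sketch (the numbers are made up for illustration): three `x` values, `w = 2`, `b = 0.5`, and a single `Nx.dot/2` call producing all predictions at once.

```elixir
x = Nx.tensor([[1.0], [2.0], [3.0]])

# prepend the "1-s" column so the bias lands in the result
ones = Nx.broadcast(1.0, {3, 1})
design = Nx.concatenate([ones, x], axis: 1)

# weights column: [b, w]
weights = Nx.tensor([[0.5], [2.0]])

Nx.dot(design, weights)
# [[2.5], [4.5], [6.5]] - i.e. b + w*x for every row, in one shot
```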

Gradient descent is one of the most common techniques for learning ML models. Let's show it on the graph, so it's easier to analyze what's going on.

The ultimate goal is to determine the **weights** (`w` and `b` for linear regression) so that the cost function, MSE, is as close as possible to the optimal value. To make any progress, it needs to calculate MSE for some weights. MSE is an arbitrary value and, in isolation, it means almost nothing. That's why gradient descent is based on **slopes** (red lines) rather than on the values themselves.

Let's analyze some interesting parts of the graph:

- The slope is going down hard, which means we're far from the optimal solution, but heading in the right direction

- Now it's still going down, but more gently - we're getting closer...

- Optimal solution - MSE is as small as it could be, weights are optimal

- The slope is gently rising, so we're quite close, but we overshoot with the current weights, which means that the next step of the change should be in the other direction (if weights have been increased, then should be decreased or the other way around)

- Similar to the above, but worse 🙂

Remember the MSE formula?

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y'_i - y_i)^2$$

We can determine the slope by calculating the derivatives with respect to `w` and `b`:

$$\frac{d(MSE)}{dw}= -\frac{2}{n} \sum_{i=1}^{n} x_i(y'_i - y_i)$$

$$\frac{d(MSE)}{db}= -\frac{2}{n} \sum_{i=1}^{n} (y'_i - y_i)$$

The formulas look a bit unappealing, but they make sense when described verbally: *For* `b`*, take the sum of all differences between actual and predicted results. For* `w`*, do the same but additionally multiply the differences by* `x`*. Multiply both results by -2 and divide by the number of examples.*
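Once we have these slopes, each gradient descent step nudges the weights against them, scaled by the learning rate $\alpha$:

$$w \leftarrow w - \alpha \frac{d(MSE)}{dw}, \qquad b \leftarrow b - \alpha \frac{d(MSE)}{db}$$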

As you can guess, we can calculate the derivatives for all weights at once using a vectorized formula.

$$D_{MSE} = -\frac{2}{n}(F^T \cdot D)$$

Where `F^T` is the transposed `features` matrix and `D` is the matrix of differences `(y'_i - y_i)`.

An extended, generic version of the formula above looks like this:

$$D_{MSE} = - \frac{2}{n} (\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n\\ \end{bmatrix} \dot{} \begin{bmatrix} y'_1-y_1\\ y'_2-y_2\\ \vdots\\ y'_n - y_n \end{bmatrix} ) = - \frac{2}{n} \begin{bmatrix} (y'_1-y_1) + (y'_2-y_2) + \cdots + (y'_n-y_n)\\ x_1(y'_1-y_1) + x_2(y'_2-y_2) + \cdots + x_n(y'_n-y_n)\\ \end{bmatrix}$$

We still haven't answered a pretty big question: **Can we use multiple features**, so besides just horsepower it also involves other data like cylinders or displacement? **Yes, we can! 🙌**

The standard, one-feature linear regression we discussed is called **univariate**. A version with multiple features is a **multiple linear regression** (sometimes also called **multivariate**, but it seems to be something else) and takes the following form.

$$y = w_1x_1 + w_2x_2 + \dotso + w_nx_n + b$$

Or in our particular case for horsepower, cylinders, and displacement.

$$y = w_1 \times horsepower + w_2 \times cylinders + w_3 \times displacement + b$$

More features equals more data. Let's take a look at how we can handle multivariate linear regression using tensors.

$$Y = \begin{bmatrix} 1 & h_1 & c_1 & d_1\\ 1 & h_2 & c_2 & d_2\\ \vdots & \vdots & \vdots & \vdots\\ 1 & h_n & c_n & d_n \end{bmatrix} \dot{} \begin{bmatrix} b\\ w_h\\ w_c\\ w_d\\ \end{bmatrix} = \begin{bmatrix} b + w_hh_1 + w_cc_1 + w_dd_1\\ b + w_hh_2 + w_cc_2 + w_dd_2\\ \vdots \\ b + w_hh_n + w_cc_n + w_dd_n \end{bmatrix}$$

I renamed `x` to the first character of the particular feature so it's easier to read. As you can see, handling two additional features requires adding two additional columns of values and two weights. Simple as that! 👌

Now it's time for the actual Elixir code! The approach I took is a "raw"/hard way, meaning everything is written from scratch. I think I'd rather use Scholar in prod, but implementing everything is way more instructive... and interesting 😉

First of all, we need to decide what data may be meaningful. Certainly, we need data from the **mpg** column for training since it's the label in our case.

In terms of features, it's totally arbitrary what data you'll decide to use. So let's sharpen our instinct and try to answer the question: *What data may affect how many miles the car can drive on one gallon?*

I'm going to write more later about how to choose features meaningfully. For now, let's start with the engine `displacement`, which seems appropriate. We'll see!

Parsing CSV and retrieving data from it is fairly simple. I wrote a simple function with a hardcoded path to the file... Yes, it could be prettier and more generic, but I'll stick with simpler code so everything is clear.

```elixir
defmodule LinearRegression do
  alias NimbleCSV.RFC4180, as: CSV

  def load_data(data_path) do
    data_path
    |> File.stream!()
    |> CSV.parse_stream()
    |> Stream.map(fn row ->
      [
        passedemissions,
        mpg,
        cylinders,
        displacement,
        horsepower,
        weight,
        acceleration,
        modelyear,
        _carname
      ] = row

      {[
         parse_boolean(passedemissions),
         parse_int(cylinders),
         parse_int(horsepower),
         parse_float(displacement),
         parse_float(weight),
         parse_float(acceleration),
         parse_int(modelyear)
       ], [parse_float(mpg)]}
    end)
    |> Enum.to_list()
  end

  def parse_boolean("TRUE"), do: 1
  def parse_boolean("FALSE"), do: 0

  defp parse_float(string_float) do
    string_float
    |> Float.parse()
    |> elem(0)
  end

  defp parse_int(string_int) do
    String.to_integer(string_int)
  end
end
```

`load_data/1` takes a path to the file as an argument. The function loads the CSV file and parses the values into appropriate types. All values could be parsed as floats, since Nx tensors use one data type for all values.

Now we'll load the data and split it into two sets: train and test.

```elixir
data = LinearRegression.load_data(data_path)

{train_data, test_data} =
  data
  |> Enum.shuffle()
  |> Enum.split(data |> length() |> Kernel.*(0.8) |> ceil())

train_count = length(train_data)
test_count = length(test_data)

x =
  Enum.map(train_data, &elem(&1, 0))
  |> Nx.tensor()
  # take only displacement column
  |> Nx.slice([0, 2], [train_count, 1])

y = Enum.map(train_data, &elem(&1, 1)) |> Nx.tensor()

x_test =
  Enum.map(test_data, &elem(&1, 0))
  |> Nx.tensor()
  # take only displacement column
  |> Nx.slice([0, 2], [test_count, 1])

y_test = Enum.map(test_data, &elem(&1, 1)) |> Nx.tensor()
```

It loads the data, shuffles it, and splits it into 80%/20% sets. Splitting the data should be intuitive - we need some of it to train the model and the rest to test it. But why bother with shuffling? **Shuffling data helps counteract data bias**.

In this example, the data seems to be ordered by the release time of the particular model. This introduces bias: if you quickly peek at MPG, it's relatively lower for older models and higher for more recent ones. To make sure the training and test sets contain similar data, you should always shuffle before splitting.

Do you remember the formula we're going to use to calculate `Y`?

$$Y = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \dot{} \begin{bmatrix} b\\ w\\ \end{bmatrix} = \begin{bmatrix} b + wx_1\\ b + wx_2\\ \vdots \\ b+ wx_n \end{bmatrix}$$

We need to prepend `X` with a column of 1s.

```elixir
defmodule LinearRegression do
  import Nx.Defn

  defn prepend_with_1s(x) do
    ones = Nx.broadcast(1, {elem(Nx.shape(x), 0), 1})
    Nx.concatenate([ones, x], axis: 1)
  end
end
```

Notice it's `defn`, not an ordinary `def` function. It's a special form of Nx function optimized for tensor calculations. TBH, I'm not sure if it's the most elegant way of achieving this, but... it works for me 😉.

Now let's take a look at the "meat" - linear regression implementation in Elixir.

Unlike Python, Elixir is a functional programming language and doesn't support the idea of classes and instances. So how do we handle the state?

I could use a plain `map`, but I'm not a big fan of using meaningless "bags" for data. I'm going to use a similar, simple but more powerful data type - a `struct`.

```elixir
defmodule Model do
  defstruct weights: nil, alpha: 0.1, mse_history: [], epochs: 100, epoch: 0

  @type t :: %__MODULE__{
          weights: Nx.t() | nil,
          alpha: float(),
          mse_history: [float()],
          epochs: integer(),
          epoch: integer()
        }
end
```

The most important part of the `Model` struct is the `weights` attribute - finding optimal weights is the ultimate goal. `alpha` is the learning rate, `mse_history` is a list of MSE values kept for visualization purposes, `epochs` determines how many training iterations we want to run, and `epoch` is the number of the current iteration.

```elixir
defmodule LinearRegression do
  def train(%Model{epochs: epochs, epoch: epoch} = model, x, y) when epoch <= epochs - 1 do
    {x, model} =
      if epoch == 0 do
        new_x = prepend_with_1s(x)
        {new_x, %{model | weights: Nx.broadcast(0, {elem(new_x.shape, 1), 1})}}
      else
        {x, model}
      end

    w = gradient_descent(x, model.weights, y, model.alpha)
    mse = mse(x, w, y) |> Nx.to_number()

    IO.puts("Epoch: #{model.epoch}, MSE: #{mse}, weights: #{w |> Nx.to_flat_list() |> inspect}\n")

    model =
      model
      |> Map.put(:weights, w)
      |> Map.update(:epoch, 1, &(&1 + 1))
      |> Map.update(:mse_history, [], &[mse | &1])

    train(model, x, y)
  end

  def train(%Model{} = model, _x, _y) do
    model |> Map.update(:mse_history, [], &Enum.reverse/1)
  end
end
```

The `train` function prepends `x` with 1-s in the first iteration and sets `weights` to a tensor of 0-s. A regular iteration calculates new `weights` using gradient descent and the MSE, then updates the `Model` struct and recursively calls itself with the new `model`.

Finally, when it runs out of iterations, it reverses `mse_history` so it's in the right order and returns the `model`.

```elixir
defmodule LinearRegression do
  defn gradient_descent(x, w, y, alpha) do
    y_pred = Nx.dot(x, w)
    diff = Nx.subtract(y_pred, y)

    gradient_descent =
      x
      |> Nx.transpose()
      |> Nx.dot(diff)
      |> Nx.multiply(1 / elem(x.shape, 0))

    gradient_descent
    |> Nx.multiply(alpha)
    |> then(&Nx.subtract(w, &1))
  end

  defn mse(x, w, y) do
    x
    |> Nx.dot(w)
    |> Nx.subtract(y)
    |> Nx.pow(2)
    |> Nx.sum()
    |> Nx.divide(Nx.shape(x) |> elem(0))
  end
end
```

Here are the `gradient_descent` and `mse` functions. Notice that the `gradient_descent` value gets multiplied by `alpha`, the learning rate. Basically, it determines how much the `weights` are going to change - in other words, how big the steps are that gradient descent takes in order to find the optimal solution.

The ML model is ready, so let's train it! In this approach, we're going to use only `displacement` as `x`.

```elixir
model = %Model{
  alpha: 0.1,
  epochs: 200
}

trained_model = LinearRegression.train(model, x, y)

# Result
%Model{
  weights: #Nx.Tensor<
    f32[2][1]
    EXLA.Backend<host:0, 0.3524400666.4193124372.233052>
    [
      [NaN],
      [NaN]
    ]
  >,
  alpha: 0.1,
  mse_history: [5914053.5, 86293921792.0, 1259188042858496.0,
   1.8373901428369392e19, 2.681093018229369e23, 3.9122144076470293e27,
   5.708641532863881e31, 8.329962648052612e35, :infinity, :infinity, :infinity,
   :infinity, :infinity, :infinity, :infinity, :infinity, :infinity, :nan, ...],
  epochs: 200,
  epoch: 200
}
```

Oops... It doesn't look promising. Something went wrong... Notice that the `MSE` values went to infinity! It seems that `alpha`, the learning rate, was too big. Let's give it another shot with a smaller `alpha`.

```elixir
model = %Model{
  alpha: 0.0001,
  epochs: 200
}

trained_model = LinearRegression.train(model, x, y)

# Result
%Model{
  weights: #Nx.Tensor<
    f32[2][1]
    EXLA.Backend<host:0, 0.3524400666.4193124372.233454>
    [
      [0.09403946250677109],
      [0.18162088096141815]
    ]
  >,
  alpha: 0.0001,
  mse_history: [227.74842834472656, 209.40399169921875, 208.5283966064453,
   208.4827117919922, 208.47647094726562, 208.4720916748047, 208.46778869628906,
   208.46353149414062, 208.459228515625, 208.45492553710938, 208.45065307617188, ...],
  epochs: 200,
  epoch: 200
}
```

For `alpha=0.0001`, it calculated the weights `b=0.09403946250677109` and `w=0.18162088096141815` and got an `MSE` of ~208. Is it good? We won't know until we perform some tests!

To test the accuracy of our linear regression model, we'll use the coefficient of determination, also known as **R²**. The formula looks quite simple.

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Let's decipher the SS factors. The first one is the Sum of Squares of Residuals (also called the Residual Sum of Squares - RSS). It tells us about the distances of the actual points to the prediction line. They're shown on the graph at the beginning of this post as red dashed lines.

The lower the SS_res, the better.

$$SS_{res} = \sum_{i=1}^{n} (y'_i - y_i)^2$$

SS_tot is the Total Sum of Squares and it represents the distance between the actual values (`labels`) and the mean value line. In other words, the higher the SS_tot, the more variable the `labels` are, making it more difficult to accurately predict the value.

$$SS_{tot} = \sum_{i=1}^{n} (y_i - \overline{y})^2$$

Can we calculate these factors using tensors? Yes! This time vectorized formulas are quite simple.

$$SS_{res} = \sum_{i=1}^{n} (Y' - Y)^2$$

$$SS_{tot} = \sum_{i=1}^{n} (Y - \overline{y})^2$$

The value of the coefficient of determination tells us how accurate the linear regression model is. Let's analyze the possible outcomes:

- **R² = 1** - Ideal solution! 🦄 You probably won't see it in real life 😉
- **0 < R² < 1** - Generally speaking, the closer to 1 the better, but anything above 0 is better than using the mean (y) for each guess
- **R² = 0** - This is the same as using the mean (y) for each guess. It's pretty much useless 🤷
- **R² < 0** - The model totally sucks and it's worse than using the mean for each guess! 🙉 The lower, the worse! And it may go down to oblivion...

It isn't fair, because we have literally an infinity (specifically, negative infinity) of poor solutions and only the 0-1 range of somewhat good ones. As one said: *It is what it is.* 🤷
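A tiny worked example with made-up numbers: actual values 2 and 4 (mean 3), predictions 2.5 and 3.5.

$$SS_{res} = (2-2.5)^2 + (4-3.5)^2 = 0.5, \qquad SS_{tot} = (2-3)^2 + (4-3)^2 = 2$$

$$R^2 = 1 - \frac{0.5}{2} = 0.75$$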

Okay, now it's a good time for some code.

The `r2` function is pretty neat and readable. Nx supports all the operations we need out of the box. 😉

```elixir
defmodule LinearRegression do
  defn r2(w, x, y) do
    x = prepend_with_1s(x)
    y_pred = Nx.dot(x, w)

    # SS_res
    res =
      Nx.subtract(y, y_pred)
      |> Nx.pow(2)
      |> Nx.sum()

    # SS_tot
    tot =
      Nx.subtract(y, Nx.mean(y))
      |> Nx.pow(2)
      |> Nx.sum()

    # Coefficient of Determination
    Nx.subtract(1, Nx.divide(res, tot))
  end
end
```

Finally, let's test the trained model! We'll calculate `r2` and `MSE` for the test set and see how it went.

```elixir
LinearRegression.test_mse(x_test, trained_model.weights, y_test)
|> Nx.to_number()
|> IO.inspect(label: "TEST MSE")

LinearRegression.r2(trained_model.weights, x_test, y_test)
|> Nx.to_number()
|> IO.inspect(label: "Accuracy")

# Result
TEST MSE: 236.6131134033203
Accuracy: -2.4155285358428955
```

An `r2` below 0 means it's worthless! Why? Take a look at the MSE over epochs graph.

Okay, it seems it stopped learning pretty early.

Dots are the data values, the line is our model. Shame. The prediction line is almost flat, with a slope close to 0. Is Python just better than Elixir at ML? 🙉

There are many possible reasons why the model doesn't train too well. The most common are:

1. **Too few training examples** - the more, the better. In our case, we used ~320 examples, which is very few
2. **Improper features** (`x`) - some features are misleading, and sometimes there are too many or too few. We used just one feature, `displacement`. We can use more, and we will.
3. **Features have "hard" values** - when the features have weird, varied values, it's much tougher for the model to find optimal weights. For instance, `displacement` values land between 40 and 230, `horsepower` between 70 and 440, and `weight` between 0.9 and 2.5.
4. **Improper learning rate** - when it's too low, the model won't train fast enough; when it's too big, it'll overshoot and go to infinity

We can't do anything about point 1, but we can use more features and do something about the values. Point 4 also seems worth trying, but that's a topic for another post.

Adding more features seems like a good idea, but their values vary wildly from each other, so the output would probably be even worse. That's why we'll first try to make the values friendlier for the ML model and put them on a similar scale.

```elixir
defmodule LinearRegression do
  defn standardize_x(x, mean, std_dev) do
    x
    |> Nx.subtract(mean)
    |> Nx.divide(std_dev)
  end
end

x_mean = LinearRegression.x_mean(x)
x_std_dev = LinearRegression.x_std_dev(x)

x_std = LinearRegression.standardize_x(x, x_mean, x_std_dev)
x_test_std = LinearRegression.standardize_x(x_test, x_mean, x_std_dev)
```

In this case, we'll use standardization, also known as the standard score or Z-score. Keep in mind that **when you apply feature scaling for training, you need to scale the test and prediction features as well**! That's why you need to keep `x_mean` and `x_std_dev` around for making predictions.
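For reference, the standard score computed by `standardize_x/3` is:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ the standard deviation of the training feature. With illustrative numbers, for $x = 300$, $\mu = 194$, and $\sigma = 104$: $z = (300 - 194)/104 \approx 1.02$.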

Ready for tests with standardized features? Remember, we'll use just `displacement`, but standardized. Okaay... let's go!

```elixir
model = %Model{
  alpha: 0.1,
  epochs: 200
}

trained_model = LinearRegression.train(model, x_std, y)

LinearRegression.test_mse(x_test_std, trained_model.weights, y_test)
|> Nx.to_number()
|> IO.inspect(label: "TEST MSE")

LinearRegression.r2(trained_model.weights, x_test_std, y_test)
|> Nx.to_number()
|> IO.inspect(label: "Accuracy")

# Result
TEST MSE: 19.801504135131836
Accuracy: 0.6604369878768921

%Model{
  weights: #Nx.Tensor<
    f32[2][1]
    EXLA.Backend<host:0, 0.3524400666.4193124372.238386>
    [
      [23.306041717529297],
      [-6.024066925048828]
    ]
  >,
  ...
}
```

Now we're talking! `MSE` is ~20 and `r2` is ~0.66, which means the accuracy is about 66%. Not too shabby!

It looks like it found the optimal weights in just ~30 iterations. Nice!

And here is the prediction line. Looks pretty good!

One more test to go. This time we'll use `displacement`, `horsepower`, and `weight`. My gut tells me all of them are meaningful for fuel consumption; let's check if it affects the model's performance. We'll apply standardization as in the previous step. Below is the `x_test` tensor before and after standardization.

```elixir
# Before standardization
#Nx.Tensor<
  f32[78][3]
  EXLA.Backend<host:0, 0.3524400666.4193124372.233872>
  [
    [96.0, 122.0, 1.149999976158142],
    [138.0, 351.0, 1.9774999618530273],
    [170.0, 360.0, 2.3269999027252197],
    ...
  ]
>

# After standardization
#Nx.Tensor<
  f32[78][3]
  EXLA.Backend<host:0, 0.3524400666.4193124372.238814>
  [
    [-1.4333690404891968, -0.9660688042640686, -0.7269482612609863],
    [-1.2810755968093872, -0.8996133208274841, -0.8212781548500061],
    [-0.8495774865150452, -0.8996133208274841, -1.3208767175674438],
    ...
  ]
>
```

And here are the results of the final test.

```elixir
TEST MSE: 18.814205169677734
Accuracy: 0.679672122001648

%Model{
  weights: #Nx.Tensor<
    f32[4][1]
    EXLA.Backend<host:0, 0.3524400666.4193124372.240355>
    [
      [23.568984985351562],
      [-1.7338991165161133],
      [-1.9251408576965332],
      [-3.1087865829467773]
    ]
  >,
  alpha: 0.1,
  ...
}
```

**Accuracy is ~68%** - just a little bit better. It's worth peeking at the graph showing predictions with respect to `displacement`.

It isn't too straight, is it? 😉

We've already discussed everything except the most important part, at least from the end-user perspective - using the model to **make predictions**. Let's do it!

```elixir
def predict(%Model{weights: w}, x) do
  x
  |> prepend_with_1s()
  |> Nx.dot(w)
end
```

`predict` takes the trained model, prepares the features, and multiplies them (matrix multiplication) by the `weights`. Simple as that!

Friendly reminder: **use the mean and standard deviation calculated for the training features to standardize the prediction features**. Otherwise, it won't work too well, delicately speaking...
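For a single new car, the flow looks like this sketch (the `250.0` displacement value is made up for illustration; `x_mean`, `x_std_dev`, and `trained_model` come from the training steps above):

```elixir
new_x = Nx.tensor([[250.0]])

# standardize with the TRAINING mean and std dev, then predict
new_x_std = LinearRegression.standardize_x(new_x, x_mean, x_std_dev)
LinearRegression.predict(trained_model, new_x_std)
# a f32[1][1] tensor with the predicted MPG
```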

```elixir
x_std = LinearRegression.standardize_x(x, x_mean, x_std_dev)
y_pred = LinearRegression.predict(trained_model, x_std)

Nx.concatenate([y_pred, y], axis: 1)

# Result
#Nx.Tensor<
  f32[392][2]
  EXLA.Backend<host:0, 0.3093086062.301596692.234332>
  [
    [18.191600799560547, 18.0],
    [15.069494247436523, 15.0],
    [17.279438018798828, 18.0],
    [17.566789627075195, 16.0],
    [18.015308380126953, 17.0],
    [9.754270553588867, 15.0],
    [8.194624900817871, 14.0],
    [8.847555160522461, 14.0],
    [7.699560165405273, 14.0],
    [12.580235481262207, 15.0],
    [14.630147933959961, 15.0],
    [15.787235260009766, 14.0],
    [14.542545318603516, 15.0],
    [12.289548873901367, 14.0],
    [27.534603118896484, 24.0],
    [24.272138595581055, 22.0],
    [24.361839294433594, 18.0],
    [25.539608001708984, 21.0],
    ...
  ]
>
```

Finally, we made use of the trained model and made predictions for the whole data set. As you can see in the resulting tensor, sometimes the model is almost perfect (see the first two value pairs), and other times it's rather poor (9.75 vs 15). Anyway, it's not bad overall!

The graph above shows the actual MPG (`y'`) as a blue line and the predictions (`y`) as red dots. Before plotting the graph, I sorted the data by `y'` to make it more readable. The model performs pretty well for smaller `y'` and worse for `y' > 35`.

Can it achieve better accuracy than 68%? I think so! It just needs some tuning, which we'll do in a future post. For now, it's time to wrap up.

To be honest, this post is way more extensive than I planned, but I wanted to explain everything so it's clear for you and... future me 😉 Now I'll try to condense everything into this bullet-point list:

- Linear Regression is pretty decent at predictions in some cases, especially when using multiple weights. In this case, it ended up with an **R² of about 0.68** (in other words, an accuracy of 68%), which is nice!
- Machine Learning is all about data, numbers, and math operations. That's why using dedicated libraries like Elixir **Nx** and operating on numbers using matrices is so important.
- **Feature scaling such as standardization** helps you get better results and get closer to the optimal solution much faster. If you decide to standardize the training features, remember to do the same for the prediction features as well - that's why you need to store the mean and standard deviation determined for the training feature set.
- The **Gradient Descent** technique is a bit tricky, but it's widely used in the ML world. It's worth learning.
- Optimizing the ML model is essential, especially when it's going to be used on big data.
- **Determining the accuracy of the model is crucial**, so it's important to find the right testing method for the particular model. For linear regression, we used the **coefficient of determination** (R²).
- The point of this post was to write the model from scratch and learn how things work at a low level.
- Bonus: **Nx (& Elixir) did very well!** 💜 The functions are quite intuitive to use. Writing the code was fun and... much easier than describing everything in this post! 😅

I decided to go with Elixir 💜 and its TensorFlow-like library - Nx! The foundations are the same for all languages and libraries. There's a lot of math. That's a bit scary. But the journey and results are so exciting! So don't worry and let's start the ML journey!

This might sound trivial, but have you thought about what ML is really about? The ultimate goal of an ML model is to **guess the result for the given input**. There are plenty of real-life examples from many different fields, for instance:

- What's the predicted **gender** (result) for the given **height** and **weight** (input)?
- Tell me what **number** (result) is on **the image** (input)
- Is this **email** (input) **spam** (result)?
- AI, tell me **everything you know** (result) about the **Elixir programming language** (input)

I'd like to highlight the word **guess** here. ML problems are often complex, and you will almost never get 100% certainty that the answer is correct. Roughly, we can say that accuracy above **80%** is good enough, but it strongly depends on the problem being solved.

In the ML world, we call the **input** a set of **features**. The **result** is called a **label**. An individual set of features is called an **example**. So, for the first example of predicting gender based on height and weight, it would look something like this:

```elixir
features = [
  # [height (cm), weight (kg)]
  [180, 75], # 1st example
  [159, 56], # 2nd example
  ...
]

labels = [
  # [gender: 0 - man, 1 - woman]
  [0], # for the 1st example
  [1], # for the 2nd example
  ...
]
```

As you can see, order matters in both directions. Individual features should be placed in specific columns (vertical order). And remember that labels are linked to a particular set of features (horizontal order).

Notice that in the example above we represent gender as a number - that's because ML speaks in numbers. Transforming non-numerical values into numbers is called **encoding**.
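A minimal encoding sketch (the mapping itself is arbitrary - any consistent numeric assignment works):

```elixir
# map each category to a number, consistently across the whole data set
encode_gender = fn
  "man" -> 0
  "woman" -> 1
end

Enum.map(["man", "woman", "woman"], encode_gender)
# => [0, 1, 1]
```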

To predict the result correctly, you need a pre-trained **ML model**. A model is an algorithm - a super math formula. For most problems, it's too tough for humans to determine it, so we incorporate machines - our computers 😉

When we solve a math problem, it's about applying the math formula and calculating the factors; then we can calculate the result. Let's dig into one of the simplest and most useful functions - the linear function.

`y = ax + b`

`y` is the result, `x` is an input. `a` and `b` are factors we call **weights** in ML. If you have at least two **examples** (*x-y* pairs), you can figure out the weights (*a* and *b*), and with them you can calculate any *y* for the given *x*.
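For instance, with two hypothetical examples (made-up height-weight pairs), we can solve for `a` and `b` directly with plain arithmetic - no ML library needed:

```elixir
# Two known examples: {x, y} pairs (hypothetical height-weight data)
{x1, y1} = {159, 56}
{x2, y2} = {180, 75}

# From y = ax + b applied to both points:
a = (y2 - y1) / (x2 - x1)
b = y1 - a * x1

# Now we can predict y for any x
predict = fn x -> a * x + b end
IO.inspect(predict.(170))
```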

That's simple, isn't it? We don't need ML and all the hard stuff at all, right? 😉 Well, in the math field or ideal world - **yes**! 👍 But the real world is a bit more complicated...

Math formulas alone won't solve many real problems by themselves, because those problems are too complex. But, with meaningful data (features with labels), we can apply some math operations and let the machine figure out the **weights** that can be used for **predictions**.

Let's suppose that we'd like to predict a person's **weight** (label) based on **height** (input). To simplify the task, let's consider **only women**. We can assume that this relationship is somehow linear - larger height = larger weight.

Would linear function work for this? Nope. But linear regression will do. I'll describe it more in the next post. For now, let's assume that it's a kind of "average linear function". This will work for the real data. 👍

We'll use this dataset for our quick analysis. I generated the graph below in Apple Numbers, where you can see how it looks in practice and what's the calculated equation.

Numbers calculated that for the given data `y = ax + b`, `a=0.0578` and `b=95.853` (see top left corner). It calculated the weights (`a` and `b`)! We may say that the "machine learned" and figured it out itself.

Dots are spread out all over the `y` (weight) axis and only a few of them are close to the line representing the linear function we'd use for predictions. It looks like the accuracy is pretty bad. Why?

Is it because the height-weight relationship isn't linear? Long story short: **height alone is insufficient data to accurately determine weight**. It makes sense: in our dataset, women of 172 cm height weigh anywhere between 62 and 116 kg. We can't help it. 🤷

In machine learning, **good quality data** (without invalid, fake, or inappropriate values) and the **number of samples** (the more, the better) **are essential for increasing the accuracy of predictions**.

The dataset we used also contains an *index* column, with values ranging from 1-5, indicating whether the weight is relatively good (3), too low (1), or too high (5). Let's reevaluate the height-weight relationship using linear regression in a spreadsheet, but only for data with an index of 3.

Now `a=0.6232` and `b=-40.047`. And as you can see, the dots are much closer to the prediction line. Much better! 👌

In the weight-height example, we used Apple Numbers for calculating the *weights* (remember? `a` and `b`). In day-to-day work, we'd write code loading the data and **doing calculations based on the data**. We call this step **training**. It's like: *"Okay, here's the data, calculate the weights so the accuracy is good enough"*.

We tested accuracy manually, checking the graphs. It's always cool to plot a graph and do a sanity check 😉. But in practice, we write a math formula checking the error for some features and labels. Then, we use the calculated formula to predict the known *label* for the given test *features*.

To make this work, we split the features-label data we have into **training** and **test** sets, usually, it's 90% for training and 10% for testing.
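A rough sketch of such a split (the dataset here is a placeholder; shuffling before splitting avoids ordering bias):

```elixir
# Hypothetical dataset of 100 examples; shuffle, then take 90% for training
examples = Enum.to_list(1..100)

split_at = round(length(examples) * 0.9)
{training_set, test_set} = examples |> Enum.shuffle() |> Enum.split(split_at)

IO.puts("training: #{length(training_set)}, test: #{length(test_set)}")
# training: 90, test: 10
```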

- **Quality and quantity of the data used for ML are essential to get good accuracy**
- Providing **meaningful features significantly increases the accuracy.** The analysis worked poorly for just height. After involving the index, it gave much better results. Imagine how providing BMI, body fat percentage, or waist size could affect the accuracy
- Even with super polished data, **it's still predicting** - you'll almost never get 100% accuracy
- Bonus: A spreadsheet app may also do some simple ML-ish stuff for you 😉

The foundational building block is a **tensor**. It's a data structure that looks like a number or, more often, an array. Tensors are created from plain values by a dedicated ML library, like Nx for Elixir.

Check tensor types based on **dimensions** in the table below. The most common tensors in ML are 2D tensors - **matrices**.

| Dimension | Type | Example |
| --- | --- | --- |
| 0 | Scalar | `123` |
| 1 | Vector | `[1, 2, 3]` |
| 2 | Matrix | `[[1], [2], [3]]` |
| n | n-dimensional tensor | `[[[1], [2]], [[3], [4]]]` |

And let's check out what the tensors look like in Nx.

```elixir
> scalar = Nx.tensor(1.0)
#Nx.Tensor<
  f32
  1.0
>

> vector = Nx.tensor([1, 2, 3])
#Nx.Tensor<
  s64[3]
  [1, 2, 3]
>

> matrix = Nx.tensor([[1], [2], [3]])
#Nx.Tensor<
  s64[3][1]
  [
    [1],
    [2],
    [3]
  ]
>
```

The `Nx.tensor/1` function returns an `Nx.Tensor` struct, which looks a bit... dull 🤷 It doesn't seem to be a big deal compared to plain numbers or lists. But **IT IS a big deal!** Why? For these two reasons:

- You can easily **perform any of the many ML operations** from the Nx library on tensors
- **Nx is optimized for "crunching numbers"** and it's much faster than using plain numbers or lists. Also, you can use your GPU for calculations, which speeds things up even more 🚀

Let's make it clear: You **can** do Machine Learning using plain numbers and lists. But it's much more painful and less performant than learning and using Nx.

We use tensors for all data we use in our ML project: *features*, *labels,* and *weights.*

Okay, now you know that the tensor is essential in ML. Let's take a closer look at how it works under the hood. We'll use the `matrix` tensor as an example.

```elixir
> matrix = Nx.tensor([[1], [2], [3]])
> Map.from_struct(matrix)
%{
  data: %Nx.BinaryBackend{
    state: <<1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0>>
  },
  type: {:s, 64},
  names: [nil, nil],
  shape: {3, 1}
}
```

Now you can see that the data is stored as a binary - that's where the performance comes from. Each tensor has a **type** for the stored numbers. In the example above it's `{:s, 64}`, or `s64`, which is a *signed 64-bit integer*. You can specify the type explicitly when creating a tensor.
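As a side note, we can decode that binary ourselves with plain Elixir binary pattern matching - no Nx needed - to confirm it really is three little-endian signed 64-bit integers:

```elixir
# The same 24 bytes Nx stored for our matrix tensor
state = <<1, 0, 0, 0, 0, 0, 0, 0,
          2, 0, 0, 0, 0, 0, 0, 0,
          3, 0, 0, 0, 0, 0, 0, 0>>

# Decode each 8-byte chunk as a little-endian signed 64-bit integer
values = for <<n::little-signed-integer-size(64) <- state>>, do: n

IO.inspect(values) # [1, 2, 3]
```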

Another attribute, `names`, stores names for the axes. Names work like aliases and are optional.

The **shape** is very important. It informs you about the **size of each axis**. `{3, 1}` means the tensor has 3 rows and 1 column. Matrices and shapes are used a lot, so keep in mind this mantra: *"row-column, row-column, row-column, ..."*

The shape is **essential for many math operations** that are performed on tensors, like matrix multiplication. Believe me, tensors are going to be transformed a lot - concatenated, reshaped, multiplied, split, etc. But more on this in the next post 😉

I know there was a lot of "talking" and just a few lines of code, but I think it might be useful before diving deeper. I hope this post may encourage you to take a closer look at ML and maybe even with Nx and Elixir..? 😉

But before the end, let's wrap up some concepts:

- The main purpose of Machine Learning (from an end-user perspective) is to **predict some information** based on the given data
- You can try to predict (better or worse) almost anything based on any data, as long as you are able to create an ML model for it (from a data-engineer perspective)
- The better, more meaningful, and larger the data you provide, the better results you'll get
- You will (almost) never achieve 100% accuracy in your predictions; roughly speaking, accuracy above 80% is considered good enough, but it depends on the particular case
- Although it's possible to solve ML problems using plain data structures provided by a programming language, it's definitely worth learning a dedicated library like PyTorch, TensorFlow (Python), TensorFlow.js (JavaScript), or the mentioned Nx (Elixir)
- Nx and Elixir work great with numbers and ML 💜

Let's assume we store the state for each filter set as an array. So when a checkbox is checked, its value is added to the array. It looks like this:

```javascript
let board = []

// checking "All Inclusive" and "Breakfast"...
board = ['allInclusive', 'breakfast']

// unchecking "Breakfast"
board = ['allInclusive']

// etc.
```

Additionally, we wanted to push the array with the state to Redux after each change, using the `onChange` callback. I hope that's clear so far. Let's get to the code and the 1st iteration!

Let's start simple and implement storing state using the `useState` hook, calling the `onChange` callback inside `useEffect`. `selectedBoard` is **the array** I mentioned earlier. It contains board options (IDs). Updating the state is performed by `setSelectedBoard` with the helper function `toggleItem`.

```typescript
const [selectedBoard, setSelectedBoard] = useState<string[]>([])

useEffect(
  () => { onChange(selectedBoard) },
  [selectedBoard, onChange]
)
```

```tsx
<input
  type="checkbox"
  name={ board.name }
  value={ board.id }
  checked={ isItemChecked(selectedBoard, board.id) }
  onChange={ e => toggleItem(setSelectedBoard, e.target.value) }
/>
```

Nothing fancy. It just works 🙂 The problem is that we'll have unnecessary code repetition in the future: with this approach, we have to copy the `useState` and `useEffect` block for each filter component...

Fortunately, we can easily fix this by extracting the hook logic to a more generic **custom hook**. That's what we'll do in the 2nd step.

The second step is all about extracting the hook-related code to a separate file, wrapping it in a `useCheckboxes` function (naming is up to you; by convention it starts with `use`), and returning an object with functions related to the state.

Actually, `useCheckboxes` could return anything, like an array, instead of an object. In this case, an object makes sense, since I expect `useCheckboxes` to keep growing, so it's much easier to import its functionalities via object keys.

As you can see, it uses the same `useState` and `useEffect` logic, with the generic name `Items`.

```typescript
export const useCheckboxes = ({ onChange, defaultItems }: Props) => {
  const [selectedItems, setSelectedItems] = useState<string[]>(defaultItems)

  const isItemChecked = (itemId: string): boolean => {
    return selectedItems.includes(itemId)
  }

  const toggleItem = (itemId: string): void => {
    setSelectedItems((prevState: string[]) => {
      if (prevState.includes(itemId)) {
        return prevState.filter(item => item !== itemId)
      } else {
        return [...prevState, itemId]
      }
    })
  }

  useEffect(
    () => { onChange(selectedItems) },
    [selectedItems, onChange]
  )

  return { isItemChecked, toggleItem }
}
```

To use `useCheckboxes` in the Board component, we just need to import the functions and make some minor updates.

```tsx
const { isItemChecked, toggleItem } = useCheckboxes({ onChange, defaultItems: [] })

...

<input
  type="checkbox"
  name={ board.name }
  value={ board.id }
  checked={ isItemChecked(board.id) }
  onChange={ e => toggleItem(e.target.value) }
/>
```

And that's it! Now we have a generic `useCheckboxes` which can be easily used anywhere, without code repetition! The code is much easier to maintain.

Also, it brings us a great bonus - now the checkbox logic can be easily tested.

```typescript
import { renderHook, act } from '@testing-library/react-hooks'
import { useCheckboxes } from '../useCheckboxes'

const onChange = jest.fn()

it('#isItemCheck returns correct value', () => {
  const { result } = renderHook(() =>
    useCheckboxes({ onChange, defaultItems: ['XYZ'] })
  )

  expect(result.current.isItemChecked('XYZ')).toBeTruthy()
  expect(result.current.isItemChecked('ABC')).toBeFalsy()
})

it('#toggleItem toggles the given value and calls onChange', () => {
  const { result } = renderHook(() =>
    useCheckboxes({ onChange, defaultItems: ['XYZ'] })
  )

  expect(onChange).toHaveBeenCalledTimes(1)
  expect(result.current.isItemChecked('XYZ')).toBeTruthy()
  expect(result.current.isItemChecked('ABC')).toBeFalsy()

  act(() => result.current.toggleItem('ABC'))

  expect(onChange).toHaveBeenCalledTimes(2)
  expect(result.current.isItemChecked('ABC')).toBeTruthy()
  expect(result.current.isItemChecked('XYZ')).toBeTruthy()
})
```

That's pretty nice! But that's not the end. Let's suppose that we need to add some more features, like clearing individual/all items (in some cases without calling `onChange`), pulling state from Redux, etc.

Handling state became more complex, so we can do one more refactor - use React's `useReducer`. And we'll do it!

The `useReducer` hook manages state very similarly to `Redux`. It dispatches actions, which are handled by the reducer, which returns a new state.

The biggest advantage of this approach is that we have a finite set of actions and we can get rid of `useEffect`, which by its asynchronous nature sometimes leads to unpredictable behavior (especially when managing a complex state).
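To illustrate the pattern outside of React: a reducer is just a pure function from `(state, action)` to a new state. This framework-free sketch (the action name and item IDs are mine, for illustration only) shows the idea:

```javascript
// A reducer is a pure function: (state, action) -> new state
const reducer = (state, action) => {
  switch (action.type) {
    case "TOGGLE_ITEM":
      return state.includes(action.itemId)
        ? state.filter((item) => item !== action.itemId)
        : [...state, action.itemId];
    default:
      return state;
  }
};

// Every state change is an explicit, traceable action
let state = [];
state = reducer(state, { type: "TOGGLE_ITEM", itemId: "allInclusive" });
state = reducer(state, { type: "TOGGLE_ITEM", itemId: "breakfast" });
state = reducer(state, { type: "TOGGLE_ITEM", itemId: "breakfast" });

console.log(state); // state is now ["allInclusive"]
```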

As I mentioned, the reducer is a Redux-like function taking `state` and `action` as arguments. For clarity, I listed just part of the code.

```typescript
export enum ActionType {
  TOGGLE_ITEM = 'TOGGLE_ITEM',
  CLEAR_ITEM = 'CLEAR_ITEM',
  ...
}

export const useCheckboxesReducer = (state: string[], action: useCheckboxesReducerAction) => {
  switch (action.type) {
    case ActionType.TOGGLE_ITEM: {
      const { itemId } = action
      const newState = state.includes(itemId)
        ? state.filter(item => item !== itemId)
        : [...state, itemId]
      action.onChange(newState)
      return newState
    }
    case ActionType.CLEAR_ITEM: {
      const newState = state.filter(item => item !== action.itemId)
      action.onChange(newState)
      return newState
    }
    ...
```

BTW, I like using an `enum` for declaring action types. It's easy to export/import, etc. Bear in mind, it's a TypeScript feature.

Now writing a separate reducer pays off - `useCheckboxes` becomes a very lean function 🚀. Its main goal is dispatching actions with the demanded payload.

```typescript
export const useCheckboxes = ({ onChange, defaultItems }: Props): HookReturn => {
  const [checkedItems, dispatch] = useReducer(useCheckboxesReducer, defaultItems)

  const isItemChecked = (itemId: string): boolean =>
    checkedItems.includes(itemId)

  const toggleItem = (itemId: string): void =>
    dispatch({ type: ActionType.TOGGLE_ITEM, itemId, onChange })

  const clearItem = (itemId: string): void =>
    dispatch({ type: ActionType.CLEAR_ITEM, itemId, onChange })

  ...

  return { isItemChecked, toggleItem, clearItem, ... }
}
```

We went on a journey from simple state management with `useState` and `useEffect`, through extracting it to a custom hook, to more complex code with `useReducer`.

We DRY-ed up the code and made it more maintainable and easier to test. Is this a good approach? I think so. But only when you treat it as a path. A custom React Hook using `useReducer` is definitely not a silver bullet, and you should end up with that solution only when it makes sense. Wondering when? 🤔

My thoughts on this:

- Managing a simple state, related to just one component? Basic `useState` and `useEffect` can make it.
- The state logic is relatively simple, but you use it in many components? Extract it to a custom hook!
- When managing the state is like bull riding, rewrite the code with `useReducer`.

Just be pragmatic 👍

Let's assume we want to add some tags to a blog post. Tags are already available in the database. We have a checkbox list of the tags. All we need is to bind them in the database.

Since one post may have many tags and a tag can belong to many posts, it's a classic many-to-many relationship. But we certainly don't want to manually create a join record for each post-tag association... Let's look at what Ecto has for us.

Our strategy is to apply changes to a post, fetch tags, and "put" them into the post. Getting this in the Elixir-ish way...

```elixir
attrs_from_form
|> post_changeset
|> fetch_tags_by_ids
|> put_tags_to_post
|> insert_to_db
```

As you can see, we have two additional steps due to the association of the tags: `fetch_tags_by_ids` and `put_tags_to_post`. They appear to be strongly coupled, so we should combine them into one function. Let's proceed with `put_tags` and have it perform both fetching the tags and adding them to the post.

Now let's translate this to real code.

```elixir
@spec create_post(map()) :: {:ok, %Post{}} | {:error, %Ecto.Changeset{}}
def create_post(attrs \\ %{}) do
  %Post{tags: []}
  |> Post.changeset(attrs)
  |> put_tags(attrs)
  |> Repo.insert()
end

@spec put_tags(%Ecto.Changeset{}, map()) :: %Ecto.Changeset{}
defp put_tags(changeset, %{"tag_ids" => tag_ids}) do
  tags = get_tags(tag_ids)

  changeset
  |> Ecto.Changeset.put_assoc(:tags, tags)
end

@spec get_tags([binary()]) :: [%Tag{}]
def get_tags(ids) do
  from(t in Tag, where: t.id in ^ids)
  |> Repo.all()
end
```

The key function here is `Ecto.Changeset.put_assoc`. It takes the post changeset, the name of the association, and the collection of tags we want to put into the post. Keep in mind that `put_assoc` **works on the whole collection**. In other words, it completely replaces the old collection with a new one! By default, though, it won't make any change and will raise an error instead.

To make this work, we need to set `on_replace` on the parent (Post) schema to `:delete`, since we want the ability to remove tags from the post.

```elixir
schema "posts" do
  ...
  many_to_many :tags, Blog.Content.Tag,
    join_through: Blog.Content.PostTag,
    on_replace: :delete
end
```

Now it works! And we're... almost done. Almost? There's one more issue. What if the post doesn't have any tags? In such a case, the current implementation of `put_tags` won't work. It won't even get pattern-matched, because of the missing `"tag_ids"` key.

Fortunately, we can easily fix this. The following one-liner function does the work. Remember to put it below the previous implementation of the function; otherwise, it'll always get pattern-matched first!

```elixir
defp put_tags(changeset, _), do: changeset
```

Hmm... the name `put_tags` doesn't quite fit this case anymore. It's a matter of style, but let's do the last step and rename it to `maybe_put_tags` to make it clearer.

And that's it! The final implementation looks like this. Just remember about the `on_replace: :delete` option on your tags association in the Post schema.

```elixir
@spec create_post(map()) :: {:ok, %Post{}} | {:error, %Ecto.Changeset{}}
def create_post(attrs \\ %{}) do
  %Post{tags: []}
  |> Post.changeset(attrs)
  |> maybe_put_tags(attrs)
  |> Repo.insert()
end

@spec maybe_put_tags(%Ecto.Changeset{}, map()) :: %Ecto.Changeset{}
defp maybe_put_tags(changeset, %{"tag_ids" => tag_ids}) do
  tags = get_tags(tag_ids)

  changeset
  |> Ecto.Changeset.put_assoc(:tags, tags)
end

defp maybe_put_tags(changeset, _), do: changeset

@spec get_tags([binary()]) :: [%Tag{}]
def get_tags(ids) do
  from(t in Tag, where: t.id in ^ids)
  |> Repo.all()
end
```

Recently, I needed to change the styling of a "sticky" element (`position: sticky`) at the moment when it gets pinned. It was a filter section, which I wanted to wrap up into a small bar. Thanks to that, users on mobiles have easy access to filters, even if they scrolled down to the very end of the items list.

The question is: **how to detect when the "sticky" element gets pinned**? The answer is: `IntersectionObserver`.

Let's consider a super simple case: we have a `.filters-panel` div containing the filter form. When a user scrolls down and the filters go out of view, we'd like to add the `pinned` CSS class to `.filters-panel`. The `.observer-point` element below is not here by accident. We'll discuss its purpose later.

```html
<div class="filters-panel"> <!-- When it goes out of the viewport, add `pinned` class -->
  <form class="filters-form">
    <!-- FILTERS -->
  </form>
  <button class="show-filters">Show filters</button> <!-- Hidden by default -->
</div>
<div class="observer-point"></div>
```

Thanks to the `pinned` class, we can style the filters panel to hide the filter form, show the `Show filters` button, and do some other styling. Roughly, the CSS (actually, let's go with Sass) could look like this:

```sass
.filters-panel
  +mobile
    &.pinned
      position: sticky

      form.filters-form
        display: none

      button.show-filters
        display: block

.observer-point
  height: 0px
```

You got the idea, right? So now, let's get this working!

Probably the simplest JS for this looks like this:

```javascript
const observer = new IntersectionObserver((entries) => {
  const sortingPanel = document.querySelector(".filters-panel")

  if (entries[0].isIntersecting) {
    sortingPanel.classList.remove("pinned")
  } else {
    sortingPanel.classList.add("pinned")
  }
})

observer.observe(document.querySelector('.observer-point'))
```

It could be explained like this: *when the `.observer-point` element disappears, add the `pinned` class to `.filters-panel`. Otherwise, remove it.*

Pretty simple, but why does it observe some `.observer-point` instead of `.filters-panel` directly? Basically (depending on the implementation), without `.observer-point` it may run into an infinite loop like: *"Filters element disappears? Pin it! Oh, it is shown now... So unpin it. Disappeared? Pin it!"* etc.

This issue causes flickering - an ugly one! The simple trick of observing an additional element like `.observer-point` solves the issue. Remember to set its height to `0px`. Otherwise, it won't work.

Last but not least: when you have another element fixed to the top (most likely a navbar), just initialize the `IntersectionObserver` with a top margin set to the negative of the navbar's height. You can do this by passing the `rootMargin` property in the second argument.

```javascript
const observer = new IntersectionObserver((entries) => {
  ...
}, { rootMargin: '-60px 0px 0px 0px' }) // assuming the navbar's height is 60px
```

With this trick, `.observer-point` **will be detected before disappearing under the navbar**.

I believe that's the simplest, yet practical, example of using **IntersectionObserver**. It has many more functionalities; you can read about them in the documentation.

I think it's a bit less intuitive than using the scroll event and doing calculations on the fly... But it's more elegant and efficient. The `scroll` event is so spammy - it's better to avoid it when you can 🙂

CSV is the most universal way of storing data in a text file. You can open a CSV with almost any application: *Excel*, *Numbers*, *Notepad*. CSV for programming languages is a piece of cake too.

**CSV files work great as bridges between different platforms** and tools because of their universality and simplicity. That's why dealing with the CSV format is a must-have for a software developer.

Okay, enough talking. Let's look at how simple **importing data from CSV to a Ruby on Rails** app is.

I'm a big fan of practical examples. And dogs. Let's combine these hobbies and import dog data from CSV to a database using Ruby!

We'll import dog breeds from this CSV file. At the moment, it contains 361 rows of data plus a header row. Let's assume that we want to import all columns except `id`, since `ActiveRecord` will deal with it.

```ruby
require "csv"

csv_text = File.read(path_to_csv_file)
csv = CSV.parse(csv_text, headers: true)

csv.each do |row|
  DogBreed.create!(row.to_hash.except("id"))
end
```

Just 5 lines of code and we're done! Nice! But that's a bit boring. Yeah. Let's do the same thing with PostgreSQL and its secret weapon - `COPY`!

Ruby (on Rails) makes us happy and lazy at once. Using `CSV.parse` is a no-brainer. And it's great! But the general rule is:

Database engines deal better with data.

Okay, let's get to the Postgres CSV import.

```ruby
ActiveRecord::Base.connection.execute(
  <<-SQL
    CREATE TEMPORARY TABLE temp_dog_breeds (
      id smallserial,
      name varchar,
      section varchar,
      provisional date,
      country varchar,
      url varchar,
      image varchar,
      pdf varchar
    );

    COPY temp_dog_breeds FROM '#{path_to_csv_file}' WITH (FORMAT CSV, HEADER);

    INSERT INTO dog_breeds (
      name, section, provisional, country, url, image, pdf, updated_at, created_at
    )
    SELECT
      name, section, provisional, country, url, image, pdf, NOW(), NOW()
    FROM temp_dog_breeds;

    DROP TABLE temp_dog_breeds;
  SQL
)
```

The code copies data from the CSV to the database using the `COPY` command. It's a pretty simple operation; unfortunately, the data in the CSV is not compliant with the `dog_breeds` schema. The CSV file contains `id`, which we don't want to copy, and lacks the `updated_at` and `created_at` timestamps.

That's why I created a temporary table as an intermediary. So the workflow looks like this:

1. Create a temp table analogous to the CSV data
2. Copy the CSV data to the temp table
3. Populate the desired table using the temp table data, with additional timestamps and without the id
4. Delete the temp table

Actually, we could cheat a bit and simplify the operation above. We could just remove the `id` column from the CSV and add `updated_at` and `created_at` columns 😇. In that case, dealing with the temporary table would be unnecessary.
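That preprocessing could be sketched like this (the column names and sample rows below are made up for illustration; the real file has more of both):

```ruby
require "csv"
require "time"

# Hypothetical input resembling the original CSV (with an "id" column)
input = <<~CSV
  id,name,section,country
  1,Akita,Spitz,Japan
  2,Beagle,Hound,United Kingdom
CSV

now = Time.now.utc.iso8601
rows = CSV.parse(input, headers: true)

# Drop "id" and append timestamps, so the result matches the dog_breeds schema
output = CSV.generate do |csv|
  csv << (rows.headers - ["id"]) + ["updated_at", "created_at"]
  rows.each { |row| csv << row.to_hash.except("id").values + [now, now] }
end

puts output
```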

But hardcoding timestamps is kind of a hack. And we had a nice occasion for using a temporary table.

Is it worth crafting an SQL query for importing CSV data, instead of using Ruby? It's more complex. But do you remember that databases deal better with data than programming languages?

I benchmarked this and here is the result.

```
# 10 times with 2 warmup rounds
SQL:  61.990 (24.2%) i/s - 581.000 in 10.008736s

Comparison:
  SQL:  62.0 i/s
  ruby:  0.6 i/s - 112.11x slower
```

Postgres was about 112 times faster!

Nice! I didn't check the RAM usage, but you can expect that it was much lower for SQL as well 🚀

There's no doubt that Postgres `COPY` **smashes Ruby in terms of CSV import performance**. So you should always import CSV data with SQL then! **Well... no**.

For **simple cases** like the one with dog breeds, **I'd go with Ruby**. It's a small dataset, and we can expect that it doesn't change too frequently over time. Be pragmatic.

When it comes to **big datasets** and **frequent CSV imports**, **Postgres is much better**. But remember it costs you complexity and time. Not only for crafting the query but also for maintaining it.

In a nutshell, every operation under the transaction block has to be successful. What if something went wrong? Then every operation is rolled back. In other words, SQL transaction makes operations **atomic**, so they're treated as an indivisible whole.

"Do everything, but when something is wrong, do nothing"

A classic example is sending and withdrawing money. If a crappy ATM doesn't withdraw your money, you should keep your bank account balance unchanged. That's where the transactions shine.

In a Rails world, we could describe a transaction as:

```ruby
ActiveRecord::Base.transaction do
  me.send_money_to!(id: "Mr-Robot", amount: 100)
  # 🐛 - "I can break it!"
  mr_robot.credit!(100)
end
```

In the example above, I send $100 to Mr. Robot. It subtracts 100 from my balance and adds it to Mr. Robot's account. Thanks to the transaction, if the 🐛 activates at the 2nd line, it'll roll back the action above and give me back my 100 bucks. I won't lose the money! But Mr. Robot won't get it either.

To trigger rollback, an error has to be raised. Remember to use methods raising exceptions, prepended by convention with "!"

A Rails transaction is nothing fancy. Under the hood, it uses an SQL transaction block. So it's just `BEGIN`, SQL operations, and at the end `COMMIT` (successful case) or `ROLLBACK` (well, less successful case).

```sql
BEGIN
-- do some SQL stuff
-- ... and here
COMMIT -- when everything was fine!

-- when not, then...
ROLLBACK
```

Let's look at another application for Rails transactions. It's pretty trivial but shows that atomic actions are not reserved just for database operations.

Let's suppose we want to subscribe a user to a newsletter. We want to activate the user's subscription and call a third-party service to add the user to a mailing list.

```ruby
ActiveRecord::Base.transaction do
  user.confirm_subscription!

  response = MailerApi.add_user_to_mailing_list(
    user: user,
    list_id: "newsletter-123"
  ) # 🐛 - "Will screw up this at some point, promise!"

  raise ActiveRecord::Rollback if response.error? # The API call went wrong? Rollback!
end
```

Calls to 3rd party services are particularly vulnerable. In the case above, when something goes wrong with the API call, it triggers a rollback via the `ActiveRecord::Rollback` exception. Remember, a transaction with only "safe" actions (not raising errors, without **!**) is **totally useless**.

Be aware that the transaction block is performed as a whole by the database. This means it keeps the database connection open and blocks it for the whole transaction's lifetime.

That's why the example above isn't the best in terms of performance and concurrency! 🙉

Anyway, make sure to keep your transaction blocks **as lean as possible**.