For a single instance (observation), the generalized form of the likelihood estimate is as follows:
Image: H.S Ölmez - Sabanci University
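In case the image above does not render, the formula it shows can be written in standard notation as follows (my reconstruction; $p_i$ is the predicted probability of class 1 for instance $i$):

$$P(y_i \mid x_i) = p_i^{\,y_i}\,(1-p_i)^{\,1-y_i}, \qquad p_i = \sigma(\beta^T x_i) = \frac{1}{1+e^{-\beta^T x_i}}$$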
" ] }, { "cell_type": "markdown", "metadata": { "id": "ldFYN2zJejvT" }, "source": [ "Veya Kaggle master Kaan hocamızın gösterdiği gibi beta yerine weight anlamında w'ler de kullanılabilir, ki bu şematik gösterim Neural Networks(Sinir Ağları/Derin Öğrenme) anlatımında da karşımıza çıkacak." ] }, { "cell_type": "markdown", "metadata": { "id": "B-G7vxiMejvU" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "Lg1wM2qzejvU" }, "source": [ "### Gradient Descent" ] }, { "cell_type": "markdown", "metadata": { "id": "6Air_6H-ejvU" }, "source": [ "Bunu LinReg notebookunda görmüştük, oraya tekrar bakabilirsiniz. Regresyon değil de classification bağlamında görmek için Kaan hocamızın yine yukarıdaki linkine bakabilirsiniz. Bunlara ek olarak aşağıdaki linklerden de gerek GD detayını gerek manuel implementasyonu görebilirsiniz.\n", "\n", "- https://towardsdatascience.com/logistic-regression-explained-and-implemented-in-python-880955306060\n", "- https://realpython.com/logistic-regression-python/" ] }, { "cell_type": "markdown", "metadata": { "id": "bNPeXu-vejvU" }, "source": [ "\"Bana kısaca sen anlat\" diyenler için şöyle özetleyeyim.\n", "\n", "- Öncelikle \"katsayıların ilk değerlerine ne verelim\" sorusuyla başlanır. Bunun için bazı teknikler var, diyelim ki 0.01 verdik ve $\\beta_0$(bias) için de 0 dedik.(Not: Neural Networklerdekinin aksine LogReg'de ağırlıklar 0 verilerek başlatılabilir)\n", "- x'ler ile betalar(weightler) çarpılır ve toplanır. Çıkan sonuç, bir aktivasyon fonksiyonu olan sigmoid fonksiyonuna sokulur. Diyelim ki eğitim setinde bir instance'ın classını 0(not-churn/not-spam v.s) tahminledik ve gerçekten de 0'mış(veya 1 dedik ve gerçekten 1 çıktı), o zaman kaybımız(`loss`) 0'dır. Bu işlemin, yani tahminle gerçek değer arasındaki farkın hesaplanma sürecinin, adı **forward propagation**'dır.\n", "- Tüm instancelar için bu loss'ların toplamına da `cost` deniyor. Nihai amaç, cost'un minimize olması.\n", "- Sonra başa dönüp betalar ve bias güncellenir, ki buna da **backward propagation** denir. Güncelleme işlemi de türev alarak gradient descent yöntemiyle yapıyoruz, ta ki eğim(yani türev) 0 olana kadar." ] }, { "cell_type": "markdown", "metadata": { "id": "B52TinbCejvV" }, "source": [ "### Cost function olarak LogLoss(binary cross entropi)" ] }, { "cell_type": "markdown", "metadata": { "id": "clnRnNF0ejvV" }, "source": [ "Cost functionımız yukarıdaki negatif log-likelihood veya diğer adıyla binary cross entropi fonksiyonudur. Buna negative log loss da denmektedir. Yukarıda bahsedilen tüm proses boyunca bu metrik minimize edilmeye çalışılır.\n", "\n", "$$\\large Negatif Log Likelihood = NLL = -\\sum_{i=1}^N{y_i}.log(p_i)$$" ] }, { "cell_type": "markdown", "metadata": { "id": "oDjm4FUbejvV" }, "source": [ "Aşağıda daha detaylı bilgiler edinebilrsiniz ancak özet olarak şunu söyleyebiliriz. Regresyon analizlerinden genelde SSE(Sum of Squared Errors), classficationda ise Log Loss/CrossEntropy Loss optimize edilmeye çalışılır. Tabi classficationda classlara farklı ağırlıklar vererek bu cost functionları modifiye etmek de mümkündür. 
You can see how that is done in the End-to-End ML project.

This is an important topic, and I recommend the additional readings below.

- https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81
- https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a
- https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
- https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23
- https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3
- https://www.analyticsvidhya.com/blog/2019/08/detailed-guide-7-loss-functions-machine-learning-python-code/
- https://www.analyticsvidhya.com/blog/2020/11/binary-cross-entropy-aka-log-loss-the-cost-function-used-in-logistic-regression/
- https://www.data4v.com/log-loss-as-a-performance-metric/
- https://medium.com/konvergen/cross-entropy-and-maximum-likelihood-estimation-58942b52517a
- https://medium.com/@phuctrt/loss-functions-why-what-where-or-when-189815343d3f
- https://algorithmia.com/blog/introduction-to-loss-functions
- https://towardsdatascience.com/understanding-sigmoid-logistic-softmax-functions-and-cross-entropy-loss-log-loss-dbbbe0a17efb
- https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
- http://www.awebb.info/probability/2017/05/18/cross-entropy-and-log-likelihood.html

Finally, it is worth noting: in GridSearch, in addition to metrics like accuracy/precision, we can also pass **neg_log_loss** to the scoring parameter. In other words, neg_log_loss is both a **function** that Logistic Regression tries to optimize and an **evaluation metric**. For more detail, see the Important Points in the GridSearch section of the End-to-End ML project (Part I).

## Manual Implementation

You can find a manual implementation on Kaan's Kaggle page; a minimal sketch follows below.
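That implementation is not reproduced here. Below is only a minimal NumPy sketch of the loop summarized above (variable names, the learning rate, and the synthetic data are my own choices, not taken from that page):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: X is (n_samples, n_features), y is 0/1
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 > 0).astype(float)

w = np.full(X.shape[1], 0.01)  # weights initialized to 0.01 (0 would also work for LogReg)
b = 0.0                        # bias initialized to 0
lr = 0.1                       # learning rate

for epoch in range(1000):
    p = sigmoid(X @ w + b)                    # forward propagation
    eps = 1e-15                               # avoid log(0)
    cost = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad_w = X.T @ (p - y) / len(y)           # gradient of the log loss w.r.t. weights
    grad_b = np.mean(p - y)
    w -= lr * grad_w                          # backward propagation: update step
    b -= lr * grad_b

print(cost, w, b)
```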
## Assumptions

- Unlike LinReg, normally distributed residuals and homoscedasticity are not required
- Unlike LinReg, a linear relationship between predictors and target is not required
- As in LinReg, multicollinearity among features can be a problem when interpreting feature importance
- As in LinReg, instances are expected to be independent of each other
- As in LinReg, there should be no collinearity among features (it does not change the prediction, but it matters for interpreting feature importance) --> see the LinReg notebook for resources on this
- The number of instances should be at least 10-15 times the number of features

## Important points

**General**
- Not a critical detail, but worth repeating since it may come up in interviews: it is not a classification algorithm per se; it is a type of linear regression algorithm used for classification
- The fit curve is S-shaped, but the decision boundary is linear
- Maximum Likelihood (MLE) is maximized (equivalently, the negative log-likelihood cost function is minimized)
- There is no need to LabelEncode the target
- If the data is linearly separable, the MLE objective drives the coefficients to grow without bound in order to expose the separation between the classes. To prevent this, a **penalty** must be used. The strength of this regularization penalty is set with a lambda value (sklearn uses its inverse, "C"). Higher lambda (i.e. lower C) means stronger regularization. This value is tuned by trial and error.
- The default is binary classification, but multi-class prediction is possible by setting the multi_class parameter to **multinomial**
- **SGDClassifier**: SGD is, in general, an optimization method. In that sense, SGDClassifier is a linear model with regularization, optimized via SGD (Stochastic Gradient Descent). sklearn has such a classifier for LogReg: SGDClassifier(alpha=k, penalty='l2', loss='log_loss') (in older sklearn versions the loss was named 'log'). A short usage sketch follows this list.

**Advantages**
- Computational complexity is O(nd) (n instances, d features); it trains fast
- It can be used online/in real time
- High interpretability (via the coefficients)

**Disadvantages**
- It needs a large amount of data for a good optimization and may not do well on small data
- It is sensitive to outliers, which must be handled carefully
- It is sensitive to scaling (this does not change the prediction, but matters for interpretation)
- It has a linear decision boundary. On data that is not linearly separable, which is usually the case, its performance is low. For that reason it usually has lower accuracy than other algorithms, and it is mostly chosen as a baseline/benchmark model rather than the main model.
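As a quick illustration of the two equivalent routes mentioned above, here is a minimal sketch with made-up data (parameter values are arbitrary, not recommendations; `loss='log_loss'` assumes sklearn >= 1.1):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# The classic estimator; C is the inverse of the regularization strength
logreg = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, penalty='l2'))
logreg.fit(X, y)

# The same linear model, optimized with Stochastic Gradient Descent
sgd = make_pipeline(StandardScaler(),
                    SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4,
                                  random_state=42))
sgd.fit(X, y)

print(logreg.score(X, y), sgd.score(X, y))
```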
# Code Practice

## Getting the data, analysis (EDA) and preprocessing

We will examine the Titanic dataset; it is covered in many courses and is also available on Kaggle.

We have various attributes of passengers and whether they survived. Using this information we will build a model and try to predict whether a person in a newly arriving dataset survived.

**Important note**: Two separate datasets are provided, train and test. In sets delivered in two parts like this, the test set usually has no label; we are expected to predict it and then upload the results somewhere. We never see the true values; those are held by the creators of the dataset. That is the case in this example too. This may be a bit confusing, so let's do the following: think of the test set here as field data. We will treat the train data we read as our main data, and split it again into train and test within itself.

The first rows of the train data:

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| \n", " | Type | \n", "Nunique(Excl.Nulls) | \n", "#of Missing | \n", "MostFreqItem | \n", "MostFreqCount | \n", "First | \n", "
|---|---|---|---|---|---|---|
| PassengerId | \n", "int64 | \n", "891 | \n", "0 | \n", "1 | \n", "1 | \n", "1 | \n", "
| Survived | \n", "int64 | \n", "2 | \n", "0 | \n", "0 | \n", "549 | \n", "0 | \n", "
| Pclass | \n", "int64 | \n", "3 | \n", "0 | \n", "3 | \n", "491 | \n", "3 | \n", "
| Name | \n", "object | \n", "891 | \n", "0 | \n", "Braund, Mr. Owen Harris | \n", "1 | \n", "Braund, Mr. Owen Harris | \n", "
| Sex | \n", "object | \n", "2 | \n", "0 | \n", "male | \n", "577 | \n", "male | \n", "
| Age | \n", "float64 | \n", "89 | \n", "177 | \n", "24.0 | \n", "30 | \n", "22.0 | \n", "
| SibSp | \n", "int64 | \n", "7 | \n", "0 | \n", "0 | \n", "608 | \n", "1 | \n", "
| Parch | \n", "int64 | \n", "7 | \n", "0 | \n", "0 | \n", "678 | \n", "0 | \n", "
| Ticket | \n", "object | \n", "681 | \n", "0 | \n", "347082 | \n", "7 | \n", "A/5 21171 | \n", "
| Fare | \n", "float64 | \n", "248 | \n", "0 | \n", "8.05 | \n", "43 | \n", "7.25 | \n", "
| Cabin | \n", "object | \n", "148 | \n", "687 | \n", "B96 B98 | \n", "4 | \n", "NaN | \n", "
| Embarked | \n", "object | \n", "4 | \n", "2 | \n", "S | \n", "644 | \n", "S | \n", "
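A per-column summary like the one above can be assembled with plain pandas; a minimal sketch (the helper name is mine, not from the notebook):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column type, cardinality, missing count and mode."""
    rows = {}
    for col in df.columns:
        vc = df[col].value_counts(dropna=True)
        rows[col] = {
            'Type': df[col].dtype,
            'Nunique(Excl.Nulls)': df[col].nunique(),   # nunique() excludes NaN by default
            '#of Missing': df[col].isna().sum(),
            'MostFreqItem': vc.index[0] if len(vc) else None,
            'MostFreqCount': vc.iloc[0] if len(vc) else 0,
            'First': df[col].iloc[0],
        }
    return pd.DataFrame(rows).T

summarize(df)
```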
| \n", " | count | \n", "
|---|---|
| Cabin | \n", "\n", " |
| B96 B98 | \n", "4 | \n", "
| G6 | \n", "4 | \n", "
| C23 C25 C27 | \n", "4 | \n", "
| C22 C26 | \n", "3 | \n", "
| F33 | \n", "3 | \n", "
| F2 | \n", "3 | \n", "
| E101 | \n", "3 | \n", "
| D | \n", "3 | \n", "
| C78 | \n", "2 | \n", "
| C93 | \n", "2 | \n", "
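Cabin is too high-cardinality and too sparse to use as-is, and a CabinGrup column appears in later outputs (values like C and Z). A plausible derivation, as an assumption on my part (the deck letter, with 'Z' as a sentinel for missing cabins), would be:

```python
# Hypothetical reconstruction of the CabinGrup feature seen in later outputs
df['CabinGrup'] = df['Cabin'].str[0].fillna('Z')
df['CabinGrup'].value_counts()
```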
| \n", " | Survived | \n", "
|---|---|
| Sex | \n", "0.540200 | \n", "
| Pclass | \n", "0.336684 | \n", "
| CabinGrup | \n", "0.320034 | \n", "
| Fare | \n", "0.257307 | \n", "
| Embarked | \n", "0.173099 | \n", "
| \n", " | proportion | \n", "
|---|---|
| Survived | \n", "\n", " |
| 0 | \n", "0.616162 | \n", "
| 1 | \n", "0.383838 | \n", "
| \n", " | \n", " | proportion | \n", "count | \n", "
|---|---|---|---|
| Embarked | \n", "Survived | \n", "\n", " | \n", " |
| C | \n", "1 | \n", "0.553571 | \n", "93 | \n", "
| 0 | \n", "0.446429 | \n", "75 | \n", "|
| Q | \n", "0 | \n", "0.610390 | \n", "47 | \n", "
| 1 | \n", "0.389610 | \n", "30 | \n", "|
| S | \n", "0 | \n", "0.663043 | \n", "427 | \n", "
| 1 | \n", "0.336957 | \n", "217 | \n", "
| \n", " | Survived | \n", "Pclass | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Fare | \n", "
|---|---|---|---|---|---|---|
| count | \n", "891.000000 | \n", "891.000000 | \n", "714.000000 | \n", "891.000000 | \n", "891.000000 | \n", "891.000000 | \n", "
| mean | \n", "0.383838 | \n", "2.308642 | \n", "29.699118 | \n", "0.523008 | \n", "0.381594 | \n", "32.204208 | \n", "
| std | \n", "0.486592 | \n", "0.836071 | \n", "14.526497 | \n", "1.102743 | \n", "0.806057 | \n", "49.693429 | \n", "
| min | \n", "0.000000 | \n", "1.000000 | \n", "0.420000 | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "
| 25% | \n", "0.000000 | \n", "2.000000 | \n", "20.125000 | \n", "0.000000 | \n", "0.000000 | \n", "7.910400 | \n", "
| 50% | \n", "0.000000 | \n", "3.000000 | \n", "28.000000 | \n", "0.000000 | \n", "0.000000 | \n", "14.454200 | \n", "
| 75% | \n", "1.000000 | \n", "3.000000 | \n", "38.000000 | \n", "1.000000 | \n", "0.000000 | \n", "31.000000 | \n", "
| max | \n", "1.000000 | \n", "3.000000 | \n", "80.000000 | \n", "8.000000 | \n", "6.000000 | \n", "512.329200 | \n", "
| \n", " | Survived | \n", "Pclass | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Fare | \n", "Embarked | \n", "CabinGrup | \n", "
|---|---|---|---|---|---|---|---|---|---|
| 78 | \n", "1 | \n", "2 | \n", "male | \n", "0.83 | \n", "0 | \n", "2 | \n", "29.0000 | \n", "S | \n", "Z | \n", "
| 305 | \n", "1 | \n", "1 | \n", "male | \n", "0.92 | \n", "1 | \n", "2 | \n", "151.5500 | \n", "S | \n", "C | \n", "
| 469 | \n", "1 | \n", "3 | \n", "female | \n", "0.75 | \n", "2 | \n", "1 | \n", "19.2583 | \n", "C | \n", "Z | \n", "
| 644 | \n", "1 | \n", "3 | \n", "female | \n", "0.75 | \n", "2 | \n", "1 | \n", "19.2583 | \n", "C | \n", "Z | \n", "
| 755 | \n", "1 | \n", "2 | \n", "male | \n", "0.67 | \n", "1 | \n", "1 | \n", "14.5000 | \n", "S | \n", "Z | \n", "
| 803 | \n", "1 | \n", "3 | \n", "male | \n", "0.42 | \n", "0 | \n", "1 | \n", "8.5167 | \n", "C | \n", "Z | \n", "
| 831 | \n", "1 | \n", "2 | \n", "male | \n", "0.83 | \n", "1 | \n", "1 | \n", "18.7500 | \n", "S | \n", "Z | \n", "
| \n", " | Survived | \n", "Pclass | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Fare | \n", "Embarked | \n", "CabinGrup | \n", "
|---|---|---|---|---|---|---|---|---|---|
| Survived | \n", "1.0000 | \n", "NaN | \n", "0.5402 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| Pclass | \n", "NaN | \n", "1.000000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0.594217 | \n", "NaN | \n", "0.598269 | \n", "
| Sex | \n", "0.5402 | \n", "NaN | \n", "1.0000 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| Age | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1.0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| SibSp | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1.0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| Parch | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1.0 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| Fare | \n", "NaN | \n", "0.594217 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1.000000 | \n", "NaN | \n", "0.576878 | \n", "
| Embarked | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1.0 | \n", "NaN | \n", "
| CabinGrup | \n", "NaN | \n", "0.598269 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0.576878 | \n", "NaN | \n", "1.000000 | \n", "
| \n", " | Survived | \n", "Pclass | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Fare | \n", "Embarked | \n", "CabinGrup | \n", "
|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "0 | \n", "3 | \n", "male | \n", "22.0 | \n", "1 | \n", "0 | \n", "7.2500 | \n", "S | \n", "Z | \n", "
| 1 | \n", "1 | \n", "1 | \n", "female | \n", "38.0 | \n", "1 | \n", "0 | \n", "71.2833 | \n", "C | \n", "C | \n", "
| 2 | \n", "1 | \n", "3 | \n", "female | \n", "26.0 | \n", "0 | \n", "0 | \n", "7.9250 | \n", "S | \n", "Z | \n", "
| 3 | \n", "1 | \n", "1 | \n", "female | \n", "35.0 | \n", "1 | \n", "0 | \n", "53.1000 | \n", "S | \n", "C | \n", "
| 4 | \n", "0 | \n", "3 | \n", "male | \n", "35.0 | \n", "0 | \n", "0 | \n", "8.0500 | \n", "S | \n", "Z | \n", "
The fitted search (scoring with accuracy):

```
HalvingRandomSearchCV(cv=RepeatedKFold(n_repeats=10, n_splits=5, random_state=1),
                      error_score='raise',
                      estimator=Pipeline(steps=[('log',
                                                 FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                                                     kw_args={'col_name': 'Fare'})),
                                                ('ct',
                                                 ColumnTransformer(n_jobs=-1,
                                                                   remainder='passthrough',
                                                                   transformers=[('nominals',
                                                                                  Pipeline(steps=[('ohe',
                                                                                                   OneHotEncoder(d...
                      'clf__alpha': array([1.e+04, 1.e+03, 1.e+02, 1.e+01, 1.e+00, 1.e-01, 1.e-02, 1.e-03,
                                           1.e-04, 1.e-05]),
                      'clf__class_weight': [{0: 1, 1: 2}, {0: 1, 1: 4}, {0: 1, 1: 6},
                                            {0: 1, 1: 8}, {0: 1, 1: 10}, 'balanced'],
                      'clf__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
                      'clf__tol': [0.001, 0.0001],
                      'ct__numerics__ouh': [OutlierHandler(featureindices=[0, 3]), None],
                      'ct__numerics__scl': [StandardScaler(), MinMaxScaler()]}],
                      scoring='accuracy', verbose=1)
```

The best pipeline it found:

```
Pipeline(steps=[('log',
                 FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                     kw_args={'col_name': 'Fare'})),
                ('ct',
                 ColumnTransformer(n_jobs=-1, remainder='passthrough',
                                   transformers=[('nominals',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 handle_unknown='ignore'))]),
                                                  [1, 6, 7]),
                                                 ('numerics',
                                                  Pipeline(steps=[('imp',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('ouh',
                                                                   OutlierHandler(featureindices=[0, 3])),
                                                                  ('scl',
                                                                   StandardScaler())]),
                                                  [2, 3, 4, 5])])),
                ('clf',
                 RidgeClassifier(alpha=0.001, class_weight='balanced',
                                 random_state=42, solver='cholesky'))])
```

Top cross-validation results:

|   | param_ct__numerics__scl | param_ct__numerics__ouh | param_clf__tol | param_clf__penalty | param_clf__learning_rate | param_clf__l1_ratio | param_clf__eta0 | param_clf__early_stopping | param_clf__class_weight | param_clf__alpha | param_clf | param_clf__solver | param_clf__C | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0001 | NaN | NaN | NaN | NaN | NaN | balanced | 0.001 | RidgeClassifier(random_state=42) | cholesky | NaN | 0.790235 | 0.043490 |
| 17 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0001 | NaN | NaN | NaN | NaN | NaN | balanced | 0.001 | RidgeClassifier(random_state=42) | cholesky | NaN | 0.778736 | 0.072210 |
| 18 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0001 | elasticnet | constant | 1.0 | 0.0001 | False | balanced | 0.010 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.773740 | 0.045706 |
| 16 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0001 | elasticnet | constant | 1.0 | 0.0001 | False | balanced | 0.010 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.738529 | 0.075656 |
| 3 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0001 | NaN | NaN | NaN | NaN | NaN | balanced | 0.001 | RidgeClassifier(random_state=42) | cholesky | NaN | 0.734667 | 0.136529 |

Best score and fastest fit per classifier type:

| param_clf | MAX of mean_test_score | MIN of mean_fit_time |
|---|---|---|
| RidgeClassifier | 0.790235 | 0.032552 |
| SGDClassifier | 0.773740 | 0.031389 |
| LogisticRegression | 0.659494 | 0.056477 |
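The cell that builds this search is not included in this export. A sketch of what it plausibly looked like, given the output above (logTransformer is a custom function from the notebook whose definition is not shown, so it is stubbed here; the custom OutlierHandler step and the SGDClassifier/LogisticRegression candidate grids are omitted for brevity):

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                   OneHotEncoder, StandardScaler)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeClassifier

def logTransformer(X, col_name='Fare'):
    # Stub for the notebook's custom transformer: log1p on one column
    X = X.copy()
    X[col_name] = np.log1p(X[col_name])
    return X

pipe = Pipeline(steps=[
    ('log', FunctionTransformer(func=logTransformer, kw_args={'col_name': 'Fare'})),
    ('ct', ColumnTransformer(n_jobs=-1, remainder='passthrough', transformers=[
        ('nominals', Pipeline(steps=[('ohe', OneHotEncoder(drop='first',
                                                           handle_unknown='ignore'))]),
         [1, 6, 7]),
        ('numerics', Pipeline(steps=[('imp', SimpleImputer(strategy='median')),
                                     ('scl', StandardScaler())]),
         [2, 3, 4, 5]),
    ])),
    ('clf', RidgeClassifier(random_state=42)),
])

param_distributions = [{
    'clf': [RidgeClassifier(random_state=42)],
    'clf__alpha': np.logspace(4, -5, 10),
    'clf__class_weight': [{0: 1, 1: 2}, {0: 1, 1: 4}, {0: 1, 1: 6},
                          {0: 1, 1: 8}, {0: 1, 1: 10}, 'balanced'],
    'clf__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
    'clf__tol': [0.001, 0.0001],
    'ct__numerics__scl': [StandardScaler(), MinMaxScaler()],
}]

search = HalvingRandomSearchCV(pipe, param_distributions,
                               cv=RepeatedKFold(n_repeats=10, n_splits=5, random_state=1),
                               scoring='accuracy', error_score='raise', verbose=1)
# search.fit(X_train, y_train)
```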
The same setup searched with RandomizedSearchCV:

```
RandomizedSearchCV(cv=RepeatedKFold(n_repeats=10, n_splits=5, random_state=1),
                   error_score='raise',
                   estimator=Pipeline(steps=[('log',
                                              FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                                                  kw_args={'col_name': 'Fare'})),
                                             ('ct',
                                              ColumnTransformer(n_jobs=-1,
                                                                remainder='passthrough',
                                                                transformers=[('nominals',
                                                                               Pipeline(steps=[('ohe',
                                                                                                OneHotEncoder(drop...
                   'clf__alpha': array([1.e+04, 1.e+03, 1.e+02, 1.e+01, 1.e+00, 1.e-01, 1.e-02, 1.e-03,
                                        1.e-04, 1.e-05]),
                   'clf__class_weight': [{0: 1, 1: 2}, {0: 1, 1: 4}, {0: 1, 1: 6},
                                         {0: 1, 1: 8}, {0: 1, 1: 10}, 'balanced'],
                   'clf__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
                   'clf__tol': [0.001, 0.0001],
                   'ct__numerics__ouh': [OutlierHandler(featureindices=[0, 3]), None],
                   'ct__numerics__scl': [StandardScaler(), MinMaxScaler()]}],
                   scoring='accuracy', verbose=1)
```

The best pipeline it found:

```
Pipeline(steps=[('log',
                 FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                     kw_args={'col_name': 'Fare'})),
                ('ct',
                 ColumnTransformer(n_jobs=-1, remainder='passthrough',
                                   transformers=[('nominals',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 handle_unknown='ignore'))]),
                                                  [1, 6, 7]),
                                                 ('numerics',
                                                  Pipeline(steps=[('imp',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('ouh',
                                                                   OutlierHandler(featureindices=[0, 3])),
                                                                  ('scl',
                                                                   StandardScaler())]),
                                                  [2, 3, 4, 5])])),
                ('clf',
                 SGDClassifier(alpha=1e-05, class_weight='balanced',
                               early_stopping=True, eta0=0.01, l1_ratio=0.5,
                               learning_rate='constant', loss='log_loss',
                               max_iter=4000, penalty='l1', random_state=42))])
```

Top cross-validation results:

|   | param_ct__numerics__scl | param_ct__numerics__ouh | param_clf__tol | param_clf__penalty | param_clf__learning_rate | param_clf__l1_ratio | param_clf__eta0 | param_clf__early_stopping | param_clf__class_weight | param_clf__alpha | param_clf | param_clf__solver | param_clf__C | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0010 | l1 | constant | 0.5 | 0.0100 | True | balanced | 0.00001 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.788054 | 0.034758 |
| 21 | MinMaxScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0010 | l1 | adaptive | 0.5 | 0.0100 | True | balanced | 0.00001 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.786992 | 0.030304 |
| 67 | StandardScaler() | None | 0.0010 | l2 | optimal | 0.5 | 0.0010 | False | balanced | 0.00100 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.786829 | 0.039258 |
| 94 | StandardScaler() | None | 0.0010 | l1 | adaptive | 0.5 | 0.0100 | False | {0: 1, 1: 2} | 0.01000 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.786247 | 0.037753 |
| 39 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0001 | elasticnet | constant | 0.5 | 0.0001 | False | balanced | 0.00010 | SGDClassifier(loss='log_loss', max_iter=4000, ... | NaN | NaN | 0.785801 | 0.035637 |

Best score and fastest fit per classifier type:

| param_clf | MAX of mean_test_score | MIN of mean_fit_time |
|---|---|---|
| SGDClassifier | 0.788054 | 0.034028 |
| RidgeClassifier | 0.785356 | 0.037027 |
| LogisticRegression | 0.743727 | 0.078997 |

The best row in detail:

|   | 29 |
|---|---|
| param_ct__numerics__scl | StandardScaler() |
| param_ct__numerics__ouh | OutlierHandler(featureindices=[0, 3]) |
| param_clf__tol | 0.001 |
| param_clf__penalty | l1 |
| param_clf__learning_rate | constant |
| param_clf__l1_ratio | 0.5 |
| param_clf__eta0 | 0.01 |
| param_clf__early_stopping | True |
| param_clf__class_weight | balanced |
| param_clf__alpha | 0.00001 |
| param_clf | SGDClassifier(loss='log_loss', max_iter=4000, ... |
| param_clf__solver | NaN |
| param_clf__C | NaN |
| mean_test_score | 0.788054 |
| std_test_score | 0.034758 |
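Tables like the ones above come straight out of cv_results_; a minimal sketch of how they can be derived from a fitted search object:

```python
import pandas as pd

res = pd.DataFrame(search.cv_results_)

# Top rows by mean CV score
param_cols = [c for c in res.columns if c.startswith('param_')]
top = res.sort_values('mean_test_score', ascending=False)
top[param_cols + ['mean_test_score', 'std_test_score']].head()

# Best score and fastest fit per classifier type
res['clf_name'] = res['param_clf'].astype(str).str.split('(').str[0]
res.groupby('clf_name').agg({'mean_test_score': 'max', 'mean_fit_time': 'min'})
```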
A second HalvingRandomSearchCV on a narrowed grid:

```
HalvingRandomSearchCV(cv=RepeatedKFold(n_repeats=10, n_splits=5, random_state=1),
                      error_score='raise',
                      estimator=Pipeline(steps=[('log',
                                                 FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                                                     kw_args={'col_name': 'Fare'})),
                                                ('ct',
                                                 ColumnTransformer(n_jobs=-1,
                                                                   remainder='passthrough',
                                                                   transformers=[('nominals',
                                                                                  Pipeline(steps=[('ohe',
                                                                                                   OneHotEncoder(d...
                      'clf__class_weight': ['balanced'],
                      'clf__early_stopping': [True, False],
                      'clf__eta0': [0.005, 0.01, 0.02],
                      'clf__l1_ratio': [0.3, 0.5, 0.7],
                      'clf__learning_rate': ['adaptive'],
                      'clf__penalty': ['l1'],
                      'clf__tol': [0.0005, 0.001, 0.002, 0.005],
                      'ct__numerics__ouh': [OutlierHandler(featureindices=[0, 3]), None],
                      'ct__numerics__scl': [StandardScaler()]}],
                      scoring='accuracy', verbose=1)
```

The best pipeline it found (note that the ouh step came out as None here):

```
Pipeline(steps=[('log',
                 FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                     kw_args={'col_name': 'Fare'})),
                ('ct',
                 ColumnTransformer(n_jobs=-1, remainder='passthrough',
                                   transformers=[('nominals',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 handle_unknown='ignore'))]),
                                                  [1, 6, 7]),
                                                 ('numerics',
                                                  Pipeline(steps=[('imp',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('ouh', None),
                                                                  ('scl',
                                                                   StandardScaler())]),
                                                  [2, 3, 4, 5])])),
                ('clf',
                 SGDClassifier(alpha=0.005, class_weight='balanced',
                               early_stopping=True, eta0=0.02, l1_ratio=0.3,
                               learning_rate='adaptive', loss='log_loss',
                               max_iter=4000, penalty='l1', random_state=42))])
```

The same narrowed grid searched with RandomizedSearchCV, this time with scoring='neg_log_loss':

```
RandomizedSearchCV(cv=RepeatedKFold(n_repeats=10, n_splits=5, random_state=1),
                   error_score='raise',
                   estimator=Pipeline(steps=[('log',
                                              FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                                                  kw_args={'col_name': 'Fare'})),
                                             ('ct',
                                              ColumnTransformer(n_jobs=-1,
                                                                remainder='passthrough',
                                                                transformers=[('nominals',
                                                                               Pipeline(steps=[('ohe',
                                                                                                OneHotEncoder(drop...
                   'clf__class_weight': ['balanced'],
                   'clf__early_stopping': [True, False],
                   'clf__eta0': [0.005, 0.01, 0.02],
                   'clf__l1_ratio': [0.3, 0.5, 0.7],
                   'clf__learning_rate': ['adaptive'],
                   'clf__penalty': ['l1'],
                   'clf__tol': [0.0005, 0.001, 0.002, 0.005],
                   'ct__numerics__ouh': [OutlierHandler(featureindices=[0, 3]), None],
                   'ct__numerics__scl': [StandardScaler()]}],
                   scoring='neg_log_loss', verbose=1)
```

The best pipeline it found:

```
Pipeline(steps=[('log',
                 FunctionTransformer(func=<function logTransformer at 0x7e7147611b40>,
                                     kw_args={'col_name': 'Fare'})),
                ('ct',
                 ColumnTransformer(n_jobs=-1, remainder='passthrough',
                                   transformers=[('nominals',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 handle_unknown='ignore'))]),
                                                  [1, 6, 7]),
                                                 ('numerics',
                                                  Pipeline(steps=[('imp',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('ouh', None),
                                                                  ('scl',
                                                                   StandardScaler())]),
                                                  [2, 3, 4, 5])])),
                ('clf',
                 SGDClassifier(alpha=0.001, class_weight='balanced', eta0=0.02,
                               l1_ratio=0.3, learning_rate='adaptive',
                               loss='log_loss', max_iter=4000, penalty='l1',
                               random_state=42, tol=0.0005))])
```

Top cross-validation results:

|   | param_ct__numerics__scl | param_ct__numerics__ouh | param_clf__tol | param_clf__penalty | param_clf__learning_rate | param_clf__l1_ratio | param_clf__eta0 | param_clf__early_stopping | param_clf__class_weight | param_clf__alpha | param_clf | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | StandardScaler() | None | 0.0005 | l1 | adaptive | 0.7 | 0.02 | False | balanced | 0.001 | SGDClassifier(loss='log_loss', max_iter=4000, ... | -0.451633 | 0.042714 |
| 24 | StandardScaler() | None | 0.0005 | l1 | adaptive | 0.3 | 0.02 | False | balanced | 0.001 | SGDClassifier(loss='log_loss', max_iter=4000, ... | -0.451633 | 0.042714 |
| 84 | StandardScaler() | None | 0.0005 | l1 | adaptive | 0.5 | 0.02 | False | balanced | 0.002 | SGDClassifier(loss='log_loss', max_iter=4000, ... | -0.451745 | 0.041281 |
| 77 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0010 | l1 | adaptive | 0.3 | 0.02 | False | balanced | 0.001 | SGDClassifier(loss='log_loss', max_iter=4000, ... | -0.452186 | 0.042767 |
| 14 | StandardScaler() | OutlierHandler(featureindices=[0, 3]) | 0.0010 | l1 | adaptive | 0.5 | 0.02 | False | balanced | 0.001 | SGDClassifier(loss='log_loss', max_iter=4000, ... | -0.452186 | 0.042767 |
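Note that with scoring='neg_log_loss' the mean_test_score values above are negative: sklearn scorers are always "higher is better", so the log loss is negated (closer to 0 is better). A small sketch of computing it directly (X_test/y_test assumed from the earlier split):

```python
from sklearn.metrics import log_loss

# Probabilities are required, hence loss='log_loss' on SGDClassifier
proba = search.best_estimator_.predict_proba(X_test)
print(log_loss(y_test, proba))   # the raw loss (lower is better)
print(search.best_score_)        # the negated CV version (higher is better)
```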
```
RidgeClassifier(alpha=0.001, class_weight='balanced', random_state=42,
                solver='cholesky')
```