fmda/data/scaling_tutorial.ipynb

   1 {
   2  "cells": [
   3   {
   4    "cell_type": "markdown",
   5    "id": "37200d84-2865-422d-b961-a28da3aa0367",
   6    "metadata": {},
   7    "source": [
   8     "# Data Scaling Tutorial\n",
   9     "\n",
  10     "This notebook introduces some data scaling methods used in machine learning. Features are scaled for several reasons. Some ML techniques, such as L1/L2 regularization and nearest-neighbors methods, depend critically on features being on a common scale. In neural networks, scaling lets the network learn the relative contribution of each feature without being dominated by the scale of any one feature.\n",
  11     "\n",
  12     "*Note:* data can be transformed and inverse-transformed using the methods below, but the original values are recovered only up to floating-point rounding error."
  13    ]
  14   },
  15   {
  16    "cell_type": "markdown",
  17    "id": "dab0f8a1-ce3c-4931-98b7-6b4c35e9c2aa",
  18    "metadata": {},
  19    "source": [
  20     "## Environment"
  21    ]
  22   },
  23   {
  24    "cell_type": "code",
  25    "execution_count": 1,
  26    "id": "af2710ac-2ae8-4e16-bc42-9d4da2c9ef35",
  27    "metadata": {},
  28    "outputs": [
  29     {
  30      "name": "stderr",
  31      "output_type": "stream",
  32      "text": [
  33       "C:\\Users\\jhirs\\anaconda3\\lib\\site-packages\\requests\\__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (5.1.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n",
  34       "  warnings.warn(\n"
  35      ]
  36     }
  37    ],
  38    "source": [
  39     "# Environment\n",
  40     "import numpy as np\n",
  41     "import pandas as pd\n",
  42     "import tensorflow as tf\n",
  43     "import matplotlib.pyplot as plt\n",
  44     "import sys\n",
  45     "# Local modules\n",
  46     "sys.path.append('..')\n",
  47     "import reproducibility\n",
  48     "from utils import hash2\n",
  49     "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
  50     "from moisture_rnn_pkl import pkl2train\n",
  51     "from moisture_rnn import create_rnn_data2"
  52    ]
  53   },
  54   {
  55    "cell_type": "markdown",
  56    "id": "6d31aae1-b030-455d-8919-3b49d56c9e28",
  57    "metadata": {},
  58    "source": [
  59     "## Setup & Data Read\n"
  60    ]
  61   },
  62   {
  63    "cell_type": "code",
  64    "execution_count": 2,
  65    "id": "0f00156a-8f3c-4ce1-8c7a-4c1a17dc37d7",
  66    "metadata": {
  67     "scrolled": true
  68    },
  69    "outputs": [],
  70    "source": [
  71     "file='test_CA_202401.pkl'\n",
  72     "train = pkl2train([file])"
  73    ]
  74   },
  75   {
  76    "cell_type": "code",
  77    "execution_count": 3,
  78    "id": "cef48924-6fd5-4c51-8be0-50ec99cc1de8",
  79    "metadata": {},
  80    "outputs": [
  81     {
  82      "data": {
  83       "text/plain": [
  84        "{'batch_size': 32,\n",
  85        " 'timesteps': 5,\n",
  86        " 'optimizer': 'adam',\n",
  87        " 'rnn_layers': 1,\n",
  88        " 'rnn_units': 6,\n",
  89        " 'dense_layers': 1,\n",
  90        " 'dense_units': 1,\n",
  91        " 'activation': ['linear', 'linear'],\n",
  92        " 'centering': [0.0, 0.0],\n",
  93        " 'dropout': [0.2, 0.2],\n",
  94        " 'recurrent_dropout': 0.2,\n",
  95        " 'reset_states': True,\n",
  96        " 'epochs': 100,\n",
  97        " 'learning_rate': 0.001,\n",
  98        " 'phys_initialize': False,\n",
  99        " 'stateful': True,\n",
 100        " 'verbose_weights': True,\n",
 101        " 'verbose_fit': False,\n",
 102        " 'features_list': ['Ed', 'Ew', 'rain'],\n",
 103        " 'scale': False,\n",
 104        " 'scaler': 'minmax',\n",
 105        " 'train_frac': 0.5,\n",
 106        " 'val_frac': 0.1}"
 107       ]
 108      },
 109      "execution_count": 3,
 110      "metadata": {},
 111      "output_type": "execute_result"
 112     }
 113    ],
 114    "source": [
 115     "import yaml\n",
 116     "\n",
 117     "with open(\"../params.yaml\") as file:\n",
 118     "    params = yaml.safe_load(file)[\"rnn\"]\n",
 119     "params"
 120    ]
 121   },
 122   {
 123    "cell_type": "code",
 124    "execution_count": 4,
 125    "id": "d2f33573-a683-42df-84c4-3def2c71ea91",
 126    "metadata": {},
 127    "outputs": [
 128     {
 129      "name": "stdout",
 130      "output_type": "stream",
 131      "text": [
 132       "Not scaling data\n"
 133      ]
 134     }
 135    ],
 136    "source": [
 137     "case = 'KRNC1_202401'\n",
 138     "rnn_dat = create_rnn_data2(train[case], params)\n",
 139     "X = rnn_dat[\"X_train\"]"
 140    ]
 141   },
 142   {
 143    "cell_type": "code",
 144    "execution_count": 5,
 145    "id": "378a92b1-a626-420f-825b-13e7f7d51556",
 146    "metadata": {},
 147    "outputs": [
 148     {
 149      "data": {
 150       "text/plain": [
 151        "array([[15.65665995, 14.24907313,  0.        ],\n",
 152        "       [16.37073623, 14.95777203,  0.        ],\n",
 153        "       [16.8830433 , 15.46613268,  0.        ],\n",
 154        "       [16.58511884, 15.17121406,  0.        ],\n",
 155        "       [15.42272608, 14.02588457,  0.        ]])"
 156       ]
 157      },
 158      "execution_count": 5,
 159      "metadata": {},
 160      "output_type": "execute_result"
 161     }
 162    ],
 163    "source": [
 164     "X[0:5, :]"
 165    ]
 166   },
 167   {
 168    "cell_type": "markdown",
 169    "id": "12737986-1847-42eb-8339-104e2197b4f7",
 170    "metadata": {},
 171    "source": [
 172     "## Min-Max Scaler\n",
 173     "\n",
 174     "Rescales each feature to a given range, $[0, 1]$ by default in `sklearn` (`feature_range=(0, 1)`). If $x$ is a feature vector, the transformed vector $x'$ is:\n",
 175     "\n",
 176     "$$\n",
 177     "x' = \\frac{x-\\min\\{x\\}}{\\max\\{x\\}-\\min\\{x\\}}\n",
 178     "$$\n",
 179     "\n",
 180     "Notice that an element of $x'$ is 0 when the corresponding element of $x$ equals the minimum, and 1 when it equals the maximum, as desired."
 181    ]
 182   },
 183   {
 184    "cell_type": "markdown",
 185    "id": "0aea996e-2fe3-4b58-8362-655ea73c5c22",
 186    "metadata": {},
 187    "source": [
 188     "### Manual Calculation"
 189    ]
 190   },
 191   {
 192    "cell_type": "code",
 193    "execution_count": 6,
 194    "id": "33d6c3c1-d087-4651-bfc1-9bd4ef161407",
 195    "metadata": {},
 196    "outputs": [
 197     {
 198      "name": "stdout",
 199      "output_type": "stream",
 200      "text": [
 201       "X column mins: [9.18025688 7.97581012 0.        ]\n",
 202       "X column maxs: [34.47706758 32.29556689  1.80660525]\n"
 203      ]
 204     }
 205    ],
 206    "source": [
 207     "x_min = X.min(axis=0)\n",
 208     "x_max = X.max(axis=0)\n",
 209     "print(f\"X column mins: {x_min}\")\n",
 210     "print(f\"X column maxs: {x_max}\")"
 211    ]
 212   },
 213   {
 214    "cell_type": "code",
 215    "execution_count": 7,
 216    "id": "5141c3ac-cd3d-46ae-992e-97e3f3594d7d",
 217    "metadata": {},
 218    "outputs": [
 219     {
 220      "name": "stdout",
 221      "output_type": "stream",
 222      "text": [
 223       "[[0.25601658 0.25794925 0.        ]\n",
 224       " [0.2842445  0.28709012 0.        ]\n",
 225       " [0.30449635 0.30799332 0.        ]\n",
 226       " [0.29271919 0.29586661 0.        ]\n",
 227       " [0.24676902 0.248772   0.        ]]\n"
 228      ]
 229     }
 230    ],
 231    "source": [
 232     "X_scaled = (X - x_min) / (x_max - x_min)\n",
 233     "\n",
 234     "print(X_scaled[0:5, :])"
 235    ]
 236   },
 237   {
 238    "cell_type": "markdown",
 239    "id": "11af6579-38f8-4af7-b00e-e90c9e7e6529",
 240    "metadata": {},
 241    "source": [
 242     "The scaled data should have column minimums and maximums equal to 0 and 1, respectively, up to rounding error."
 243    ]
 244   },
 245   {
 246    "cell_type": "code",
 247    "execution_count": 8,
 248    "id": "fc879733-1fcb-4b09-9357-e2b06fc5c9ea",
 249    "metadata": {},
 250    "outputs": [
 251     {
 252      "name": "stdout",
 253      "output_type": "stream",
 254      "text": [
 255       "X-scaled column mins: [0. 0. 0.]\n",
 256       "X-scaled column maxs: [1. 1. 1.]\n"
 257      ]
 258     }
 259    ],
 260    "source": [
 261     "print(f\"X-scaled column mins: {X_scaled.min(axis=0)}\")\n",
 262     "print(f\"X-scaled column maxs: {X_scaled.max(axis=0)}\")"
 263    ]
 264   },
 265   {
 266    "cell_type": "markdown",
 267    "id": "20eb432d-d998-49a6-b9af-ef5627ffca3e",
 268    "metadata": {},
 269    "source": [
 270     "### Using `sklearn`"
 271    ]
 272   },
 273   {
 274    "cell_type": "code",
 275    "execution_count": 9,
 276    "id": "c7e8757c-a0d2-403c-96c9-f0983dc641a5",
 277    "metadata": {},
 278    "outputs": [],
 279    "source": [
 280     "scaler = MinMaxScaler()\n",
 281     "scaler.fit(X)\n",
 282     "X_scaled2 = scaler.transform(X)"
 283    ]
 284   },
 285   {
 286    "cell_type": "code",
 287    "execution_count": 10,
 288    "id": "654315e5-54b1-4e3c-9028-9e7fd7f70b70",
 289    "metadata": {},
 290    "outputs": [
 291     {
 292      "name": "stdout",
 293      "output_type": "stream",
 294      "text": [
 295       "[[0.25601658 0.25794925 0.        ]\n",
 296       " [0.2842445  0.28709012 0.        ]\n",
 297       " [0.30449635 0.30799332 0.        ]\n",
 298       " [0.29271919 0.29586661 0.        ]\n",
 299       " [0.24676902 0.248772   0.        ]]\n"
 300      ]
 301     }
 302    ],
 303    "source": [
 304     "print(X_scaled2[0:5, :])"
 305    ]
 306   },
 307   {
 308    "cell_type": "markdown",
 309    "id": "7c82a9bb-d5b0-42a5-ba80-7e8f12bcfc21",
 310    "metadata": {},
 311    "source": [
 312     "### Compare Difference\n",
 313     "\n",
 314     "The maximum absolute difference between the two methods should be zero or on the order of machine epsilon (about $2.2 \\times 10^{-16}$ for 64-bit floats)."
 315    ]
 316   },
 317   {
 318    "cell_type": "code",
 319    "execution_count": 11,
 320    "id": "6b28dd98-95be-43c8-8e08-26f05060a8d8",
 321    "metadata": {},
 322    "outputs": [
 323     {
 324      "data": {
 325       "text/plain": [
 326        "2.220446049250313e-16"
 327       ]
 328      },
 329      "execution_count": 11,
 330      "metadata": {},
 331      "output_type": "execute_result"
 332     }
 333    ],
 334    "source": [
 335     "np.max(np.abs(X_scaled - X_scaled2))"
 336    ]
 337   },
 338   {
 339    "cell_type": "markdown",
 340    "id": "eba2f69d-b3c6-4f08-bbe0-4b191970223a",
 341    "metadata": {},
 342    "source": [
 343     "## Standard Scaler\n",
 344     "\n",
 345     "Scales each feature to mean 0 and standard deviation 1, equivalent to computing z-scores. The method works best when features are roughly symmetric and approximately normally distributed; for heavily skewed features or features with outliers, the mean and standard deviation can represent the data poorly. If $x$ is a feature vector of length $N$, the standardized vector $x'$ is:\n",
 346     "\n",
 347     "$$\n",
 348     "x' = \\frac{x-\\mu}{s}\n",
 349     "$$\n",
 350     "\n",
 351     "\n",
 352     "\n",
 353     "$$\\text{where}\\quad \\mu = \\frac{1}{N}\\sum_{i=1}^N x_i \\quad \\text{and}\\quad s = \\sqrt{\\frac{1}{N}\\sum_{i=1}^N (x_i-\\mu)^2}$$"
 354    ]
 355   },
 356   {
 357    "cell_type": "markdown",
 358    "id": "a0f43caf-618f-472f-b0d5-dba0db187bea",
 359    "metadata": {},
 360    "source": [
 361     "### Manual Calculation"
 362    ]
 363   },
 364   {
 365    "cell_type": "code",
 366    "execution_count": 12,
 367    "id": "ee69773e-10d5-4da2-8694-95f15dba681f",
 368    "metadata": {},
 369    "outputs": [
 370     {
 371      "name": "stdout",
 372      "output_type": "stream",
 373      "text": [
 374       "X column means: [17.37172913 15.95227122  0.07087056]\n",
 375       "X column sds: [4.14878625 4.04256916 0.32151782]\n"
 376      ]
 377     }
 378    ],
 379    "source": [
 380     "mu = X.mean(axis=0)\n",
 381     "s = X.std(axis=0)\n",
 382     "print(f\"X column means: {mu}\")\n",
 383     "print(f\"X column sds: {s}\")"
 384    ]
 385   },
 386   {
 387    "cell_type": "code",
 388    "execution_count": 13,
 389    "id": "1b9808a6-4d51-4b8d-b1ee-946623edcb5f",
 390    "metadata": {},
 391    "outputs": [
 392     {
 393      "name": "stdout",
 394      "output_type": "stream",
 395      "text": [
 396       "[[-0.41339059 -0.42131576 -0.22042497]\n",
 397       " [-0.24127367 -0.24600672 -0.22042497]\n",
 398       " [-0.11779007 -0.12025485 -0.22042497]\n",
 399       " [-0.1896001  -0.19320811 -0.22042497]\n",
 400       " [-0.46977668 -0.47652534 -0.22042497]]\n"
 401      ]
 402     }
 403    ],
 404    "source": [
 405     "X_scaled = (X - mu)/s\n",
 406     "print(X_scaled[0:5, :])"
 407    ]
 408   },
 409   {
 410    "cell_type": "markdown",
 411    "id": "fcb5a10e-b7ff-4a12-ba3a-2b52199caef8",
 412    "metadata": {},
 413    "source": [
 414     "The resulting scaled data should have column means approximately equal to zero and column standard deviations approximately equal to one."
 415    ]
 416   },
 417   {
 418    "cell_type": "code",
 419    "execution_count": 14,
 420    "id": "46b90923-692f-4d68-8d2a-29182da17dd2",
 421    "metadata": {},
 422    "outputs": [
 423     {
 424      "name": "stdout",
 425      "output_type": "stream",
 426      "text": [
 427       "X-scaled column means: [ 9.78053617e-16 -1.63890066e-16 -1.96271570e-16]\n",
 428       "X-scaled column sds: [1. 1. 1.]\n"
 429      ]
 430     }
 431    ],
 432    "source": [
 433     "print(f\"X-scaled column means: {X_scaled.mean(axis=0)}\")\n",
 434     "print(f\"X-scaled column sds: {X_scaled.std(axis=0)}\")"
 435    ]
 436   },
 437   {
 438    "cell_type": "markdown",
 439    "id": "f79b0a9b-c206-486e-8103-b1a029fde330",
 440    "metadata": {},
 441    "source": [
 442     "### Using `sklearn`"
 443    ]
 444   },
 445   {
 446    "cell_type": "code",
 447    "execution_count": 15,
 448    "id": "cbab5ddc-dbab-433c-9084-1dc0061b37f2",
 449    "metadata": {},
 450    "outputs": [],
 451    "source": [
 452     "scaler = StandardScaler()\n",
 453     "scaler.fit(X)\n",
 454     "X_scaled2 = scaler.transform(X)"
 455    ]
 456   },
 457   {
 458    "cell_type": "code",
 459    "execution_count": 16,
 460    "id": "e2fbe910-33e4-4847-86ff-9abf0e661273",
 461    "metadata": {},
 462    "outputs": [
 463     {
 464      "name": "stdout",
 465      "output_type": "stream",
 466      "text": [
 467       "[[-0.41339059 -0.42131576 -0.22042497]\n",
 468       " [-0.24127367 -0.24600672 -0.22042497]\n",
 469       " [-0.11779007 -0.12025485 -0.22042497]\n",
 470       " [-0.1896001  -0.19320811 -0.22042497]\n",
 471       " [-0.46977668 -0.47652534 -0.22042497]]\n"
 472      ]
 473     }
 474    ],
 475    "source": [
 476     "print(X_scaled2[0:5, :])"
 477    ]
 478   },
 479   {
 480    "cell_type": "markdown",
 481    "id": "4ad5c920-144a-44d4-83b0-e4184cbacb96",
 482    "metadata": {},
 483    "source": [
 484     "### Compare Difference\n",
 485     "\n",
 486     "The maximum absolute difference between the two methods should be zero or on the order of machine epsilon (about $2.2 \\times 10^{-16}$ for 64-bit floats)."
 487    ]
 488   },
 489   {
 490    "cell_type": "code",
 491    "execution_count": 17,
 492    "id": "eaee74cb-d978-4b8e-9fc5-121671afc087",
 493    "metadata": {},
 494    "outputs": [
 495     {
 496      "data": {
 497       "text/plain": [
 498        "0.0"
 499       ]
 500      },
 501      "execution_count": 17,
 502      "metadata": {},
 503      "output_type": "execute_result"
 504     }
 505    ],
 506    "source": [
 507     "np.max(np.abs(X_scaled - X_scaled2))"
 508    ]
 509   },
 510   {
 511    "cell_type": "markdown",
 512    "id": "62673faf-061c-415d-83b8-f786a87a69ad",
 513    "metadata": {},
 514    "source": [
 515     "## References\n",
 516     "\n",
 517     "- `MinMaxScaler` from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html\n",
 518     "\n",
 519     "- `StandardScaler` from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"
 520    ]
 521   }
 530  ],
 531  "metadata": {
 532   "kernelspec": {
 533    "display_name": "Python 3 (ipykernel)",
 534    "language": "python",
 535    "name": "python3"
 536   },
 537   "language_info": {
 538    "codemirror_mode": {
 539     "name": "ipython",
 540     "version": 3
 541    },
 542    "file_extension": ".py",
 543    "mimetype": "text/x-python",
 544    "name": "python",
 545    "nbconvert_exporter": "python",
 546    "pygments_lexer": "ipython3",
 547    "version": "3.9.12"
 548   }
 549  },
 550  "nbformat": 4,
 551  "nbformat_minor": 5
 552 }