{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Approaches to Data Processing \n",
    "\n",
    "#### OLTP \n",
    "online transaction processing\n",
    "- purpose: supports daily transactions\n",
    "- design: application oriented\n",
    "- data: up to date, operational\n",
    "- size: snapshots, GBs, traditional database\n",
    "- queries: simple transactions and frequent updates\n",
    "- users: thousands\n",
    "\n",
    "#### OLAP\n",
    "online analytical processing\n",
    "- purpose: report and analyze data\n",
    "- design: subject oriented\n",
    "- data: consolidated, historical\n",
    "- size: archive, TBs, data warehouse\n",
    "- queries: complex, aggregate queries and limited updates\n",
    "- users: hundreds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Storing Data\n",
    "\n",
    "#### structured data\n",
    "- follows a schema\n",
    "- defined datatypes and relationships\n",
    "- SQL, tables \n",
    "- easier to analyze, less flexible\n",
    "#### unstructured data\n",
    "- schemaless\n",
    "- makes up majority of data\n",
    "- photos, logs, MP3\n",
    "- harder to analyze, more flexible\n",
    "#### semistructured data\n",
    "- does not follow larger schema\n",
    "- self-describing structure\n",
    "- NoSQL, XML, JSON\n",
    "\n",
    "#### data warehouses\n",
    "- optimized for analytics\n",
    "- organized for reading and aggregating data\n",
    "- usually read-only\n",
    "- contains data from multiple sources\n",
    "- Massively Parallel Processing (MPP)\n",
    "- denormalized schema and dimensional modeling\n",
    "- Amazon Redshift, Azure SQL Data Warehouse, Google Big Query\n",
    "- schema is schema on write (predefined)\n",
    "    - data mart\n",
    "        - subset of a data warehouse\n",
    "        - dedicated to a specific topic\n",
    "\n",
    "#### data lake\n",
    "- storing data of all structures = flexibility and scalability\n",
    "- for analyzing big data\n",
    "- lower cost than a warehouse\n",
    "- can store petabytes of data\n",
    "- schema is schema on read (determined when processed)\n",
    "- uncatalogged data lake is a data swamp\n",
    "- run big data analytics through spark or hadoop\n",
    "- Google Cloud Storage, AWS Data Lake, Azure Data Lake\n",
    "#### dataflow approaches\n",
    "##### ETL\n",
    "extract, transform, load\n",
    "- traditional form\n",
    "- data is transformed before storage\n",
    "##### ELT\n",
    "extract, load, transform\n",
    "- more common among data lakes (big data)\n",
    "- data is stored in native form, then transformed as needed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Database Design\n",
    "- determines how data is logically stored\n",
    "- how data will be read and updated\n",
    "- uses **_database models_** - high-level specs for database structure\n",
    "    - most popular: relational model\n",
    "    - other options: NoSQL models, object oriented model, network model\n",
    "- uses **_schemas_** - blueprint of the database\n",
    "- defines tables, fields, relationships, indexes and views\n",
    "- when inserting data in relational databases, schemas must be respected\n",
    "#### Step One: Data modeling\n",
    "- process of creating a data model for the data to be stored\n",
    "- **_three levels to a data model:_**\n",
    "     - conceptual data model\n",
    "         - Tools: data structure diagrams aka entity-relational diagrams and UML diagrams\n",
    "    - logical data model\n",
    "        - Tools: database models and schemas aka relational model and star schema\n",
    "    - physical data model\n",
    "        - Tools: partitions, CPUs, indexes, backup systems, tablespaces\n",
    "##### dimensional models\n",
    "- adaptation of relational model for warehouse design\n",
    "- optimized for OLAP queries and aggregating data, not updating (OLTP)\n",
    "- uses a **_star schema_** (one-dimensional) or an extension of the star schema called a **_snowflake schema_**(multi-dimensional)\n",
    "- easy to interpret and extend\n",
    "- organized by what is being analyzed and how often the entities change\n",
    "- **_two types of tables_**\n",
    "    - FACT TABLES\n",
    "        - decided by business use case\n",
    "        - holds records of a metric\n",
    "        - changes regularly\n",
    "        - connects to dimensions via foreign keys\n",
    "    - DIMENSION TABLES\n",
    "        - holds descriptions of attributes\n",
    "        - does not change as often\n",
    "#### Normalization\n",
    "- database design technique\n",
    "- divides tables into smaller tables and connects them via relationships\n",
    "- the goal is to reduce redundancy and increase data integrity\n",
    "- saves space by eliminating data redundancy\n",
    "- queries are longer and more complex\n",
    "- enforces data consistency\n",
    "- safer updating, removing and inserting\n",
    "- easier to redesign by extending\n",
    "- OLTP preferred method, not OLAP\n",
    "\n",
    "        #adding foreign key to a star table\n",
    "        ALTER TABLE main_table ADD CONSTRAINT new_normalized_table\n",
    "        FOREIGN KEY (new_normalized_key) REFERENCES star_table_name (star_table_key);\n",
    "        \n",
    "        #Create a new table with a distinct column\n",
    "        CREATE TABLE new_normalized_table (\n",
    "            new_column DATATYPE  NOT NULL\n",
    "        );\n",
    "\n",
    "        #Insert data from star table \n",
    "        INSERT INTO new_normalized_table\n",
    "        SELECT DISTINCT star_table_column FROM star_table_name;\n",
    "\n",
    "        #Add a primary key \n",
    "        ALTER TABLE new_normalized_table ADD COLUMN new_primary_key SERIAL PRIMARY KEY;\n",
    "        \n",
    "##### normal forms (from least to most normalized)\n",
    "- first normal form (1NF)\n",
    "    - each record must be unique - no duplicate rows\n",
    "    - each cell must hold one value\n",
    "- second normal form (2NF)\n",
    "    - must satisfy 1NF\n",
    "    - primary key is one column\n",
    "    - if there is a composite (two or more columns) primary key, then each non-key column must be dependent on all the keys\n",
    "- third normal form (3NF)\n",
    "    - must satisfy 2NF\n",
    "    - no transitive dependencies: non-key columns cannot depend on other non-key columns\n",
    "- elementary normal key form (EKNF)\n",
    "- boyce-codd normal form (BCNF)\n",
    "- fourth normal form (4NF)\n",
    "- essential tuple normal form (ETNF)\n",
    "- fifth normal form (5NF)\n",
    "- domain-key normal form (DKNF)\n",
    "- sixth normal form (6NF)\n",
    "##### anomolies if not normalized enough\n",
    "- update anomoly\n",
    "    - data inconsistency caused by redundancy when updating\n",
    "- insertion anomoly\n",
    "    - unable to add a record due to missing attributes\n",
    "-deletion anomoly\n",
    "    - deletion of record unintentionally causes loss of data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Database Views\n",
    "##### unmaterialized\n",
    "- virtual table that is not part of the physical schema\n",
    "- query is not data, but is stored in memory\n",
    "- data is aggregated from data in tables\n",
    "- can be queried like a regular table\\\n",
    "- no need to retype common queries or alter schemas\n",
    "##### materialized \n",
    "- physically materialized\n",
    "- stores the query results\n",
    "- accesses the stored query results\n",
    "- refreshed or rematerialized when prompted or scheduled\n",
    "- good for long running queries\n",
    "- good for OLAP type data\n",
    "- refreshing schedules handled through a scheduler like cron\n",
    "#### create an unmaterialized view\n",
    "    CREATE VIEW view_name AS\n",
    "    SELECT col1. col2\n",
    "    FROM table_name\n",
    "    WHERE condition;\n",
    "#### create a materialized view \n",
    "    CREATE MATERIALIZED VIEW mv_name AS\n",
    "    SELECT col1. col2\n",
    "        FROM table_name\n",
    "        WHERE condition;\n",
    "#### refresh a materialized view\n",
    "    REFRESH MATERIALIZED VIEW mv_name;\n",
    "#### delete a view\n",
    "    DROP VIEW view_name [CASCADE | RESTRICT]\n",
    "- RESTRICT returns an error if there are objects that depend on the view\n",
    "- CASCADE drops view and any object that depends on that view\n",
    "#### redefine a view\n",
    "    CREATE OR REPLACE VIEW view_name AS new_query\n",
    "- if view_name exists, it is replaced\n",
    "- new_query must generate the same column names, order and data types as the old query\n",
    "- column output may be different\n",
    "- new columns can be added at the end\n",
    "#### query an unmaterialized view\n",
    "    SELECT * FROM view_name\n",
    "#### viewing all view information\n",
    "    #postgreSQL - will need to find reletive command per database type\n",
    "    SELECT * FROM INFORMATION_SCHEMA.views\n",
    "    #exclude system views\n",
    "    WHERE table_schema NOT IN ('pg_catalog', 'information_schema'); \n",
    "#### managing views\n",
    "- grant access\n",
    "\n",
    "        GRANT (type of privilege)\n",
    "        #type of privilege: SELECT, INSERT, UPDATE, DELETE\n",
    "        ON object #table, view, schema\n",
    "        TO role FROM role #user or group of users\n",
    "- revoke access \n",
    "\n",
    "        REVOKE (type of privilege) \n",
    "        #type of privileges: SELECT, INSERT, UPDATE, DELETE\n",
    "        FROM object #table, view, schema\n",
    "        TO role FROM role #user or group of users\n",
    "**avoid modifying data through views, create new views and delete old views**\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Database Roles and Access Control\n",
    "#### roles\n",
    "- entity that contains information that\n",
    "    - defines the role's privileges (login, db creation, writing to tables)\n",
    "    - interacts with the authentication system (passwords)\n",
    "- roles can be assigned to one or more users\n",
    "- roles are global across db cluster installation\n",
    "##### create role\n",
    "    CREATE ROLE user_name or group_name\n",
    "    WITH PASSWORD 'password_string'\n",
    "    VALID UNTIL 'YYYY-MM-DD';\n",
    "##### create admin role\n",
    "    CREATE ROLE admin CREATEDB;\n",
    "##### alter role\n",
    "    ALTER ROLE admin CREATEROLE;\n",
    "##### grant privilege\n",
    "        GRANT (type of privilege)\n",
    "        #type of privilege: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, TEMPORARY, EXECUTE, USAGE\n",
    "        ON object #table, view, schema\n",
    "        TO role FROM role #user or group of \n",
    "##### revoke privilege\n",
    "        REVOKE (type of privilege) \n",
    "        #type of privilege: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, TEMPORARY, EXECUTE, USAGE\n",
    "        FROM object #table, view, schema\n",
    "        TO role FROM role #user or group of users\n",
    "##### assign or remove a role to a group\n",
    "    GRANT group_name TO user_name;\n",
    "    REVOKE group_name FROM user_name;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Table partitioning\n",
    "- distributes data over several separate entities\n",
    "- used for both OLAP and OLTP\n",
    "- will need to copy data from existing tables\n",
    "- constraints on partitions\n",
    "- can partition data to several machines (sharding)\n",
    "#### vertical partitioning\n",
    "- splits up a table by its columns\n",
    "- best used for splitting out column data that is hardly ever used\n",
    "#### horizontal partitioning\n",
    "- split tables over the rows\n",
    "\n",
    "        # Create a new table called table_partitioned\n",
    "        CREATE TABLE table_partitioned (\n",
    "          primary_key INT,\n",
    "          col1 TEXT NOT NULL,\n",
    "          col2 TEXT\n",
    "        )\n",
    "        PARTITION BY LIST (col2);\n",
    "\n",
    "        # Create the partitions for 2019, 2018, and 2017\n",
    "        CREATE TABLE table_2019\n",
    "            PARTITION OF table_partitioned FOR VALUES IN ('2019');\n",
    "\n",
    "        CREATE TABLE table_2018\n",
    "            PARTITION OF table_partitioned FOR VALUES IN ('2018');\n",
    "\n",
    "        CREATE TABLE table_2017\n",
    "            PARTITION OF table_partitioned FOR VALUES IN ('2017');\n",
    "\n",
    "        # Insert the data into table_partitioned\n",
    "        INSERT INTO table_partitioned\n",
    "        SELECT primary_key, col1, col2 FROM table;\n",
    "\n",
    "        # View film_partitioned\n",
    "        SELECT * FROM table_partitioned;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data integration\n",
    "- combines data from different sources, formats and technologies to provide users with a translated and unified view of that data\n",
    "#### unified data model\n",
    "##### data sources\n",
    "- SQL, NoSQL, csv, images\n",
    "##### processing type\n",
    "- redshift, google, azure\n",
    "##### transformation\n",
    "- auto transform or individual\n",
    "- real time, or periodic transformation\n",
    "##### alerts\n",
    "- alert system for errors\n",
    "##### security\n",
    "- to prevent data breaches internally and externally\n",
    "##### governance\n",
    "- where data originated and where it is used at all times"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Database Management System (DBMS)\n",
    "- system software for creating and maintaining databases\n",
    "    - data\n",
    "    - database schema\n",
    "    - database engine\n",
    "#### types\n",
    "- depends on data use\n",
    "##### SQL DBMS\n",
    "- relational dbms\n",
    "- SQL language\n",
    "- works well with structured data\n",
    "##### NoSQL DBMS\n",
    "- less structured\n",
    "- document centered\n",
    "- good for rapid growth with no schema definitions\n",
    "- large quantities of data\n",
    "- types: \n",
    "    - key-value store\n",
    "        - combination of keys-value (JSON)\n",
    "        - shopping cart for on-line buyer\n",
    "        - Redis\n",
    "    - document store\n",
    "        - combination of key-value\n",
    "        - values are structured (documents)\n",
    "        - content management\n",
    "        - mongoDB\n",
    "    - columnar store\n",
    "        - store data in columns\n",
    "        - scalable\n",
    "        - big data analytics\n",
    "        - Cassandra\n",
    "    - graph database\n",
    "        - interconnected and represented as a graph\n",
    "        - social networks, recommenders\n",
    "        - neo4j"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "learn-env",
   "language": "python",
   "name": "learn-env"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}