{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Approaches to Data Processing \n", "\n", "#### OLTP \n", "online transaction processing\n", "- purpose: supports daily transactions\n", "- design: application oriented\n", "- data: up to date, operational\n", "- size: snapshots, GBs, traditional database\n", "- queries: simple transactions and frequent updates\n", "- users: thousands\n", "\n", "#### OLAP\n", "online analytical processing\n", "- purpose: report and analyze data\n", "- design: subject oriented\n", "- data: consolidated, historical\n", "- size: archive, TBs, data warehouse\n", "- queries: complex, aggregate queries and limited updates\n", "- users: hundreds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing Data\n", "\n", "#### structured data\n", "- follows a schema\n", "- defined datatypes and relationships\n", "- SQL, tables \n", "- easier to analyze, less flexible\n", "#### unstructured data\n", "- schemaless\n", "- makes up majority of data\n", "- photos, logs, MP3\n", "- harder to analyze, more flexible\n", "#### semistructured data\n", "- does not follow larger schema\n", "- self-describing structure\n", "- NoSQL, XML, JSON\n", "\n", "#### data warehouses\n", "- optimized for analytics\n", "- organized for reading and aggregating data\n", "- usually read-only\n", "- contains data from multiple sources\n", "- Massively Parallel Processing (MPP)\n", "- denormalized schema and dimensional modeling\n", "- Amazon Redshift, Azure SQL Data Warehouse, Google Big Query\n", "- schema is schema on write (predefined)\n", " - data mart\n", " - subset of a data warehouse\n", " - dedicated to a specific topic\n", "\n", "#### data lake\n", "- storing data of all structures = flexibility and scalability\n", "- for analyzing big data\n", "- lower cost than a warehouse\n", "- can store petabytes of data\n", "- schema is schema on read (determined when processed)\n", "- uncatalogged data lake is a data swamp\n", "- run big data analytics through spark or hadoop\n", "- Google Cloud Storage, AWS Data Lake, Azure Data Lake\n", "#### dataflow approaches\n", "##### ETL\n", "extract, transform, load\n", "- traditional form\n", "- data is transformed before storage\n", "##### ELT\n", "extract, load, transform\n", "- more common among data lakes (big data)\n", "- data is stored in native form, then transformed as needed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Database Design\n", "- determines how data is logically stored\n", "- how data will be read and updated\n", "- uses **_database models_** - high-level specs for database structure\n", " - most popular: relational model\n", " - other options: NoSQL models, object oriented model, network model\n", "- uses **_schemas_** - blueprint of the database\n", "- defines tables, fields, relationships, indexes and views\n", "- when inserting data in relational databases, schemas must be respected\n", "#### Step One: Data modeling\n", "- process of creating a data model for the data to be stored\n", "- **_three levels to a data model:_**\n", " - conceptual data model\n", " - Tools: data structure diagrams aka entity-relational diagrams and UML diagrams\n", " - logical data model\n", " - Tools: database models and schemas aka relational model and star schema\n", " - physical data model\n", " - Tools: partitions, CPUs, indexes, backup systems, tablespaces\n", "##### dimensional models\n", "- adaptation of relational model for warehouse design\n", "- optimized for OLAP queries and aggregating data, not updating (OLTP)\n", "- uses a **_star schema_** (one-dimensional) or an extension of the star schema called a **_snowflake schema_**(multi-dimensional)\n", "- easy to interpret and extend\n", "- organized by what is being analyzed and how often the entities change\n", "- **_two types of tables_**\n", " - FACT TABLES\n", " - decided by business use case\n", " - holds records of a metric\n", " - changes regularly\n", " - connects to dimensions via foreign keys\n", " - DIMENSION TABLES\n", " - holds descriptions of attributes\n", " - does not change as often\n", "#### Normalization\n", "- database design technique\n", "- divides tables into smaller tables and connects them via relationships\n", "- the goal is to reduce redundancy and increase data integrity\n", "- saves space by eliminating data redundancy\n", "- queries are longer and more complex\n", "- enforces data consistency\n", "- safer updating, removing and inserting\n", "- easier to redesign by extending\n", "- OLTP preferred method, not OLAP\n", "\n", " #adding foreign key to a star table\n", " ALTER TABLE main_table ADD CONSTRAINT new_normalized_table\n", " FOREIGN KEY (new_normalized_key) REFERENCES star_table_name (star_table_key);\n", " \n", " #Create a new table with a distinct column\n", " CREATE TABLE new_normalized_table (\n", " new_column DATATYPE NOT NULL\n", " );\n", "\n", " #Insert data from star table \n", " INSERT INTO new_normalized_table\n", " SELECT DISTINCT star_table_column FROM star_table_name;\n", "\n", " #Add a primary key \n", " ALTER TABLE new_normalized_table ADD COLUMN new_primary_key SERIAL PRIMARY KEY;\n", " \n", "##### normal forms (from least to most normalized)\n", "- first normal form (1NF)\n", " - each record must be unique - no duplicate rows\n", " - each cell must hold one value\n", "- second normal form (2NF)\n", " - must satisfy 1NF\n", " - primary key is one column\n", " - if there is a composite (two or more columns) primary key, then each non-key column must be dependent on all the keys\n", "- third normal form (3NF)\n", " - must satisfy 2NF\n", " - no transitive dependencies: non-key columns cannot depend on other non-key columns\n", "- elementary normal key form (EKNF)\n", "- boyce-codd normal form (BCNF)\n", "- fourth normal form (4NF)\n", "- essential tuple normal form (ETNF)\n", "- fifth normal form (5NF)\n", "- domain-key normal form (DKNF)\n", "- sixth normal form (6NF)\n", "##### anomolies if not normalized enough\n", "- update anomoly\n", " - data inconsistency caused by redundancy when updating\n", "- insertion anomoly\n", " - unable to add a record due to missing attributes\n", "-deletion anomoly\n", " - deletion of record unintentionally causes loss of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Database Views\n", "##### unmaterialized\n", "- virtual table that is not part of the physical schema\n", "- query is not data, but is stored in memory\n", "- data is aggregated from data in tables\n", "- can be queried like a regular table\\\n", "- no need to retype common queries or alter schemas\n", "##### materialized \n", "- physically materialized\n", "- stores the query results\n", "- accesses the stored query results\n", "- refreshed or rematerialized when prompted or scheduled\n", "- good for long running queries\n", "- good for OLAP type data\n", "- refreshing schedules handled through a scheduler like cron\n", "#### create an unmaterialized view\n", " CREATE VIEW view_name AS\n", " SELECT col1. col2\n", " FROM table_name\n", " WHERE condition;\n", "#### create a materialized view \n", " CREATE MATERIALIZED VIEW mv_name AS\n", " SELECT col1. col2\n", " FROM table_name\n", " WHERE condition;\n", "#### refresh a materialized view\n", " REFRESH MATERIALIZED VIEW mv_name;\n", "#### delete a view\n", " DROP VIEW view_name [CASCADE | RESTRICT]\n", "- RESTRICT returns an error if there are objects that depend on the view\n", "- CASCADE drops view and any object that depends on that view\n", "#### redefine a view\n", " CREATE OR REPLACE VIEW view_name AS new_query\n", "- if view_name exists, it is replaced\n", "- new_query must generate the same column names, order and data types as the old query\n", "- column output may be different\n", "- new columns can be added at the end\n", "#### query an unmaterialized view\n", " SELECT * FROM view_name\n", "#### viewing all view information\n", " #postgreSQL - will need to find reletive command per database type\n", " SELECT * FROM INFORMATION_SCHEMA.views\n", " #exclude system views\n", " WHERE table_schema NOT IN ('pg_catalog', 'information_schema'); \n", "#### managing views\n", "- grant access\n", "\n", " GRANT (type of privilege)\n", " #type of privilege: SELECT, INSERT, UPDATE, DELETE\n", " ON object #table, view, schema\n", " TO role FROM role #user or group of users\n", "- revoke access \n", "\n", " REVOKE (type of privilege) \n", " #type of privileges: SELECT, INSERT, UPDATE, DELETE\n", " FROM object #table, view, schema\n", " TO role FROM role #user or group of users\n", "**avoid modifying data through views, create new views and delete old views**\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Database Roles and Access Control\n", "#### roles\n", "- entity that contains information that\n", " - defines the role's privileges (login, db creation, writing to tables)\n", " - interacts with the authentication system (passwords)\n", "- roles can be assigned to one or more users\n", "- roles are global across db cluster installation\n", "##### create role\n", " CREATE ROLE user_name or group_name\n", " WITH PASSWORD 'password_string'\n", " VALID UNTIL 'YYYY-MM-DD';\n", "##### create admin role\n", " CREATE ROLE admin CREATEDB;\n", "##### alter role\n", " ALTER ROLE admin CREATEROLE;\n", "##### grant privilege\n", " GRANT (type of privilege)\n", " #type of privilege: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, TEMPORARY, EXECUTE, USAGE\n", " ON object #table, view, schema\n", " TO role FROM role #user or group of \n", "##### revoke privilege\n", " REVOKE (type of privilege) \n", " #type of privilege: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, TEMPORARY, EXECUTE, USAGE\n", " FROM object #table, view, schema\n", " TO role FROM role #user or group of users\n", "##### assign or remove a role to a group\n", " GRANT group_name TO user_name;\n", " REVOKE group_name FROM user_name;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table partitioning\n", "- distributes data over several separate entities\n", "- used for both OLAP and OLTP\n", "- will need to copy data from existing tables\n", "- constraints on partitions\n", "- can partition data to several machines (sharding)\n", "#### vertical partitioning\n", "- splits up a table by its columns\n", "- best used for splitting out column data that is hardly ever used\n", "#### horizontal partitioning\n", "- split tables over the rows\n", "\n", " # Create a new table called table_partitioned\n", " CREATE TABLE table_partitioned (\n", " primary_key INT,\n", " col1 TEXT NOT NULL,\n", " col2 TEXT\n", " )\n", " PARTITION BY LIST (col2);\n", "\n", " # Create the partitions for 2019, 2018, and 2017\n", " CREATE TABLE table_2019\n", " PARTITION OF table_partitioned FOR VALUES IN ('2019');\n", "\n", " CREATE TABLE table_2018\n", " PARTITION OF table_partitioned FOR VALUES IN ('2018');\n", "\n", " CREATE TABLE table_2017\n", " PARTITION OF table_partitioned FOR VALUES IN ('2017');\n", "\n", " # Insert the data into table_partitioned\n", " INSERT INTO table_partitioned\n", " SELECT primary_key, col1, col2 FROM table;\n", "\n", " # View film_partitioned\n", " SELECT * FROM table_partitioned;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data integration\n", "- combines data from different sources, formats and technologies to provide users with a translated and unified view of that data\n", "#### unified data model\n", "##### data sources\n", "- SQL, NoSQL, csv, images\n", "##### processing type\n", "- redshift, google, azure\n", "##### transformation\n", "- auto transform or individual\n", "- real time, or periodic transformation\n", "##### alerts\n", "- alert system for errors\n", "##### security\n", "- to prevent data breaches internally and externally\n", "##### governance\n", "- where data originated and where it is used at all times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Database Management System (DBMS)\n", "- system software for creating and maintaining databases\n", " - data\n", " - database schema\n", " - database engine\n", "#### types\n", "- depends on data use\n", "##### SQL DBMS\n", "- relational dbms\n", "- SQL language\n", "- works well with structured data\n", "##### NoSQL DBMS\n", "- less structured\n", "- document centered\n", "- good for rapid growth with no schema definitions\n", "- large quantities of data\n", "- types: \n", " - key-value store\n", " - combination of keys-value (JSON)\n", " - shopping cart for on-line buyer\n", " - Redis\n", " - document store\n", " - combination of key-value\n", " - values are structured (documents)\n", " - content management\n", " - mongoDB\n", " - columnar store\n", " - store data in columns\n", " - scalable\n", " - big data analytics\n", " - Cassandra\n", " - graph database\n", " - interconnected and represented as a graph\n", " - social networks, recommenders\n", " - neo4j" ] } ], "metadata": { "kernelspec": { "display_name": "learn-env", "language": "python", "name": "learn-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }