[ { "course": "data-engineering-zoomcamp", "documents": [ { "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.", "section": "General course-related questions", "question": "Course - When will the course start?" }, { "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites", "section": "General course-related questions", "question": "Course - What are the prerequisites for this course?" }, { "text": "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", "section": "General course-related questions", "question": "Course - Can I still join the course after the start date?" }, { "text": "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.", "section": "General course-related questions", "question": "Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?" }, { "text": "You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.", "section": "General course-related questions", "question": "Course - What can I do before the course starts?" }, { "text": "There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp \u201clive\u201d cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you\u2019re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any \u201clive\u201d cohort.", "section": "General course-related questions", "question": "Course - how many Zoomcamps in a year?" }, { "text": "Yes. For the 2024 edition we are using Mage AI instead of Prefect and re-recorded the terraform videos, For 2023, we used Prefect instead of Airflow..", "section": "General course-related questions", "question": "Course - Is the current cohort going to be different from the previous cohort?" }, { "text": "Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.", "section": "General course-related questions", "question": "Course - Can I follow the course after it finishes?" }, { "text": "Yes, the slack channel remains open and you can ask questions there. 
But always search the channel first and, second, check the FAQ (this document); most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don\u2019t rely on its answers 100%, it is pretty good though.", "section": "General course-related questions", "question": "Course - Can I get support if I take the course in the self-paced mode?" }, { "text": "All the main videos are stored in the Main \u201cDATA ENGINEERING\u201d playlist (no year specified). The Github repository has also been updated to show each video with a thumbnail that brings you directly to the same playlist below.\nBelow is the MAIN PLAYLIST. Then refer to the year-specific playlist for additional videos for that year, like the office hours videos. Also find this playlist pinned to the slack channel.\nhttps://youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&si=NspQhtZhZQs1B9F-", "section": "General course-related questions", "question": "Course - Which playlist on YouTube should I refer to?" }, { "text": "It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]\nYou can also calculate it yourself using this data and then update this answer.", "section": "General course-related questions", "question": "Course - How many hours per week am I expected to spend on this course?" }, { "text": "No, you can only get a certificate if you finish the course with a \u201clive\u201d cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.", "section": "General course-related questions", "question": "Certificate - Can I follow the course in a self-paced mode and get a certificate?" }, { "text": "The zoom link is only published to instructors/presenters/TAs.\nStudents participate via Youtube Live and submit questions to Slido (link would be pinned in the chat when Alexey goes Live). The video URL should be posted in the announcements channel on Telegram & Slack before it begins. Also, you will see it live on the DataTalksClub YouTube Channel.\nDon\u2019t post your questions in chat as it would be off-screen before the instructors/moderators have a chance to answer it if the room is very active.", "section": "General course-related questions", "question": "Office Hours - What is the video/zoom link to the stream for the \u201cOffice Hour\u201d or workshop sessions?" }, { "text": "Yes! Every \u201cOffice Hours\u201d will be recorded and available a few minutes after the live session is over; so you can view (or rewatch) whenever you want.", "section": "General course-related questions", "question": "Office Hours - I can\u2019t attend the \u201cOffice hours\u201d / workshop, will it be recorded?" }, { "text": "You can find the latest and up-to-date deadlines here: https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml\nAlso, take note of Announcements from @Au-Tomator for any extensions or other news. Or, the form may also show the updated deadline, if Instructor(s) has updated it.", "section": "General course-related questions", "question": "Homework - What are homework and project deadlines?" }, { "text": "No, late submissions are not allowed. 
But if the form is still not closed and it\u2019s after the due date, you can still submit the homework. Confirm your submission by the date-timestamp on the Course page.\nOlder news: [source1] [source2]", "section": "General course-related questions", "question": "Homework - Are late submissions of homework allowed?" }, { "text": "Answer: In short, it\u2019s your repository on GitHub, GitLab, Bitbucket, etc.\nIn long, your repository or any other location where you have your code, such that a reasonable person would look at it and think: yes, you went through the week and the exercises.", "section": "General course-related questions", "question": "Homework - What is the homework URL in the homework link?" }, { "text": "After you submit your homework it will be graded based on the number of questions in that particular homework. You can see how many points you have right on the page of the homework, up top. Additionally, in the leaderboard you will find the sum of all points you\u2019ve earned - points for Homeworks, FAQs and Learning in Public. Homework points are clear; the others work as follows: if you submit something to the FAQ, you get one point, and for each learning-in-public link you get one point.\n(https://datatalks-club.slack.com/archives/C01FABYF2RG/p1706846846359379?thread_ts=1706825019.546229&cid=C01FABYF2RG)", "section": "General course-related questions", "question": "Homework and Leaderboard - what is the system for points in the course management platform?" }, { "text": "When you set up your account you are automatically assigned a random name such as \u201cLucid Elbakyan\u201d. If you want to see what your Display name is:\nGo to the Homework submission link \u2192 https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2 - Log in > Click on \u2018Data Engineering Zoom Camp 2024\u2019 > click on \u2018Edit Course Profile\u2019 - your display name is there, and you can also change it should you wish.", "section": "General course-related questions", "question": "Leaderboard - I am not on the leaderboard / how do I know which one I am on the leaderboard?" }, { "text": "Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source]\nBut Python 3.10 and 3.11 should work fine.", "section": "General course-related questions", "question": "Environment - Is Python 3.9 still the recommended version to use in 2024?" }, { "text": "You can set it up on your laptop or PC if you prefer to work locally.\nYou might face some challenges, especially as a Windows user.\nIf you prefer to work on the local machine, you may start with the week 1 Introduction to Docker and follow through.\nHowever, if you prefer to set up a virtual machine, you may start with these first:\nUsing GitHub Codespaces\nSetting up the environment on a cloud VM\nI decided to work on a virtual machine because I have different laptops & PCs for my home & office, so I can work on this boot camp virtually anywhere.", "section": "General course-related questions", "question": "Environment - Should I use my local machine, GCP, or GitHub Codespaces for my environment?" }, { "text": "GitHub Codespaces offers you Linux computing resources with many pre-installed tools (Docker, Docker Compose, Python).\nYou can also open any GitHub repository in a GitHub Codespace.", "section": "General course-related questions", "question": "Environment - Is GitHub codespaces an alternative to using cli/git bash to ingest the data and create a docker file?" 
}, { "text": "It's up to you which platform and environment you use for the course.\nGithub codespaces or GCP VM are just possible options, but you can do the entire course from your laptop.", "section": "General course-related questions", "question": "Environment - Do we really have to use GitHub codespaces? I already have PostgreSQL & Docker installed." }, { "text": "Choose the approach that aligns the most with your idea for the end project\nOne of those should suffice. However, BigQuery, which is part of GCP, will be used, so learning that is probably a better option. Or you can set up a local environment for most of this course.", "section": "General course-related questions", "question": "Environment - Do I need both GitHub Codespaces and GCP?" }, { "text": "1. To open Run command window, you can either:\n(1-1) Use the shortcut keys: 'Windows + R', or\n(1-2) Right Click \"Start\", and click \"Run\" to open.\n2. Registry Values Located in Registry Editor, to open it: Type 'regedit' in the Run command window, and then press Enter.' 3. Now you can change the registry values \"Autorun\" in \"HKEY_CURRENT_USER\\Software\\Microsoft\\Command Processor\" from \"if exists\" to a blank.\nAlternatively, You can simplify the solution by deleting the fingerprint saved within the known_hosts file. In Windows, this file is placed at C:\\Users\\<your_user_name>\\.ssh\\known_host", "section": "General course-related questions", "question": "This happens when attempting to connect to a GCP VM using VSCode on a Windows machine. Changing registry value in registry editor" }, { "text": "For uniformity at least, but you\u2019re not restricted to GCP, you can use other cloud platforms like AWS if you\u2019re comfortable with other cloud platforms, since you get every service that\u2019s been provided by GCP in Azure and AWS or others..\nBecause everyone has a google account, GCP has a free trial period and gives $300 in credits to new users. Also, we are working with BigQuery, which is a part of GCP.\nNote that to sign up for a free GCP account, you must have a valid credit card.", "section": "General course-related questions", "question": "Environment - Why are we using GCP and not other cloud providers?" }, { "text": "No, if you use GCP and take advantage of their free trial.", "section": "General course-related questions", "question": "Should I pay for cloud services?" }, { "text": "You can do most of the course without a cloud. Almost everything we use (excluding BigQuery) can be run locally. We won\u2019t be able to provide guidelines for some things, but most of the materials are runnable without GCP.\nFor everything in the course, there\u2019s a local alternative. You could even do the whole course locally.", "section": "General course-related questions", "question": "Environment - The GCP and other cloud providers are unavailable in some countries. Is it possible to provide a guide to installing a home lab?" }, { "text": "Yes, you can. Just remember to adapt all the information on the videos to AWS. Besides, the final capstone will be evaluated based on the task: Create a data pipeline! Develop a visualisation!\nThe problem would be when you need help. You\u2019d need to rely on fellow coursemates who also use AWS (or have experience using it before), which might be in smaller numbers than those learning the course with GCP.\nAlso see Is it possible to use x tool instead of the one tool you use?", "section": "General course-related questions", "question": "Environment - I want to use AWS. 
May I do that?" }, { "text": "We will probably have some calls during the Capstone period to clear some questions but it will be announced in advance if that happens.", "section": "General course-related questions", "question": "Besides the \u201cOffice Hour\u201d which are the live zoom calls?" }, { "text": "We will use the same data, as the project will essentially remain the same as last year\u2019s. The data is available here", "section": "General course-related questions", "question": "Are we still using the NYC Trip data for January 2021? Or are we using the 2022 data?" }, { "text": "No, but we moved the 2022 stuff here", "section": "General course-related questions", "question": "Is the 2022 repo deleted?" }, { "text": "Yes, you can use any tool you want for your project.", "section": "General course-related questions", "question": "Can I use Airflow instead for my final project?" }, { "text": "Yes, this applies if you want to use Airflow or Prefect instead of Mage, AWS or Snowflake instead of GCP products or Tableau instead of Metabase or Google data studio.\nThe course covers 2 alternative data stacks, one using GCP and one using local installation of everything. You can use one of them or use your tool of choice.\nShould you consider it instead of the one tool you use? That we can\u2019t support you if you choose to use a different stack, also you would need to explain the different choices of tool for the peer review of your capstone project.", "section": "General course-related questions", "question": "Is it possible to use tool \u201cX\u201d instead of the one tool you use in the course?" }, { "text": "Star the repo! Share it with friends if you find it useful \u2763\ufe0f\nCreate a PR if you see you can improve the text or the structure of the repository.", "section": "General course-related questions", "question": "How can we contribute to the course?" }, { "text": "Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully", "section": "General course-related questions", "question": "Environment - Is the course [Windows/mac/Linux/...] friendly?" }, { "text": "Have no idea how past cohorts got past this as I haven't read old slack messages, and no FAQ entries that I can find.\nLater modules (module-05 & RisingWave workshop) use shell scripts in *.sh files and most Windows users not using WSL would hit a wall and cannot continue, even in git bash or MINGW64. This is why WSL environment setup is recommended from the start.", "section": "General course-related questions", "question": "Environment - Roadblock for Windows users in modules with *.sh (shell scripts)." }, { "text": "Yes to both! check out this document: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/awesome-data-engineering.md", "section": "General course-related questions", "question": "Any books or additional resources you recommend?" }, { "text": "You will have two attempts for a project. If the first project deadline is over and you\u2019re late or you submit the project and fail the first attempt, you have another chance to submit the project with the second attempt.", "section": "General course-related questions", "question": "Project - What is Project Attemp #1 and Project Attempt #2 exactly?" }, { "text": "The first step is to try to solve the issue on your own. Get used to solving problems and reading documentation. This will be a real life skill you need when employed. [ctrl+f] is your friend, use it! 
It is a universal shortcut and works in all apps/browsers.\nWhat does the error say? There will often be a description of the error or instructions on what is needed or even how to fix it. I have even seen a link to the solution. Does it reference a specific line of your code?\nRestart app or server/pc.\nGoogle it, use ChatGPT, Bing AI etc.\nIt is going to be rare that you are the first to have the problem, someone out there has posted the fly issue and likely the solution.\nSearch using: <technology> <problem statement>. Example: pgcli error column c.relhasoids does not exist.\nThere are often different solutions for the same problem due to variation in environments.\nCheck the tech\u2019s documentation. Use its search if available or use the browsers search function.\nTry uninstall (this may remove the bad actor) and reinstall of application or reimplementation of action. Remember to restart the server/pc for reinstalls.\nSometimes reinstalling fails to resolve the issue but works if you uninstall first.\nPost your question to Stackoverflow. Read the Stackoverflow guide on posting good questions.\nhttps://stackoverflow.com/help/how-to-ask\nThis will be your real life. Ask an expert in the future (in addition to coworkers).\nAsk in Slack\nBefore asking a question,\nCheck Pins (where the shortcut to the repo and this FAQ is located)\nUse the slack app\u2019s search function\nUse the bot @ZoomcampQABot to do the search for you\ncheck the FAQ (this document), use search [ctrl+f]\nWhen asking a question, include as much information as possible:\nWhat are you coding on? What OS?\nWhat command did you run, which video did you follow? Etc etc\nWhat error did you get? Does it have a line number to the \u201coffending\u201d code and have you check it for typos?\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.\nDO NOT use screenshots, especially don\u2019t take pictures from a phone.\nDO NOT tag instructors, it may discourage others from helping you. Copy and paste errors; if it\u2019s long, just post it in a reply to your thread.\nUse ``` for formatting your code.\nUse the same thread for the conversation (that means reply to your own thread).\nDO NOT create multiple posts to discuss the issue.\nlearYou may create a new post if the issue reemerges down the road. Describe what has changed in the environment.\nProvide additional information in the same thread of the steps you have taken for resolution.\nTake a break and come back later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, workout, play a video game, watch a tv show, whatever allows your brain to not think about it for a little while or even until the next day.\nRemember technology issues in real life sometimes take days or even weeks to resolve.\nIf somebody helped you with your problem and it's not in the FAQ, please add it there. It will help other students.", "section": "General course-related questions", "question": "How to troubleshoot issues" }, { "text": "When the troubleshooting guide above does not help resolve it and you need another pair of eyeballs to spot mistakes. When asking a question, include as much information as possible:\nWhat are you coding on? What OS?\nWhat command did you run, which video did you follow? Etc etc\nWhat error did you get? 
Does it have a line number to the \u201coffending\u201d code, and have you checked it for typos?\nWhat have you tried that did not work? This answer is crucial because, without it, helpers would first ask you to try the suggestions in the error log. Or just read this FAQ document.", "section": "General course-related questions", "question": "How to ask questions" }, { "text": "After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub\nHaving this local repository on your computer will make it easy for you to access the instructors\u2019 code and make pull requests (if you want to add your own notes or make changes to the course content).\nYou will probably also create your own repositories to host your notes and your versions of the files. Here is a great tutorial that shows you how to do this: https://www.atlassian.com/git/tutorials/setting-up-a-repository\nRemember to ignore large database, .csv, and .gz files, and other files that should not be saved to a repository. Use .gitignore for this: https://www.atlassian.com/git/tutorials/saving-changes/gitignore. NEVER store passwords or keys in a git repo (even if that repo is set to private).\nThis is also a great resource: https://dangitgit.com/", "section": "General course-related questions", "question": "How do I use Git / GitHub for this course?" }, { "text": "Error: Makefile:2: *** missing separator. Stop.\nSolution: The Makefile must be indented with real Tab characters instead of spaces; configure VS Code to use tabs for this file. Follow this Stack Overflow thread.", "section": "General course-related questions", "question": "VS Code: Tab using spaces" }, { "text": "If you\u2019re running Linux on Windows Subsystem for Linux (WSL) 2, you can open HTML files from the guest (Linux) with whatever Internet Browser you have installed on the host (Windows). Just install wslu and open the page with wslview <file>, for example:\nwslview index.html\nYou can customise which browser to use by setting the BROWSER environment variable first. For example:\nexport BROWSER='/mnt/c/Program Files/Firefox/firefox.exe'", "section": "General course-related questions", "question": "Opening an HTML file with a Windows browser from Linux running on WSL" }, { "text": "This tutorial shows you how to set up the Chrome Remote Desktop service on a Debian Linux virtual machine (VM) instance on Compute Engine. Chrome Remote Desktop allows you to remotely access applications with a graphical user interface.\nTaxi Data - Yellow Taxi Trip Records downloading error, Error no or XML error webpage\nWhen you try to download the 2021 data from the TLC website, you get an XML error page if you click on the link, and ERROR 403: Forbidden on the terminal.\nWe have a backup, so use it instead: https://github.com/DataTalksClub/nyc-tlc-data\nSo the link should be https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz\nNote: Make sure to unzip the \u201cgz\u201d file (no, the \u201cunzip\u201d command won\u2019t work for this); use:\n\u201cgzip -d file.gz\u201d", "section": "Module 1: Docker and Terraform", "question": "Set up Chrome Remote Desktop for Linux on Compute Engine" }, { "text": "In this video, we store the data file as \u201coutput.csv\u201d. The data file won\u2019t store correctly if the file extension is csv.gz instead of csv. One alternative is to replace csv_name = \u201coutput.csv\u201d with the file name given at the end of the URL. 
Notice that the URL for the yellow taxi data is: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz where the highlighted part is the name of the file. We can parse this file name from the URL and use it as csv_name. That is, we can replace csv_name = \u201coutput.csv\u201d with\ncsv_name = url.split(\u201c/\u201d)[-1] . Then when we use csv_name to using pd.read_csv, there won\u2019t be an issue even though the file name really has the extension csv.gz instead of csv since the pandas read_csv function can read csv.gz files directly.", "section": "Module 1: Docker and Terraform", "question": "Taxi Data - How to handle taxi data files, now that the files are available as *.csv.gz?" }, { "text": "Yellow Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf\nGreen Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf", "section": "Module 1: Docker and Terraform", "question": "Taxi Data - Data Dictionary for NY Taxi data?" }, { "text": "You can unzip this downloaded parquet file, in the command line. The result is a csv file which can be imported with pandas using the pd.read_csv() shown in the videos.\n\u2018\u2019\u2019gunzip green_tripdata_2019-09.csv.gz\u2019\u2019\u2019\nSOLUTION TO USING PARQUET FILES DIRECTLY IN PYTHON SCRIPT ingest_data.py\nIn the def main(params) add this line\nparquet_name= 'output.parquet'\nThen edit the code which downloads the files\nos.system(f\"wget {url} -O {parquet_name}\")\nConvert the download .parquet file to csv and rename as csv_name to keep it relevant to the rest of the code\ndf = pd.read_parquet(parquet_name)\ndf.to_csv(csv_name, index=False)", "section": "Module 1: Docker and Terraform", "question": "Taxi Data - Unzip Parquet file" }, { "text": "\u201cwget is not recognized as an internal or external command\u201d, you need to install it.\nOn Ubuntu, run:\n$ sudo apt-get install wget\nOn MacOS, the easiest way to install wget is to use Brew:\n$ brew install wget\nOn Windows, the easiest way to install wget is to use Chocolatey:\n$ choco install wget\nOr you can download a binary (https://gnuwin32.sourceforge.net/packages/wget.htm) and put it to any location in your PATH (e.g. C:/tools/)\nAlso, you can following this step to install Wget on MS Windows\n* Download the latest wget binary for windows from [eternallybored] (https://eternallybored.org/misc/wget/) (they are available as a zip with documentation, or just an exe)\n* If you downloaded the zip, extract all (if windows built in zip utility gives an error, use [7-zip] (https://7-zip.org/)).\n* Rename the file `wget64.exe` to `wget.exe` if necessary.\n* Move wget.exe to your `Git\\mingw64\\bin\\`.\nAlternatively, you can use a Python wget library, but instead of simply using \u201cwget\u201d you\u2019ll need to use\npython -m wget\nYou need to install it with pip first:\npip install wget\nAlternatively, you can just paste the file URL into your web browser and download the file normally that way. You\u2019ll want to move the resulting file into your working directory.\nAlso recommended a look at the python library requests for the loading gz file https://pypi.org/project/requests", "section": "Module 1: Docker and Terraform", "question": "lwget is not recognized as an internal or external command" }, { "text": "Firstly, make sure that you add \u201c!\u201d before wget if you\u2019re running your command in a Jupyter Notebook or CLI. 
Then, you can try one of these 2 things (from the CLI):\nUsing the Python library wget you installed with pip, try python -m wget <url>\nWrite the usual command and add --no-check-certificate at the end. So it should be:\n!wget <website_url> --no-check-certificate", "section": "Module 1: Docker and Terraform", "question": "wget - ERROR: cannot verify <website> certificate (MacOS)" }, { "text": "For those who wish to use the backslash as an escape character in Git Bash for Windows (as Alexey normally does), type in the terminal: bash.escapeChar=\\ (no need to include in .bashrc)", "section": "Module 1: Docker and Terraform", "question": "Git Bash - Backslash as an escape character in Git Bash for Windows" }, { "text": "Instructions on how to store secrets that will be available in GitHub Codespaces:\nManaging your account-specific secrets for GitHub Codespaces - GitHub Docs", "section": "Module 1: Docker and Terraform", "question": "GitHub Codespaces - How to store secrets" }, { "text": "Make sure you're able to start the Docker daemon, and check the issue immediately below.\nAnd don\u2019t forget to update WSL; in PowerShell the command is wsl --update", "section": "Module 1: Docker and Terraform", "question": "Docker - Cannot connect to Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" }, { "text": "As the official Docker for Windows documentation says, the Docker engine can use either\nHyper-V or WSL2 as its backend. However, a few constraints might apply.\nWindows 10 Pro / 11 Pro Users: \nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\nWindows 10 Home / 11 Home Users: \nOn the other hand, users of the 'Home' version do NOT have the Hyper-V option enabled, which means you can only get Docker up and running using the WSL2 (Windows Subsystem for Linux) backend.\nYou can find the detailed instructions to do so here: https://pureinfotech.com/install-wsl-windows-11/\nIn case you run into another issue while trying to install WSL2 (WslRegisterDistribution failed with error: 0x800701bc), make sure you update the WSL2 Linux Kernel, following the guidelines here: \n\nhttps://github.com/microsoft/WSL/issues/5393", "section": "Module 1: Docker and Terraform", "question": "Docker - Error during connect: In the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect.: Post: \"http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/create\" : open //./pipe/docker_engine: The system cannot find the file specified" }, { "text": "Whenever a `docker pull` is performed (either manually or by `docker-compose up`), it attempts to fetch the given image name (pgadmin4, for the example above) from a repository (dbpage).\nIf the repository is public, the fetch and download happens without any issue whatsoever.\nFor instance:\ndocker pull postgres:13\ndocker pull dpage/pgadmin4\nBE ADVISED:\n\nThe Docker images we'll be using throughout the Data Engineering Zoomcamp are all public (except when or if explicitly said otherwise by the instructors or co-instructors).\n\nMeaning: you are NOT required to perform a docker login to fetch them. \n\nSo if you get the message above saying \"docker login': denied: requested access to the resource is denied. 
That is most likely due to a typo in your image name:\n\nFor instance:\n$ docker pull dbpage/pgadmin4\nWill throw that exception telling you \"repository does not exist or may require 'docker login'\nError response from daemon: pull access denied for dbpage/pgadmin4, repository does not exist or \nmay require 'docker login': denied: requested access to the resource is denied\nBut that actually happened because the actual image is dpage/pgadmin4 and NOT dbpage/pgadmin4\nHow to fix it:\n$ docker pull dpage/pgadmin4\nEXTRA NOTES:\nIn the real world, occasionally, when you're working for a company or closed organisation, the Docker image you're trying to fetch might be under a private repo that your DockerHub Username was granted access to.\nFor which cases, you must first execute:\n$ docker login\nFill in the details of your username and password.\nAnd only then perform the `docker pull` against that private repository\nWhy am I encountering a \"permission denied\" error when creating a PostgreSQL Docker container for the New York Taxi Database with a mounted volume on macOS M1?\nIssue Description:\nWhen attempting to run a Docker command similar to the one below:\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\mount\npostgres:13\nYou encounter the error message:\ndocker: Error response from daemon: error while creating mount source path '/path/to/ny_taxi_postgres_data': chown /path/to/ny_taxi_postgres_data: permission denied.\nSolution:\n1- Stop Rancher Desktop:\nIf you are using Rancher Desktop and face this issue, stop Rancher Desktop to resolve compatibility problems.\n2- Install Docker Desktop:\nInstall Docker Desktop, ensuring that it is properly configured and has the required permissions.\n2-Retry Docker Command:\nRun the Docker command again after switching to Docker Desktop. This step resolves compatibility issues on some systems.\nNote: The issue occurred because Rancher Desktop was in use. Switching to Docker Desktop resolves compatibility problems and allows for the successful creation of PostgreSQL containers with mounted volumes for the New York Taxi Database on macOS M1.", "section": "Module 1: Docker and Terraform", "question": "Docker - docker pull dbpage" }, { "text": "When I runned command to create postgre in docker container it created folder on my local machine to mount it to volume inside container. It has write and read protection and owned by user 999, so I could not delete it by simply drag to trash. My obsidian could not started due to access error, so I had to change placement of this folder and delete old folder by this command:\nsudo rm -r -f docker_test/\n- where `rm` - remove, `-r` - recursively, `-f` - force, `docker_test/` - folder.", "section": "Module 1: Docker and Terraform", "question": "Docker - can\u2019t delete local folder that mounted to docker volume" }, { "text": "First off, make sure you're running the latest version of Docker for Windows, which you can download from here. 
Sometimes using the menu to \"Upgrade\" doesn't work (which is another clear indicator for you to uninstall, and reinstall with the latest version)\nIf Docker is stuck on starting, first try to switch containers by right-clicking the Docker icon in the system tray and switching the containers from Windows to Linux or vice versa\n[Windows 10 / 11 Pro Edition] The Pro Edition of Windows can run Docker either by using Hyper-V or WSL2 as its backend (Docker Engine)\nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\nIf you opt-in for WSL2, you can follow the same steps as detailed in the tutorial here", "section": "Module 1: Docker and Terraform", "question": "Docker - Docker won't start or is stuck in settings (Windows 10 / 11)" }, { "text": "It is recommended by the Docker docs to store all code in your default Linux distro to get the best out of file system performance.\n[Windows 10 / 11 Home Edition] If you're running a Home Edition, you can still make it work with WSL2 (Windows Subsystem for Linux) by following the tutorial here\nIf, even after making sure your WSL2 (or Hyper-V) is set up accordingly, Docker remains stuck, you can try the option to Reset to Factory Defaults or do a fresh install.", "section": "Module 1: Docker and Terraform", "question": "Should I run docker commands from the windows file system or a file system of a Linux distribution in WSL?" }, { "text": "More info in the Docker Docs on Best Practises", "section": "Module 1: Docker and Terraform", "question": "Docker - Docs recommend storing all code in your default Linux distro to get the best out of file system performance (since Docker runs on WSL2 backend by default for Windows 10 Home / Windows 11 Home users)." }, { "text": "You may have this error:\n$ docker run -it ubuntu bash\nthe input device is not a TTY. 
If you are using mintty, try prefixing the command with 'winpty'\nSolution:\nUse winpty before the docker command (source):\n$ winpty docker run -it ubuntu bash\nYou can also make an alias:\necho \"alias docker='winpty docker'\" >> ~/.bashrc\nOR\necho \"alias docker='winpty docker'\" >> ~/.bash_profile", "section": "Module 1: Docker and Terraform", "question": "Docker - The input device is not a TTY (Docker run for Windows)" }, { "text": "You may have this error:\nRetrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')':\n/simple/pandas/\nPossible solution might be:\n$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9", "section": "Module 1: Docker and Terraform", "question": "Docker - Cannot pip install on Docker container (Windows)" }, { "text": "If, even after properly running the docker script, the folder is empty in VS Code, then try this (for Windows):\nwinpty docker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"C:\Users\abhin\dataengg\DE_Project_git_connected\DE_OLD\week1_set_up\docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data\" \\\n-p 5432:5432 \\\npostgres:13\nHere, quoting the absolute path in the -v parameter solves the issue, and all the files are visible in the VS Code ny_taxi folder as shown in the video", "section": "Module 1: Docker and Terraform", "question": "Docker - ny_taxi_postgres_data is empty" }, { "text": "Check this article for details - Setting up docker in macOS\nFrom researching, it seems this method might be out of date: since Docker changed their licensing model, the above is a bit hit and miss. What worked for me was to just go to the Docker website and download their dmg. Haven\u2019t had an issue with that method.", "section": "Module 1: Docker and Terraform", "question": "Docker - Setting up Docker on Mac" }, { "text": "$ docker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"admin\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"/mnt/path/to/ny_taxi_postgres_data\":\"/var/lib/postgresql/data\" \\\n-p 5432:5432 \\\npostgres:13\nThe files belonging to this database system will be owned by user \"postgres\".\nThe database cluster will be initialized with locale \"en_US.utf8\".\nThe default database encoding has accordingly been set to \"UTF8\".\nThe default text search configuration will be set to \"english\".\nData page checksums are disabled.\nfixing permissions on existing directory /var/lib/postgresql/data ... initdb: error: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted\nOne way to solve this issue is to create a local docker volume and map it to the postgres data directory /var/lib/postgresql/data\nThe volume name dtc_postgres_volume_local must match in both commands below\n$ docker volume create --name dtc_postgres_volume_local -d local\n$ docker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v dtc_postgres_volume_local:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nTo verify the above command works (WSL2 Ubuntu 22.04, verified 2024-Jan), go to the Docker Desktop app and look under Volumes - dtc_postgres_volume_local would be listed there. 
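You can also verify it from the CLI rather than Docker Desktop (a quick sketch, assuming the volume name used above):\ndocker volume ls\ndocker volume inspect dtc_postgres_volume_local\nThe Mountpoint value in the inspect output shows where Docker actually stores the volume data on disk. 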
The folder ny_taxi_postgres_data would however be empty, since we used an alternative config.\nAn alternate error could be:\ninitdb: error: directory \"/var/lib/postgresql/data\" exists but is not empty\nIf you want to create a new database system, either remove or empty the directory \"/var/lib/postgresql/data\" or run initdb with an argument other than \"/var/lib/postgresql/data\".", "section": "Module 1: Docker and Terraform", "question": "Docker - Could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted" }, { "text": "Mapping volumes on Windows could be tricky. The way it was done in the course video doesn\u2019t work for everyone.\nFirst, if your path contains spaces, move your data to some folder without spaces. E.g. if your code is in \u201cC:/Users/Alexey Grigorev/git/\u2026\u201d, move it to \u201cC:/git/\u2026\u201d\nTry replacing the \u201c-v\u201d part with one of the following options:\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n--volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/postgresql/data\nwinpty docker run -it\n-e POSTGRES_USER=\"root\"\n-e POSTGRES_PASSWORD=\"root\"\n-e POSTGRES_DB=\"ny_taxi\"\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-p 5432:5432\npostgres:13\nTry adding winpty before the whole command\nTry adding quotes:\n-v \"/c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"//c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"/c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"//c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"c:\some\path\ny_taxi_postgres_data\":/var/lib/postgresql/data\nNote: (Windows) if it automatically creates a folder called \u201cny_taxi_postgres_data;C\u201d, this suggests you have problems with volume mapping; try deleting both folders and replacing the \u201c-v\u201d part with other options. For me \u201c//c/\u201d works instead of \u201c/c/\u201d. And it will work by automatically creating a correct folder called \u201cny_taxi_postgres_data\u201d.\nA possible solution to this error would be to use /\u201d$(pwd)\u201d/ny_taxi_postgres_data:/var/lib/postgresql/data (with quotes\u2019 position varying as in the above list).\nYes, for Windows, use this command; it works perfectly fine\n-v /\u201d$(pwd)\u201d/ny_taxi_postgres_data:/var/lib/postgresql/data\nImportant: note how the quotes are placed.\nIf none of these options work, you can use a volume name instead of the path:\n-v ny_taxi_postgres_data:/var/lib/postgresql/data\nFor Mac: You can wrap $(pwd) with quotes as highlighted below.\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nSource: https://stackoverflow.com/questions/48522615/docker-error-invalid-reference-format-repository-name-must-be-lowercase", "section": "Module 1: Docker and Terraform", "question": "Docker - invalid reference format: repository name must be lowercase (Mounting volumes with Docker on Windows)" }, { "text": "Change the mounting path. 
Replace it with one of following:\n-v /e/zoomcamp/...:/var/lib/postgresql/data\n-v /c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data\\ (leading slash in front of c:)", "section": "Module 1: Docker and Terraform", "question": "Docker - Error response from daemon: invalid mode: \\Program Files\\Git\\var\\lib\\postgresql\\data." }, { "text": "When you run this command second time\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v <your path>:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nThe error message above could happen. That means you should not mount on the second run. This command helped me:\nWhen you run this command second time\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-p 5432:5432 \\\npostgres:13", "section": "Module 1: Docker and Terraform", "question": "Docker - Error response from daemon: error while creating buildmount source path '/run/desktop/mnt/host/c/<your path>': mkdir /run/desktop/mnt/host/c: file exists" }, { "text": "This error appeared when running the command: docker build -t taxi_ingest:v001 .\nWhen feeding the database with the data the user id of the directory ny_taxi_postgres_data was changed to 999, so my user couldn\u2019t access it when running the above command. Even though this is not the problem here it helped to raise the error due to the permission issue.\nSince at this point we only need the files Dockerfile and ingest_data.py, to fix this error one can run the docker build command on a different directory (having only these two files).\nA more complete explanation can be found here: https://stackoverflow.com/questions/41286028/docker-build-error-checking-context-cant-stat-c-users-username-appdata\nYou can fix the problem by changing the permission of the directory on ubuntu with following command:\nsudo chown -R $USER dir_path\nOn windows follow the link: https://thegeekpage.com/take-ownership-of-a-file-folder-through-command-prompt-in-windows-10/ \n\n\t\t\t\t\t\t\t\t\t\t\tAdded by\n\t\t\t\t\t\t\t\t\t\t\tKenan Arslanbay", "section": "Module 1: Docker and Terraform", "question": "Docker - build error: error checking context: 'can't stat '/home/user/repos/data-engineering/week_1_basics_n_setup/2_docker_sql/ny_taxi_postgres_data''." }, { "text": "You might have installed docker via snap. Run \u201csudo snap status docker\u201d to verify.\nIf you have \u201cerror: unknown command \"status\", see 'snap help'.\u201d as a response than deinstall docker and install via the official website\nBind for 0.0.0.0:5432 failed: port is a", "section": "Module 1: Docker and Terraform", "question": "Docker - ERRO[0000] error waiting for container: context canceled" }, { "text": "Found the issue in the PopOS linux. It happened because our user didn\u2019t have authorization rights to the host folder ( which also caused folder seems empty, but it didn\u2019t!).\n\u2705Solution:\nJust add permission for everyone to the corresponding folder\nsudo chmod -R 777 <path_to_folder>\nExample:\nsudo chmod -R 777 ny_taxi_postgres_data/", "section": "Module 1: Docker and Terraform", "question": "Docker - build error checking context: can\u2019t stat \u2018/home/fhrzn/Projects/\u2026./ny_taxi_postgres_data\u2019" }, { "text": "This happens on Ubuntu/Linux systems when trying to run the command to build the Docker container again.\n$ docker build -t taxi_ingest:v001 .\nA folder is created to host the Docker files. 
When the build command is executed again to rebuild the pipeline or create a new one the error is raised as there are no permissions on this new folder. Grant permissions by running this comtionmand;\n$ sudo chmod -R 755 ny_taxi_postgres_data\nOr use 777 if you still see problems. 755 grants write access to only the owner.", "section": "Module 1: Docker and Terraform", "question": "Docker - failed to solve with frontend dockerfile.v0: failed to read dockerfile: error from sender: open ny_taxi_postgres_data: permission denied." }, { "text": "Get the network name via: $ docker network ls.", "section": "Module 1: Docker and Terraform", "question": "Docker - Docker network name" }, { "text": "Sometimes, when you try to restart a docker image configured with a network name, the above message appears. In this case, use the following command with the appropriate container name:\n>>> If the container is running state, use docker stop <container_name>\n>>> then, docker rm pg-database\nOr use docker start instead of docker run in order to restart the docker image without removing it.", "section": "Module 1: Docker and Terraform", "question": "Docker - Error response from daemon: Conflict. The container name \"pg-database\" is already in use by container \u201cxxx\u201d. You have to remove (or rename) that container to be able to reuse that name." }, { "text": "Typical error: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name \"pgdatabase\" to address: Name or service not known\nWhen running docker-compose up -d see which network is created and use this for the ingestions script instead of pg-network and see the name of the database to use instead of pgdatabase\nE.g.:\npg-network becomes 2docker_default\nPgdatabase becomes 2docker-pgdatabase-1", "section": "Module 1: Docker and Terraform", "question": "Docker - ingestion when using docker-compose could not translate host name" }, { "text": "terraformRun this command before starting your VM:\nOn Intel CPU:\nmodprobe -r kvm_intel\nmodprobe kvm_intel nested=1\nOn AMD CPU:\nmodprobe -r kvm_amd\nmodprobe kvm_amd nested=1", "section": "Module 1: Docker and Terraform", "question": "Docker - Cannot install docker on MacOS/Windows 11 VM running on top of Linux (due to Nested virtualization)." }, { "text": "It\u2019s very easy to manage your docker container, images, network and compose projects from VS Code.\nJust install the official extension and launch it from the left side icon.\nIt will work even if your Docker runs on WSL2, as VS Code can easily connect with your Linux.\nDocker - How to stop a container?\nUse the following command:\n$ docker stop <container_id>", "section": "Module 1: Docker and Terraform", "question": "Docker - Connecting from VS Code" }, { "text": "When you see this in logs, your container with postgres is not accepting any requests, so if you attempt to connect, you'll get this error:\nconnection failed: server closed the connection unexpectedly\nThis probably means the server terminated abnormally before or while processing the request.\nIn this case, you need to delete the directory with data (the one you map to the container with the -v flag) and restart the container.", "section": "Module 1: Docker and Terraform", "question": "Docker - PostgreSQL Database directory appears to contain a database. 
Database system is shut down" }, { "text": "On few versions of Ubuntu, snap command can be used to install Docker.\nsudo snap install docker", "section": "Module 1: Docker and Terraform", "question": "Docker not installable on Ubuntu" }, { "text": "error: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted volume\nif you have used the prev answer (just before this) and have created a local docker volume, then you need to tell the compose file about the named volume:\nvolumes:\ndtc_postgres_volume_local: # Define the named volume here\n# services mentioned in the compose file auto become part of the same network!\nservices:\nyour remaining code here . . .\nnow use docker volume inspect dtc_postgres_volume_local to see the location by checking the value of Mountpoint\nIn my case, after i ran docker compose up the mounting dir created was named \u2018docker_sql_dtc_postgres_volume_local\u2019 whereas it should have used the already existing \u2018dtc_postgres_volume_local\u2019\nAll i did to fix this is that I renamed the existing \u2018dtc_postgres_volume_local\u2019 to \u2018docker_sql_dtc_postgres_volume_local\u2019 and removed the newly created one (just be careful when doing this)\nrun docker compose up again and check if the table is there or not!", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - mounting error" }, { "text": "Couldn\u2019t translate host name to address\nMake sure postgres database is running.\n\n\u200b\u200bUse the command to start containers in detached mode: docker-compose up -d\n(data-engineering-zoomcamp) hw % docker compose up -d\n[+] Running 2/2\n\u283f Container pg-admin Started 0.6s\n\u283f Container pg-database Started\nTo view the containers use: docker ps.\n(data-engineering-zoomcamp) hw % docker ps\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\nfaf05090972e postgres:13 \"docker-entrypoint.s\u2026\" 39 seconds ago Up 37 seconds 0.0.0.0:5432->5432/tcp pg-database\n6344dcecd58f dpage/pgadmin4 \"/entrypoint.sh\" 39 seconds ago Up 37 seconds 443/tcp, 0.0.0.0:8080->80/tcp pg-admin\nhw\nTo view logs for a container: docker logs <containerid>\n(data-engineering-zoomcamp) hw % docker logs faf05090972e\nPostgreSQL Database directory appears to contain a database; Skipping initialization\n2022-01-25 05:58:45.948 UTC [1] LOG: starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit\n2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv4 address \"0.0.0.0\", port 5432\n2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv6 address \"::\", port 5432\n2022-01-25 05:58:45.954 UTC [1] LOG: listening on Unix socket \"/var/run/postgresql/.s.PGSQL.5432\"\n2022-01-25 05:58:45.984 UTC [28] LOG: database system was interrupted; last known up at 2022-01-24 17:48:35 UTC\n2022-01-25 05:58:48.581 UTC [28] LOG: database system was not properly shut down; automatic recovery in\nprogress\n2022-01-25 05:58:48.602 UTC [28] LOG: redo starts at 0/872A5910\n2022-01-25 05:59:33.726 UTC [28] LOG: invalid record length at 0/98A3C160: wanted 24, got 0\n2022-01-25 05:59:33.726 UTC [28\n] LOG: redo done at 0/98A3C128\n2022-01-25 05:59:48.051 UTC [1] LOG: database system is ready to accept connections\nIf docker ps doesn\u2019t show pgdatabase running, run: docker ps -a\nThis should show all containers, either running or stopped.\nGet the container id for pgdatabase-1, and run", "section": "Module 1: Docker and Terraform", "question": 
"Docker-Compose - Error translating host name to address" }, { "text": "After executing `docker-compose up` - if you lose database data and are unable to successfully execute your Ingestion script (to re-populate your database) but receive the following error:\nsqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name /data_pgadmin:/var/lib/pgadmin\"pg-database\" to address: Name or service not known\nDocker compose is creating its own default network since it is no longer specified in a docker execution command or file. Docker Compose will emit to logs the new network name. See the logs after executing `docker compose up` to find the network name and change the network name argument in your Ingestion script.\nIf problems persist with pgcli, we can use HeidiSQL,usql\nKrishna Anand", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Data retention (could not translate host name \"pg-database\" to address: Name or service not known)" }, { "text": "It returns --> Error response from daemon: network 66ae65944d643fdebbc89bd0329f1409dec2c9e12248052f5f4c4be7d1bdc6a3 not found\nTry:\ndocker ps -a to see all the stopped & running containers\nd to nuke all the containers\nTry: docker-compose up -d again ports\nOn localhost:8080 server \u2192 Unable to connect to server: could not translate host name 'pg-database' to address: Name does not resolve\nTry: new host name, best without \u201c - \u201d e.g. pgdatabase\nAnd on docker-compose.yml, should specify docker network & specify the same network in both containers\nservices:\npgdatabase:\nimage: postgres:13\nenvironment:\n- POSTGRES_USER=root\n- POSTGRES_PASSWORD=root\n- POSTGRES_DB=ny_taxi\nvolumes:\n- \"./ny_taxi_postgres_data:/var/lib/postgresql/data:rw\"\nports:\n- \"5431:5432\"\nnetworks:\n- pg-network\npgadmin:\nimage: dpage/pgadmin4\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=root\nports:\n- \"8080:80\"\nnetworks:\n- pg-network\nnetworks:\npg-network:\nname: pg-network", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Hostname does not resolve" }, { "text": "So one common issue is when you run docker-compose on GCP, postgres won\u2019t persist it\u2019s data to mentioned path for example:\nservices:\n\u2026\n\u2026\npgadmin:\n\u2026\n\u2026\nVolumes:\n\u201c./pgadmin\u201d:/var/lib/pgadmin:wr\u201d\nMight not work so in this use you can use Docker Volume to make it persist, by simply changing\nservices:\n\u2026\n\u2026.\npgadmin:\n\u2026\n\u2026\nVolumes:\npgadmin:/var/lib/pgadmin\nvolumes:\nPgadmin:", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Persist PGAdmin docker contents on GCP" }, { "text": "The docker will keep on crashing continuously\nNot working after restart\ndocker engine stopped\nAnd failed to fetch extensions pop ups will on screen non-stop\nSolution :\nTry checking if latest version of docker is installed / Try updating the docker\nIf Problem still persist then final solution is to reinstall docker\n(Just have to fetch images again else no issues)", "section": "Module 1: Docker and Terraform", "question": "Docker engine stopped_failed to fetch extensions" }, { "text": "As per the lessons,\nPersisting pgAdmin configuration (i.e. 
server name) is done by adding a \u201cvolumes\u201d section:\nservices:\npgdatabase:\n[...]\npgadmin:\nimage: dpage/pgadmin4\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=root\nvolumes:\n- \"./pgAdmin_data:/var/lib/pgadmin/sessions:rw\"\nports:\n- \"8080:80\"\nIn the example above, \u201dpgAdmin_data\u201d is a folder on the host machine, and \u201c/var/lib/pgadmin/sessions\u201d is the session settings folder in the pgAdmin container.\nBefore running docker-compose up on the YAML file, we also need to give the pgAdmin container access to write to the \u201cpgAdmin_data\u201d folder. The container runs with a username called \u201c5050\u201d and user group \u201c5050\u201d. The bash command to give access over the mounted volume is:\nsudo chown -R 5050:5050 pgAdmin_data", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Persist PGAdmin configuration" }, { "text": "This happens if you did not create the docker group and added your user. Follow these steps from the link:\nguides/docker-without-sudo.md at main \u00b7 sindresorhus/guides \u00b7 GitHub\nAnd then press ctrl+D to log-out and log-in again. pgAdmin: Maintain state so that it remembers your previous connection\nIf you are tired of having to setup your database connection each time that you fire up the containers, all you have to do is create a volume for pgAdmin:\nIn your docker-compose.yaml file, enter the following into your pgAdmin declaration:\nvolumes:\n- type: volume\nsource: pgadmin_data\ntarget: /var/lib/pgadmin\nAlso add the following to the end of the file:ls\nvolumes:\nPgadmin_data:", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - dial unix /var/run/docker.sock: connect: permission denied" }, { "text": "This is happen to me after following 1.4.1 video where we are installing docker compose in our Google Cloud VM. In my case, the docker-compose file downloaded from github named docker-compose-linux-x86_64 while it is more convenient to use docker-compose command instead. So just change the docker-compose-linux-x86_64 into docker-compose.", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - docker-compose still not available after changing .bashrc" }, { "text": "Installing pass via \u2018sudo apt install pass\u2019 helped to solve the issue. 
More about this can be found here: https://github.com/moby/buildkit/issues/1078", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Error getting credentials after running docker-compose up -d" }, { "text": "For everyone who's having problem with Docker compose, getting the data in postgres and similar issues, please take care of the following:\ncreate a new volume on docker (either using the command line or docker desktop app)\nmake the following changes to your docker-compose.yml file (see attachment)\nset low_memory=false when importing the csv file (df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=1000, low_memory=False))\nuse the below function (in the upload-data.ipynb) for better tracking of your ingestion process (see attachment)\nOrder of execution:\n(1) open terminal in 2_docker_sql folder and run docker compose up\n(2) ensure no other containers are running except the one you just executed (pgadmin and pgdatabase)\n(3) open jupyter notebook and begin the data ingestion\n(4) open pgadmin and set up a server (make sure you use the same configurations as your docker-compose.yml file like the same name (pgdatabase), port, databasename (ny_taxi) etc.", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Errors pertaining to docker-compose.yml and pgadmin setup" }, { "text": "Locate config.json file for docker (check your home directory; Users/username/.docker).\nModify credsStore to credStore\nSave and re-run", "section": "Module 1: Docker and Terraform", "question": "Docker Compose up -d error getting credentials - err: exec: \"docker-credential-desktop\": executable file not found in %PATH%, out: ``" }, { "text": "To figure out which docker-compose you need to download from https://github.com/docker/compose/releases you can check your system with these commands:\nuname -s -> return Linux most likely\nuname -m -> return \"flavor\"\nOr try this command -\nsudo curl -L \"https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)\" -o /usr/local/bin/docker-compose", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Which docker-compose binary to use for WSL?" }, { "text": "If you wrote the docker-compose.yaml file exactly like the video, you might run into an error like this:dev\nservice \"pgdatabase\" refers to undefined volume dtc_postgres_volume_local: invalid compose project\nIn order to make it work, you need to include the volume in your docker-compose file. Just add the following:\nvolumes:\ndtc_postgres_volume_local:\n(Make sure volumes are at the same level as services.)", "section": "Module 1: Docker and Terraform", "question": "Docker-Compose - Error undefined volume in Windows/WSL" }, { "text": "Error: initdb: error: could not change permissions of directory\nIssue: WSL and Windows do not manage permissions in the same way causing conflict if using the Windows file system rather than the WSL file system.\nSolution: Use Docker volumes.\nWhy: Volume is used for storage of persistent data and not for use of transferring files. 
A local volume is unnecessary.\nBenefit: This resolves permission issues and allows for better management of volumes.\nNOTE: the \u2018user:\u2019 key is not necessary if using docker volumes, but it is if using a local drive.\n</> docker-compose.yaml\nservices:\npostgres:\nimage: postgres:15-alpine\ncontainer_name: postgres\nuser: \"0:0\"\nenvironment:\n- POSTGRES_USER=postgres\n- POSTGRES_PASSWORD=postgres\n- POSTGRES_DB=ny_taxi\nvolumes:\n- \"pg-data:/var/lib/postgresql/data\"\nports:\n- \"5432:5432\"\nnetworks:\n- pg-network\npgadmin:\nimage: dpage/pgadmin4\ncontainer_name: pgadmin\nuser: \"${UID}:${GID}\"\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=email@some-site.com\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\nvolumes:\n- \"pg-admin:/var/lib/pgadmin\"\nports:\n- \"8080:80\"\nnetworks:\n- pg-network\nnetworks:\npg-network:\nname: pg-network\nvolumes:\npg-data:\nname: ingest_pgdata\npg-admin:\nname: ingest_pgadmin", "section": "Module 1: Docker and Terraform", "question": "WSL Docker directory permissions error" }, { "text": "Cause: If running on Git Bash or a VM in Windows, pgAdmin doesn't work easily; libraries like psycopg2 and libpq are required and still the error persists.\nSolution: I use psql instead of pgAdmin; it queries Postgres just the same.\npip install psycopg2", "section": "Module 1: Docker and Terraform", "question": "Docker - If pgadmin is not working for Querying in Postgres Use PSQL" }, { "text": "Cause:\nIt happens because the apps are not updated. To be specific, search for any pending updates for Windows Terminal, WSL and Windows Security updates.\nSolution\nFor updating Windows Terminal, which worked for me:\nGo to Microsoft Store.\nGo to the library of apps installed in your system.\nSearch for Windows Terminal.\nUpdate the app and restart your system to see the changes.\nFor the Windows security updates:\nGo to Windows updates and check if there are any pending updates from Windows, especially security updates.\nRestart your system once the updates are downloaded and installed successfully.", "section": "Module 1: Docker and Terraform", "question": "WSL - Insufficient system resources exist to complete the requested service." }, { "text": "Upon restarting, the same issue appears. Happens out of the blue on Windows.\nSolution 1: Fixing the DNS issue (credit: reddit); this worked for me personally\nreg add \"HKLM\\\System\\\CurrentControlSet\\\Services\\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"4\" /f\nRestart your computer and then enable it again with the following\nreg add \"HKLM\\\System\\\CurrentControlSet\\\Services\\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"2\" /f\nRestart your OS again. It should work.\nSolution 2: right click on the running Docker icon (next to the clock) and choose \"Switch to Linux containers\"\nbash: conda: command not found\nDatabase is uninitialized and superuser password is not specified.", "section": "Module 1: Docker and Terraform", "question": "WSL - WSL integration with distro Ubuntu unexpectedly stopped with exit code 1." }, { "text": "Issue when trying to run the GPC VM through SSH through WSL2, probably because WSL2 isn\u2019t looking for .ssh keys in the correct folder. 
In my case, I was trying to run this command in the terminal and getting an error\nPC:/mnt/c/Users/User/.ssh$ ssh -i gpc [username]@[my external IP]\nYou can try to use sudo before the command\nsudo ssh -i gpc [username]@[my external IP]\nYou can also try to cd to your folder and change the permissions for the private key SSH file.\nchmod 600 gpc\nIf that doesn\u2019t work, create a .ssh folder in the home directory of WSL2 and copy the content of the Windows .ssh folder to that new folder.\ncd ~\nmkdir .ssh\ncp -r /mnt/c/Users/YourUsername/.ssh/* ~/.ssh/\nYou might need to adjust the permissions of the files and folders in the .ssh directory.", "section": "Module 1: Docker and Terraform", "question": "WSL - Permissions too open at Windows" }, { "text": "As with the issue above, WSL2 may not be referencing the correct .ssh/config path from Windows. You can create a config file at the home directory of WSL2.\ncd ~\nmkdir .ssh\nCreate a config file in this new .ssh/ folder referencing this folder:\nHostName [GPC VM external IP]\nUser [username]\nIdentityFile ~/.ssh/[private key]", "section": "Module 1: Docker and Terraform", "question": "WSL - Could not resolve host name" }, { "text": "Connect to 127.0.0.1 explicitly instead of localhost:\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi", "section": "Module 1: Docker and Terraform", "question": "PGCLI - connection failed: :1), port 5432 failed: could not receive data from server: Connection refused could not send SSL negotiation packet: Connection refused" }, { "text": "probably some installation error, check out sy", "section": "Module 1: Docker and Terraform", "question": "PGCLI --help error" }, { "text": "In this section of the course, the 5432 port of pgsql is mapped to your computer\u2019s 5432 port, which means you can access the postgres database via pgcli directly from your computer.\nSo no, you don\u2019t need to run it inside another container. Your local system will do.", "section": "Module 1: Docker and Terraform", "question": "PGCLI - Should we run pgcli inside another docker container?" }, { "text": "FATAL: password authentication failed for user \"root\"\nObservation: do not forget the ny_taxi_postgres_data folder that was created.\nThis happens if you have a local Postgres installation on your computer. To mitigate this, use a different port, like 5431, when creating the docker container, as in: -p 5431:5432\nThen, we need to use this port when connecting to pgcli, as shown below:\npgcli -h localhost -p 5431 -u root -d ny_taxi\nThis will connect you to your postgres docker container, which is mapped to your host\u2019s 5431 port (though you might choose any port of your liking as long as it is not occupied).\nFor a more visual and detailed explanation, feel free to check the video 1.4.2 - Port Mapping and Networks in Docker\nIf you want to debug, the following can help (on macOS).\nTo find out if something is blocking your port (on macOS):\nYou can use the lsof command to find out which application is using a specific port on your local machine. 
`lsof -i :5432`\nOr list the running postgres services on your local machine with launchctl\nTo unload the running service on your local machine (on macOS):\nunload the launch agent for the PostgreSQL service, which will stop the service and free up the port\n`launchctl unload -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\nand this one to start it again\n`launchctl load -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\nChanging the port mapping from 5432:5432 to 5431:5432 helped me to avoid this error.", "section": "Module 1: Docker and Terraform", "question": "PGCLI - FATAL: password authentication failed for user \"root\" (You already have Postgres)" }, { "text": "I get this error\npgcli -h localhost -p 5432 -U root -d ny_taxi\nTraceback (most recent call last):\nFile \"/opt/anaconda3/bin/pgcli\", line 8, in <module>\nsys.exit(cli())\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1128, in __call__\nreturn self.main(*args, **kwargs)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1053, in main\nrv = self.invoke(ctx)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1395, in invoke\nreturn ctx.invoke(self.callback, **ctx.params)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 754, in invoke\nreturn __callback(*args, **kwargs)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/pgcli/main.py\", line 880, in cli\nos.makedirs(config_dir)\nFile \"/opt/anaconda3/lib/python3.9/os.py\", line 225, in makedirs\nmkdir(name, mode)\nPermissionError: [Errno 13] Permission denied: '/Users/vray/.config/pgcli'\nMake sure you install pgcli without sudo.\nThe recommended approach is to use conda/anaconda to make sure your system python is not affected.\nIf conda install gets stuck at \"Solving environment\" try these alternatives: https://stackoverflow.com/questions/63734508/stuck-at-solving-environment-on-anaconda", "section": "Module 1: Docker and Terraform", "question": "PGCLI - PermissionError: [Errno 13] Permission denied: '/some/path/.config/pgcli'" }, { "text": "ImportError: no pq wrapper available.\nAttempts made:\n- couldn't import psycopg 'c' implementation: No module named 'psycopg_c'\n- couldn't import psycopg 'binary' implementation: No module named 'psycopg_binary'\n- couldn't import psycopg 'python' implementation: libpq library not found\nSolution:\nFirst, make sure your Python is at least 3.9.\nThe reason for that is we have had cases of 'psycopg2-binary' failing to install because of an old version of Python (3.7.3).\n0. You can check your current python version with:\n$ python -V (the V must be capital)\n1. Based on the previous output, if you've got 3.9, skip to Step #2.\nOtherwise, you're better off with a new environment with 3.9:\n$ conda create --name de-zoomcamp python=3.9\n$ conda activate de-zoomcamp\n2. Next, you should be able to install the lib for postgres like this:\n```\n$ pip install psycopg2_binary\n```\n3. Finally, make sure you also install pgcli (use conda for that), and then connect:\n```\n$ pgcli -h localhost -U root -d ny_taxi\n```\nThere, you should be good to go now!\nAnother solution:\nRun this\npip install \"psycopg[binary,pool]\"", "section": "Module 1: Docker and Terraform", "question": "PGCLI - no pq wrapper available." 
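A quick way to confirm the driver is actually importable (a minimal check, not part of the original answer):\npython -c \"import psycopg2; print(psycopg2.__version__)\"\nor, if you installed psycopg 3:\npython -c \"import psycopg; print(psycopg.__version__)\"\nIf either prints a version number, the libpq wrapper is available and pgcli should be able to connect.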
}, { "text": "If your Bash prompt is stuck on the password prompt for postgres, use winpty:\nwinpty pgcli -h localhost -p 5432 -u root -d ny_taxi\nAlternatively, try using Windows Terminal or the terminal in VS Code.\nPGCLI - connection failed: FATAL: password authentication failed for user \"root\"\nThe error above was faced continually despite inputting the correct password.\nSolution\nOption 1: Stop the PostgreSQL service on Windows\nOption 2 (using WSL): Completely uninstall Postgres 12 from Windows and install postgresql-client on WSL (sudo apt install postgresql-client-common postgresql-client libpq-dev)\nOption 3: Change the port of the docker container\nNEW SOLUTION: 27/01/2024\nPGCLI - connection failed: FATAL: password authentication failed for user \"root\"\nIf you\u2019ve got the error above, it\u2019s probably because you, just like me, closed the connection to the postgres:13 container in the previous step of the tutorial, which is\n\ndocker run -it \\\n-e POSTGRES_USER=root \\\n-e POSTGRES_PASSWORD=root \\\n-e POSTGRES_DB=ny_taxi \\\n-v d:/git/data-engineering-zoomcamp/week_1/docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nSo keep the database connected and you will be able to implement all the next steps of the tutorial.", "section": "Module 1: Docker and Terraform", "question": "PGCLI - stuck on password prompt" }, { "text": "Problem: If you have already installed pgcli but bash doesn't recognize pgcli\nOn Git bash: bash: pgcli: command not found\nOn Windows Terminal: pgcli: The term 'pgcli' is not recognized\u2026\nSolution: Try adding the Python path C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts to the Windows PATH\nFor details:\n1. Get the location: pip list -v\n2. Copy C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\site-packages\n3. Replace site-packages with Scripts: C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts\nIt can also be that you have Python installed elsewhere.\nFor me it was under c:\\python310\\lib\\site-packages\nSo I had to add c:\\python310\\lib\\Scripts to PATH, as shown below.\nPut the above path in \"Path\" (or \"PATH\") in System Variables\nReference: https://stackoverflow.com/a/68233660", "section": "Module 1: Docker and Terraform", "question": "PGCLI - pgcli: command not found" }, { "text": "In case running pgcli locally causes issues or you do not want to install it locally, you can run it in a Docker container instead.\nBelow is the usage with the values used in the videos of the course for:\nnetwork name (docker network)\npostgres related variables for pgcli\nHostname\nUsername\nPort\nDatabase name\n$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1\n175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\nPassword for root:\nServer: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\nVersion: 4.0.1\nHome: http://pgcli.com\nroot@pg-database:ny_taxi> \\dt\n+--------+------------------+-------+-------+\n| Schema | Name | Type | Owner |\n|--------+------------------+-------+-------|\n| public | yellow_taxi_data | table | root |\n+--------+------------------+-------+-------+\nSELECT 1\nTime: 0.009s\nroot@pg-database:ny_taxi>", "section": "Module 1: Docker and Terraform", "question": "PGCLI - running in a Docker container" }, { "text": "PULocationID will not be recognized but \u201cPULocationID\u201d will be. This is because unquoted identifiers are case insensitive. 
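A minimal illustration, assuming the yellow_taxi_data table from the course videos:\nSELECT \"PULocationID\" FROM yellow_taxi_data LIMIT 5; -- works\nSELECT PULocationID FROM yellow_taxi_data LIMIT 5; -- fails, because the unquoted name is folded to pulocationid, which does not exist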
See docs.", "section": "Module 1: Docker and Terraform", "question": "PGCLI - case sensitive: use \u201cquotations\u201d around columns with capital letters" }, { "text": "When using the command `\\d <database name>` you get the error column `c.relhasoids does not exist`.\nResolution:\nUninstall pgcli\nReinstall pgcli\nRestart pc", "section": "Module 1: Docker and Terraform", "question": "PGCLI - error column c.relhasoids does not exist" }, { "text": "This happens while uploading data via the connection in jupyter notebook\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\nThe port 5432 was taken by another postgres. We are not connecting to the port in docker, but to the port on our machine. Substitute 5431 or whatever port you mapped to for port 5432.\nAlso, if this error persists, check whether you have a Windows service running postgres; stopping that service will resolve the issue.", "section": "Module 1: Docker and Terraform", "question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: password authentication failed for user \"root\"" }, { "text": "Can happen when connecting via pgcli\npgcli -h localhost -p 5432 -U root -d ny_taxi\nOr while uploading data via the connection in jupyter notebook\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\nThis can happen when Postgres is already installed on your computer. Changing the port can resolve that (e.g. from 5432 to 5431).\nTo check whether there even is a root user with the ability to login:\nTry: docker exec -it <your_container_name> /bin/bash\nAnd then run\n???\nAlso, you could change the port mapping from 5432:5432 to 5431:5432\nOther solution that worked:\nChanging `POSTGRES_USER=juroot` to `PGUSER=postgres`\nBased on this: postgres with docker compose gives FATAL: role \"root\" does not exist error - Stack Overflow\nAlso `docker compose down`, removing the folder that had the postgres volume, and running `docker compose up` again.", "section": "Module 1: Docker and Terraform", "question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: role \"root\" does not exist" }, { "text": "~\\anaconda3\\lib\\site-packages\\psycopg2\\__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)\n120\n121 dsn = _ext.make_dsn(dsn, **kwargs)\n--> 122 conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\n123 if cursor_factory is not None:\n124 conn.cursor_factory = cursor_factory\nOperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: database \"ny_taxi\" does not exist\nMake sure postgres is running. 
You can check that by running `docker ps`\n\u2705Solution: If you have postgres software installed on your computer before now, build your instance on a different port like 8080 instead of 5432", "section": "Module 1: Docker and Terraform", "question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: database \"ny_taxi\" does not exist" }, { "text": "Issue:\nModuleNotFoundError: No module named 'psycopg2'\nSolution:\npip install psycopg2-binary\nIf you already have it, you might need to update it:\npip install psycopg2-binary --upgrade\nOther methods, if the above fails:\nIf you are getting the \u201cModuleNotFoundError: No module named 'psycopg2'\u201d error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or, if you are using pip, try updating it before installing the psycopg packages, i.e.:\nFirst uninstall the psycopg package\nThen update conda or pip\nThen install psycopg again using pip.\nIf you are still facing an error with psycopg2 and it shows pg_config not found, then you will have to install postgresql. On macOS it is brew install postgresql", "section": "Module 1: Docker and Terraform", "question": "Postgres - ModuleNotFoundError: No module named 'psycopg2'" }, { "text": "In join queries, if we mention the column name directly or enclosed in single quotes, it\u2019ll throw an error saying \u201ccolumn does not exist\u201d.\n\u2705Solution: Enclose the column names in double quotes and it will work.", "section": "Module 1: Docker and Terraform", "question": "Postgres - \"Column does not exist\" but it actually does (Psycopg2 error on MacBook Pro M2)" }, { "text": "pgAdmin has a new version. The Create Server dialog may not appear. Try using Register -> Server instead.", "section": "Module 1: Docker and Terraform", "question": "pgAdmin - Create server dialog does not appear" }, { "text": "Using GitHub Codespaces in the browser resulted in a blank screen after the login to pgAdmin (running in a Docker container). The terminal of the pgAdmin container was showing the following error message:\nCSRFError: 400 Bad Request: The referrer does not match the host.\nSolution #1:\nAs recommended in the following issue https://github.com/pgadmin-org/pgadmin4/issues/5432, setting the following environment variable solved it.\nPGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\"\nModified \u201cdocker run\u201d command\ndocker run --rm -it \\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n-e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\n-p \"8080:80\" \\\n--name pgadmin \\\n--network=pg-network \\\ndpage/pgadmin4:8.2\nSolution #2:\nUsing the locally installed VSCode to display GitHub Codespaces.\nWhen using GitHub Codespaces in the locally installed VSCode (opening a Codespace or creating/starting one) this issue did not occur.", "section": "Module 1: Docker and Terraform", "question": "pgAdmin - Blank/white screen after login (browser)" }, { "text": "I am using a Mac Pro device and connect to the GCP Compute Engine via Remote SSH - VSCode. But when I try to run the pgAdmin container via a docker run or docker compose command, I fail to access the pgAdmin address via my browser. I have switched to another browser, but still can not access the pgAdmin address. 
So I modified the configuration from the previous DE Zoomcamp repository a little, like below, and can now access the pgAdmin address:\nSolution #1:\nModified \u201cdocker run\u201d command\ndocker run --rm -it \\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n-e PGADMIN_DEFAULT_PASSWORD=\"pgadmin\" \\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\n-e PGADMIN_LISTEN_ADDRESS=0.0.0.0 \\\n-e PGADMIN_LISTEN_PORT=5050 \\\n-p 5050:5050 \\\n--network=de-zoomcamp-network \\\n--name pgadmin-container \\\n--link postgres-container \\\n-t dpage/pgadmin4\nSolution #2:\nModified docker-compose.yaml configuration (via \u201cdocker compose up\u201d command)\npgadmin:\nimage: dpage/pgadmin4\ncontainer_name: pgadmin-container\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\n- PGADMIN_CONFIG_WTF_CSRF_ENABLED=False\n- PGADMIN_LISTEN_ADDRESS=0.0.0.0\n- PGADMIN_LISTEN_PORT=5050\nvolumes:\n- \"./pgadmin_data:/var/lib/pgadmin/data\"\nports:\n- \"5050:5050\"\nnetworks:\n- de-zoomcamp-network\ndepends_on:\n- postgres-container\nPython - ModuleNotFoundError: No module named 'pysqlite2'\nImportError: DLL load failed while importing _sqlite3: The specified module could not be found. ModuleNotFoundError: No module named 'pysqlite2'\nThe issue seems to arise from a missing sqlite3.dll in the path \".\\Anaconda\\Dlls\\\".\n\u2705I solved it by simply copying that .dll file from \\Anaconda3\\Library\\bin and putting it under the path mentioned above (if you are using Anaconda).", "section": "Module 1: Docker and Terraform", "question": "pgAdmin - Can not access/open the PgAdmin address via browser" }, { "text": "If you follow the video 1.2.2 - Ingesting NY Taxi Data to Postgres and you execute all the same steps as Alexey does, you will ingest all the data (~1.3 million rows) into the table yellow_taxi_data as expected.\nHowever, if you try to run the whole script in the Jupyter notebook for a second time from top to bottom, you will be missing the first chunk of 100000 records. This is because there is a call to the iterator before the while loop that puts the data in the table. The while loop therefore starts by ingesting the second chunk, not the first.\n\u2705Solution: remove the cell \u201cdf=next(df_iter)\u201d that appears higher up in the notebook than the while loop. The first time next(df_iter) is called should be within the while loop.\n\ud83d\udcd4Note: As this notebook is just used as a way to test the code, it was not intended to be run top to bottom, and the logic is tidied up in a later step when it is instead inserted into a .py file for the pipeline", "section": "Module 1: Docker and Terraform", "question": "Python - Ingestion with Jupyter notebook - missing 100000 records" }, { "text": "Pandas can read a gzipped csv directly, so the ingestion script does not need to unzip the file first:\nimport pandas as pd\ndf = pd.read_csv('path/to/file.csv.gz')\nIf you prefer to keep the uncompressed csv (easier preview in vscode and similar), gzip files can be unzipped using gunzip (but not unzip). On a Ubuntu local or virtual machine, you may need to apt-get install gunzip first.", "section": "Module 1: Docker and Terraform", "question": "Python - Iteration csv without error" }, { "text": "Pandas can interpret \u201cstring\u201d column values as \u201cdatetime\u201d directly when reading the CSV file using \u201cpd.read_csv\u201d using the parameter \u201cparse_dates\u201d, which for example can contain a list of column names or column indices. 
Then the conversion afterwards is not required anymore.\npandas.read_csv \u2014 pandas 2.1.4 documentation (pydata.org)\nExample from week 1\nimport pandas as pd\ndf = pd.read_csv(\n'yellow_tripdata_2021-01.csv',\nnrows=100,\nparse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\ndf.info()\nwhich will output\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 100 entries, 0 to 99\nData columns (total 18 columns):\n# Column Non-Null Count Dtype\n--- ------ -------------- -----\n0 VendorID 100 non-null int64\n1 tpep_pickup_datetime 100 non-null datetime64[ns]\n2 tpep_dropoff_datetime 100 non-null datetime64[ns]\n3 passenger_count 100 non-null int64\n4 trip_distance 100 non-null float64\n5 RatecodeID 100 non-null int64\n6 store_and_fwd_flag 100 non-null object\n7 PULocationID 100 non-null int64\n8 DOLocationID 100 non-null int64\n9 payment_type 100 non-null int64\n10 fare_amount 100 non-null float64\n11 extra 100 non-null float64\n12 mta_tax 100 non-null float64\n13 tip_amount 100 non-null float64\n14 tolls_amount 100 non-null float64\n15 improvement_surcharge 100 non-null float64\n16 total_amount 100 non-null float64\n17 congestion_surcharge 100 non-null float64\ndtypes: datetime64[ns](2), float64(9), int64(6), object(1)\nmemory usage: 14.2+ KB", "section": "Module 1: Docker and Terraform", "question": "iPython - Pandas parsing dates with \u2018read_csv\u2019" }, { "text": "os.system(f\"curl -LO {url} -o {csv_name}\")", "section": "Module 1: Docker and Terraform", "question": "Python - Python cant ingest data from the github link provided using curl" }, { "text": "When a CSV file is compressed using Gzip, it is saved with a \".csv.gz\" file extension. This file type is also known as a Gzip compressed CSV file. When you want to read a Gzip compressed CSV file using Pandas, you can use the read_csv() function, which is specifically designed to read CSV files. The read_csv() function accepts several parameters, including a file path or a file-like object. To read a Gzip compressed CSV file, you can pass the file path of the \".csv.gz\" file as an argument to the read_csv() function.\nHere is an example of how to read a Gzip compressed CSV file using Pandas:\ndf = pd.read_csv('file.csv.gz'\n, compression='gzip'\n, low_memory=False\n)", "section": "Module 1: Docker and Terraform", "question": "Python - Pandas can read *.csv.gzip" }, { "text": "Contrary to panda\u2019s read_csv method there\u2019s no such easy way to iterate through and set chunksize for parquet files. 
We can use PyArrow (Apache Arrow Python bindings) to resolve that.\nimport pyarrow.parquet as pq\noutput_name = \u201chttps://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet\u201d\nparquet_file = pq.ParquetFile(output_name)\nparquet_size = parquet_file.metadata.num_rows\nengine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')\ntable_name=\u201dyellow_taxi_schema\u201d\n# Clear table if exists\npq.read_table(output_name).to_pandas().head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')\n# default (and max) batch size\nindex = 65536\nfor i in parquet_file.iter_batches(use_threads=True):\nt_start = time()\nprint(f'Ingesting {index} out of {parquet_size} rows ({index / parquet_size:.0%})')\ni.to_pandas().to_sql(name=table_name, con=engine, if_exists='append')\nindex += 65536\nt_end = time()\nprint(f'\\t- it took %.1f seconds' % (t_end - t_start))", "section": "Module 1: Docker and Terraform", "question": "Python - How to iterate through and ingest parquet file" }, { "text": "Error raised during the jupyter notebook\u2019s cell execution:\nfrom sqlalchemy import create_engine.\nSolution: Version of Python module \u201ctyping_extensions\u201d >= 4.6.0. Can be updated by Conda or pip.", "section": "Module 1: Docker and Terraform", "question": "Python - SQLAlchemy - ImportError: cannot import name 'TypeAliasType' from 'typing_extensions'." }, { "text": "create_engine('postgresql://root:root@localhost:5432/ny_taxi') I get the error \"TypeError: 'module' object is not callable\"\nSolution:\nconn_string = \"postgresql+psycopg://root:root@localhost:5432/ny_taxi\"\nengine = create_engine(conn_string)", "section": "Module 1: Docker and Terraform", "question": "Python - SQLALchemy - TypeError 'module' object is not callable" }, { "text": "Error raised during the jupyter notebook\u2019s cell execution:\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi').\nSolution: Need to install Python module \u201cpsycopg2\u201d. Can be installed by Conda or pip.", "section": "Module 1: Docker and Terraform", "question": "Python - SQLAlchemy - ModuleNotFoundError: No module named 'psycopg2'." }, { "text": "Unable to add Google Cloud SDK PATH to Windows\nWindows error: The installer is unable to automatically update your system PATH. Please add C:\\tools\\google-cloud-sdk\\bin\nif you are constantly getting this feedback. Might be that you needed to add Gitbash to your Windows path:\nOne way of doing that is to use conda: \u2018If you are not already using it\nDownload the Anaconda Navigator\nMake sure to check the box (add conda to the path when installing navigator: although not recommended do it anyway)\nYou might also need to install git bash if you are not already using it(or you might need to uninstall it to reinstall it properly)\nMake sure to check the following boxes while you install Gitbash\nAdd a GitBash to Windows Terminal\nUse Git and optional Unix tools from the command prompt\nNow open up git bash and type conda init bash This should modify your bash profile\nAdditionally, you might want to use Gitbash as your default terminal.\nOpen your Windows terminal and go to settings, on the default profile change Windows power shell to git bash", "section": "Module 1: Docker and Terraform", "question": "GCP - Unable to add Google Cloud SDK PATH to Windows" }, { "text": "It asked me to create a project. This should be done from the cloud console. 
So maybe we don\u2019t need this FAQ.\nWARNING: Project creation failed: HttpError accessing <https://cloudresourcemanager.googleapis.com/v1/projects?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Mon, 24 Jan 2022 19:29:12 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'server-timing': 'gfet4t7; dur=189', 'alt-svc': 'h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"', 'transfer-encoding': 'chunked', 'status': 409}>, content <{\n\"error\": {\n\"code\": 409,\n\"message\": \"Requested entity already exists\",\n\"status\": \"ALREADY_EXISTS\"\n}\n}\nFrom Stackoverflow: https://stackoverflow.com/questions/52561383/gcloud-cli-cannot-create-project-the-project-id-you-specified-is-already-in-us?rq=1\nProject IDs are unique across all projects. That means if any user ever had a project with that ID, you cannot use it. testproject is pretty common, so it's not surprising it's already taken.", "section": "Module 1: Docker and Terraform", "question": "GCP - Project creation failed: HttpError accessing \u2026 Requested entity already exists" }, { "text": "If you receive the error \u201cError 403: The project to be billed is associated with an absent billing account., accountDisabled\u201d, it is most likely because you did not enter YOUR project ID. The snip below is from video 1.3.2\nThe value you enter here will be unique to each student. You can find this value on your GCP Dashboard when you login.\nAshish Agrawal\nAnother possibility is that you have not linked your billing account to your current project.", "section": "Module 1: Docker and Terraform", "question": "GCP - The project to be billed is associated with an absent billing account" }, { "text": "GCP Account Suspension Inquiry\nIf Google refuses your credit/debit card, try another - I\u2019ve got an issue with Kaspi (Kazakhstan) but it worked with TBC (Georgia).\nUnfortunately, there\u2019s little hope that support will help.\nIt seems that a Pyypl web card should work too.", "section": "Module 1: Docker and Terraform", "question": "GCP - OR-CBAT-15 ERROR Google cloud free trial account" }, { "text": "The ny-rides.json is your private key file in Google Cloud Platform (GCP).\n\nAnd here\u2019s the way to find it:\nGCP -> Select project with your instance -> IAM & Admin -> Service Accounts -> KEYS tab -> add key, JSON as key type, then click create\nNote: Once you go into Service Accounts, click the email, then you can see the \u201cKEYS\u201d tab where you can add a key with JSON as its key type", "section": "Module 1: Docker and Terraform", "question": "GCP - Where can I find the \u201cny-rides.json\u201d file?" }, { "text": "In this lecture, Alexey deleted his instance in Google Cloud. Do I have to do it?\nNope. Do not delete your instance in Google Cloud platform. Otherwise, you have to do this twice for the week 1 readings.", "section": "Module 1: Docker and Terraform", "question": "GCP - Do I need to delete my instance in Google Cloud?" 
}, { "text": "System Resource Usage:\ntop or htop: Shows real-time information about system resource usage, including CPU, memory, and processes.\nfree -h: Displays information about system memory usage and availability.\ndf -h: Shows disk space usage of file systems.\ndu -h <directory>: Displays disk usage of a specific directory.\nRunning Processes:\nps aux: Lists all running processes along with detailed information.\nNetwork:\nifconfig or ip addr show: Shows network interface configuration.\nnetstat -tuln: Displays active network connections and listening ports.\nHardware Information:\nlscpu: Displays CPU information.\nlsblk: Lists block devices (disks and partitions).\nlshw: Lists hardware configuration.\nUser and Permissions:\nwho: Shows who is logged on and their activities.\nw: Displays information about currently logged-in users and their processes.\nPackage Management:\napt list --installed: Lists installed packages (for Ubuntu and Debian-based systems)", "section": "Module 1: Docker and Terraform", "question": "Commands to inspect the health of your VM:" }, { "text": "if you\u2019ve got the error\n\u2502 Error: Error updating Dataset \"projects/<your-project-id>/datasets/demo_dataset\": googleapi: Error 403: Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. The default table expiration time must be less than 60 days, billingNotEnabled\nbut you\u2019ve set your billing account indeed, then try to disable billing for the project and enable it again. It worked for ME!", "section": "Module 1: Docker and Terraform", "question": "Billing account has not been enabled for this project. But you\u2019ve done it indeed!" }, { "text": "for windows if you having trouble install SDK try follow these steps on the link, if you getting this error:\nThese credentials will be used by any library that requests Application Default Credentials (ADC).\nWARNING:\nCannot find a quota project to add to ADC. You might receive a \"quota exceeded\" or \"API not enabled\" error. Run $ gcloud auth application-default set-quota-project to add a quota project.\nFor me:\nI reinstalled the sdk using unzip file \u201cinstall.bat\u201d,\nafter successfully checking gcloud version,\nrun gcloud init to set up project before\nyou run gcloud auth application-default login\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/windows.md\nGCP VM - I cannot get my Virtual Machine to start because GCP has no resources.\nClick on your VM\nCreate an image of your VM\nOn the page of the image, tell GCP to create a new VM instance via the image\nOn the settings page, change the location", "section": "Module 1: Docker and Terraform", "question": "GCP - Windows Google Cloud SDK install issue:gcp" }, { "text": "The reason this video about the GCP VM exists is that many students had problems configuring their env. You can use your own env if it works for you.\nAnd the advantage of using your own environment is that if you are working in a Github repo where you can commit, you will be able to commit the changes that you do. In the VM the repo is cloned via HTTPS so it is not possible to directly commit, even if you are the owner of the repo.", "section": "Module 1: Docker and Terraform", "question": "GCP VM - Is it necessary to use a GCP VM? When is it useful?" 
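If you do want to push from the VM, one option (an extra note, not from the answer above) is to add an SSH key to your GitHub account and point the clone at the SSH remote instead of HTTPS, for example:\ngit remote set-url origin git@github.com:<your-username>/<your-fork>.git\ngit push origin main\nHere <your-username>/<your-fork> is a placeholder for your own fork of the course repo.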
}, { "text": "I am trying to create a directory but it won't let me do it\nUser1@DESKTOP-PD6UM8A MINGW64 /\n$ mkdir .ssh\nmkdir: cannot create directory \u2018.ssh\u2019: Permission denied\nYou should do it in your home directory. Should be your home (~)\nLocal. But it seems you're trying to do it in the root folder (/). Should be your home (~)\nLink to Video 1.4.1", "section": "Module 1: Docker and Terraform", "question": "GCP VM - mkdir: cannot create directory \u2018.ssh\u2019: Permission denied" }, { "text": "Failed to save '<file>': Unable to write file 'vscode-remote://ssh-remote+de-zoomcamp/home/<user>/data_engineering_course/week_2/airflow/dags/<file>' (NoPermissions (FileSystemError): Error: EACCES: permission denied, open '/home/<user>/data_engineering_course/week_2/airflow/dags/<file>')\nYou need to change the owner of the files you are trying to edit via VS Code. You can run the following command to change the ownership.\nssh\nsudo chown -R <user> <path to your directory>", "section": "Module 1: Docker and Terraform", "question": "GCP VM - Error while saving the file in VM via VS Code" }, { "text": "Question: I connected to my VM perfectly fine last week (ssh) but when I tried again this week, the connection request keeps timing out.\n\u2705Answer: Start your VM. Once the VM is running, copy its External IP and paste that into your config file within the ~/.ssh folder.\ncd ~/.ssh\ncode config \u2190 this opens the config file in VSCode", "section": "Module 1: Docker and Terraform", "question": ". GCP VM - VM connection request timeout" }, { "text": "(reference: https://serverfault.com/questions/953290/google-compute-engine-ssh-connect-to-host-ip-port-22-operation-timed-out)Go to edit your VM.\nGo to section Automation\nAdd Startup script\n```\n#!/bin/bash\nsudo ufw allow ssh\n```\nStop and Start VM.", "section": "Module 1: Docker and Terraform", "question": "GCP VM - connect to host port 22 no route to host" }, { "text": "You can easily forward the ports of pgAdmin, postgres and Jupyter Notebook using the built-in tools in Ubuntu and without any additional client:\nFirst, in the VM machine, launch docker-compose up -d and jupyter notebook in the correct folder.\nFrom the local machine, execute: ssh -i ~/.ssh/gcp -L 5432:localhost:5432 username@external_ip_of_vm\nExecute the same command but with ports 8080 and 8888.\nNow you can access pgAdmin on local machine in browser typing localhost:8080\nFor Jupyter Notebook, type localhost:8888 in the browser of your local machine. 
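All three ports can also be forwarded in a single command, for example (same key file and placeholders as in the steps above):\nssh -i ~/.ssh/gcp -L 5432:localhost:5432 -L 8080:localhost:8080 -L 8888:localhost:8888 username@external_ip_of_vm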
If you have problems with the credentials, it is possible that you have to copy the link with the access token provided in the logs of the terminal of the VM machine when you launched the jupyter notebook command.\nTo forward both pgAdmin and postgres use, ssh -i ~/.ssh/gcp -L 5432:localhost:5432 -L 8080:localhost:8080 modito@35.197.218.128", "section": "Module 1: Docker and Terraform", "question": "GCP VM - Port forwarding from GCP without using VS Code" }, { "text": "If you are using MS VS Code and running gcloud in WSL2, when you first try to login to gcp via the gcloud cli gcloud auth application-default login, you will see a message like this, and nothing will happen\nAnd there might be a prompt to ask if you want to open it via browser, if you click on it, it will open up a page with error message\nSolution : you should instead hover on the long link, and ctrl + click the long link\n\nClick configure Trusted Domains here\n\nPopup will appear, pick first or second entry\nNext time you gcloud auth, the login page should popup via default browser without issues", "section": "Module 1: Docker and Terraform", "question": "GCP gcloud + MS VS Code - gcloud auth hangs" }, { "text": "It is an internet connectivity error, terraform is somehow not able to access the online registry. Check your VPN/Firewall settings (or just clear cookies or restart your network). Try terraform init again after this, it should work.", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error: Failed to query available provider packages \u2502 Could not retrieve the list of available versions for provider hashicorp/google: could not query \u2502 provider registry for registry.terrafogorm.io/hashicorp/google: the request failed after 2 attempts, \u2502 please try again later" }, { "text": "The issue was with the network. Google is not accessible in my country, I am using a VPN. And The terminal program does not automatically follow the system proxy and requires separate proxy configuration settings.I opened a Enhanced Mode in Clash, which is a VPN app, and 'terraform apply' works! So if you encounter the same issue, you can ask help for your vpn provider.", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error:Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=coherent-ascent-379901\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp 172.217.163.42:443: i/o timeout" }, { "text": "https://techcommunity.microsoft.com/t5/azure-developer-community-blog/configuring-terraform-on-windows-10-linux-sub-system/ba-p/393845", "section": "Module 1: Docker and Terraform", "question": "Terraform - Install for WSL" }, { "text": "https://github.com/hashicorp/terraform/issues/14513", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error acquiring the state lock" }, { "text": "When running\nterraform apply\non wsl2 I've got this error:\n\u2502 Error: Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=<your-project-id>\": oauth2: cannot fetch token: 400 Bad Request\n\u2502 Response: {\"error\":\"invalid_grant\",\"error_description\":\"Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. 
Check your iat and exp values in the JWT claim.\"}\nIT happens because there may be time desync on your machine which affects computing JWT\nTo fix this, run the command\nsudo hwclock -s\nwhich fixes your system time.\nReference", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error 400 Bad Request. Invalid JWT Token on WSL." }, { "text": "\u2502 Error: googleapi: Error 403: Access denied., forbidden\nYour $GOOGLE_APPLICATION_CREDENTIALS might not be pointing to the correct file \nrun = export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/YOUR_JSON.json\nAnd then = gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error 403 : Access denied" }, { "text": "One service account is enough for all the services/resources you'll use in this course. After you get the file with your credentials and set your environment variable, you should be good to go.", "section": "Module 1: Docker and Terraform", "question": "Terraform - Do I need to make another service account for terraform before I get the keys (.json file)?" }, { "text": "Here: https://releases.hashicorp.com/terraform/1.1.3/terraform_1.1.3_linux_amd64.zip", "section": "Module 1: Docker and Terraform", "question": "Terraform - Where can I find the Terraform 1.1.3 Linux (AMD 64)?" }, { "text": "You get this error because I run the command terraform init outside the working directory, and this is wrong.You need first to navigate to the working directory that contains terraform configuration files, and and then run the command.", "section": "Module 1: Docker and Terraform", "question": "Terraform - Terraform initialized in an empty directory! The directory has no Terraform configuration files. You may begin working with Terraform immediately by creating Terraform configuration files.g" }, { "text": "The error:\nError: googleapi: Error 403: Access denied., forbidden\n\u2502\nand\n\u2502 Error: Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes.\nFor this solution make sure to run:\necho $GOOGLE_APPLICATION_CREDENTIALS\necho $?\nSolution:\nYou have to set again the GOOGLE_APPLICATION_CREDENTIALS as Alexey did in the environment set-up video in week1:\nexport GOOGLE_APPLICATION_CREDENTIALS=\"<path/to/your/service-account-authkeys>.json", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes" }, { "text": "The error:\nError: googleapi: Error 403: terraform-trans-campus@trans-campus-410115.iam.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project. Permission 'storage.buckets.create' denied on resource (or it may not exist)., forbidden\nThe solution:\nYou have to declare the project name as your Project ID, and not your Project name, available on GCP console Dashboard.", "section": "Module 1: Docker and Terraform", "question": "Terraform - Error creating Bucket: googleapi: Error 403: Permission denied to access \u2018storage.buckets.create\u2019" }, { "text": "provider \"google\" {\nproject = var.projectId\ncredentials = file(\"${var.gcpkey}\")\n#region = var.region\nzone = var.zone\n}", "section": "Module 1: Docker and Terraform", "question": "To ensure the sensitivity of the credentials file, I had to spend lot of time to input that as a file." }, { "text": "For the HW1 I encountered this issue. 
The solution is\nSELECT * FROM zones AS z WHERE z.\"Zone\" = 'Astoria Zone';\nI think columns which start with uppercase need to go between \u201cColumn\u201d. I ran into a lot of issues like this and \u201c \u201d made it work out.\nAddition to the above point, for me, there is no \u2018Astoria Zone\u2019, only \u2018Astoria\u2019 is existing in the dataset.\nSELECT * FROM zones AS z WHERE z.\"Zone\" = 'Astoria\u2019;", "section": "Module 1: Docker and Terraform", "question": "SQL - SELECT * FROM zones_taxi WHERE Zone='Astoria Zone'; Error Column Zone doesn't exist" }, { "text": "It is inconvenient to use quotation marks all the time, so it is better to put the data to the database all in lowercase, so in Pandas after\ndf = pd.read_csv(\u2018taxi+_zone_lookup.csv\u2019)\nAdd the row:\ndf.columns = df.columns.str.lower()", "section": "Module 1: Docker and Terraform", "question": "SQL - SELECT Zone FROM taxi_zones Error Column Zone doesn't exist" }, { "text": "Solution (for mac users): os.system(f\"curl {url} --output {csv_name}\")", "section": "Module 1: Docker and Terraform", "question": "CURL - curl: (6) Could not resolve host: output.csv" }, { "text": "To resolve this, ensure that your config file is in C/User/Username/.ssh/config", "section": "Module 1: Docker and Terraform", "question": "SSH Error: ssh: Could not resolve hostname linux: Name or service not known" }, { "text": "If you use Anaconda (recommended for the course), it comes with pip, so the issues is probably that the anaconda\u2019s Python is not on the PATH.\nAdding it to the PATH is different for each operation system.\nFor Linux and MacOS:\nOpen a terminal.\nFind the path to your Anaconda installation. This is typically `~/anaconda3` or `~/opt/anaconda3`.\nAdd Anaconda to your PATH with the command: `export PATH=\"/path/to/anaconda3/bin:$PATH\"`.\nTo make this change permanent, add the command to your `.bashrc` (Linux) or `.bash_profile` (MacOS) file.\nOn Windows, python and pip are in different locations (python is in the anaconda root, and pip is in Scripts). With GitBash:\nLocate your Anaconda installation. The default path is usually `C:\\Users\\[YourUsername]\\Anaconda3`.\nDetermine the correct path format for Git Bash. Paths in Git Bash follow the Unix-style, so convert the Windows path to a Unix-style path. For example, `C:\\Users\\[YourUsername]\\Anaconda3` becomes `/c/Users/[YourUsername]/Anaconda3`.\nAdd Anaconda to your PATH with the command: `export PATH=\"/c/Users/[YourUsername]/Anaconda3/:/c/Users/[YourUsername]/Anaconda3/Scripts/$PATH\"`.\nTo make this change permanent, add the command to your `.bashrc` file in your home directory.\nRefresh your environment with the command: `source ~/.bashrc`.\nFor Windows (without Git Bash):\nRight-click on 'This PC' or 'My Computer' and select 'Properties'.\nClick on 'Advanced system settings'.\nIn the System Properties window, click on 'Environment Variables'.\nIn the Environment Variables window, select the 'Path' variable in the 'System variables' section and click 'Edit'.\nIn the Edit Environment Variable window, click 'New' and add the path to your Anaconda installation (typically `C:\\Users\\[YourUsername]\\Anaconda3` and C:\\Users\\[YourUsername]\\Anaconda3\\Scripts`).\nClick 'OK' in all windows to apply the changes.\nAfter adding Anaconda to the PATH, you should be able to use `pip` from the command line. 
Remember to restart your terminal (or command prompt in Windows) to apply these changes.", "section": "Module 1: Docker and Terraform", "question": "'pip' is not recognized as an internal or external command, operable program or batch file." }, { "text": "Resolution: You need to stop the services which is using the port.\nRun the following:\n```\nsudo kill -9 `sudo lsof -t -i:<port>`\n```\n<port> being 8080 in this case. This will free up the port for use.\n~ Abhijit Chakraborty\nError: error response from daemon: cannot stop container: 1afaf8f7d52277318b71eef8f7a7f238c777045e769dd832426219d6c4b8dfb4: permission denied\nResolution: In my case, I had to stop docker and restart the service to get it running properly\nUse the following command:\n```\nsudo systemctl restart docker.socket docker.service\n```\n~ Abhijit Chakraborty\nError: cannot import module psycopg2\nResolution: Run the following command in linux:\n```\nsudo apt-get install libpq-dev\npip install psycopg2\n```\n~ Abhijit Chakraborty\nError: docker build Error checking context: 'can't stat '<path-to-file>'\nResolution: This happens due to insufficient permission for docker to access a certain file within the directory which hosts the Dockerfile.\n1. You can create a .dockerignore file and add the directory/file which you want Dockerfile to ignore while build.\n2. If the above does not work, then put the dockerfile and corresponding script, `\t1.py` in our case to a subfolder. and run `docker build ...`\nfrom inside the new folder.\n~ Abhijit Chakraborty", "section": "Module 1: Docker and Terraform", "question": "Error: error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use" }, { "text": "To get a pip-friendly requirements.txt file file from Anaconda use\nconda install pip then `pip list \u2013format=freeze > requirements.txt`.\n`conda list -d > requirements.txt` will not work and `pip freeze > requirements.txt` may give odd pathing.", "section": "Module 2: Workflow Orchestration", "question": "Anaconda to PIP" }, { "text": "Prefect: https://docs.google.com/document/d/1K_LJ9RhAORQk3z4Qf_tfGQCDbu8wUWzru62IUscgiGU/edit?usp=sharing\nAirflow: https://docs.google.com/document/d/1-BwPAsyDH_mAsn8HH5z_eNYVyBMAtawJRjHHsjEKHyY/edit?usp=sharing", "section": "Module 2: Workflow Orchestration", "question": "Where are the FAQ questions from the previous cohorts for the orchestration module?" }, { "text": "Issue : Docker containers exit instantly with code 132, upon docker compose up\nMage documentation has it listing the cause as \"older architecture\" .\nThis might be a hardware issue, so unless you have another computer, you can't solve it without purchasing a new one, so the next best solution is a VM.\nThis is from a student running on a VirtualBox VM, Ubuntu 22.04.3 LTS, Docker version 25.0.2. 
So not having the context on how the vbox was spin up with (CPU, RAM, network, etc), it\u2019s really inconclusive at this time.", "section": "Module 2: Workflow Orchestration", "question": "Docker - 2.2.2 Configure Mage" }, { "text": "This issue was occurring with Windows WSL 2\nFor me this was because WSL 2 was not dedicating enough cpu cores to Docker.The load seems to take up at least one cpu core so I recommend dedicating at least two.\nOpen Bash and run the following code:\n$ cd ~\n$ ls -la\nLook for the .wsl config file:\n-rw-r--r-- 1 ~1049089 31 Jan 25 12:54 .wslconfig\nUsing a text editing tool of your choice edit or create your .wslconfig file:\n$ nano .wslconfig\nPaste the following into the new file/ edit the existing file in this format and save:\n*** Note - for memory\u2013 this is the RAM on your machine you can dedicate to Docker, your situation may be different than mine ***\n[wsl2]\nprocessors=<Number of Processors - at least 2!> example: 4\nmemory=<memory> example:4GB\nExample:\nOnce you do that run:\n$ wsl --shutdown\nThis shuts down WSL\nThen Restart Docker Desktop - You should now be able to load the .csv.gz file without the error into a pandas dataframe", "section": "Module 2: Workflow Orchestration", "question": "WSL - 2.2.3 Mage - Unexpected Kernel Restarts; Kernel Running out of memory:" }, { "text": "The issue and solution on the link:\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1706817366764269?thread_ts=1706815324.993529&cid=C01FABYF2RG", "section": "Module 2: Workflow Orchestration", "question": "2.2.3 Configuring Postgres" }, { "text": "Check that the POSTGRES_PORT variable in the io_config.yml file is set to port 5432, which is the default postgres port. The POSTGRES_PORT variable is the mage container port, not the host port. Hence, there\u2019s no need to set the POSTGRES_PORT to 5431 just because you already have a conflicting postgres installation in your host machine.", "section": "Module 2: Workflow Orchestration", "question": "MAGE - 2.2.3 OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5431 failed: Connection refused" }, { "text": "You forgot to select \u2018dev\u2019 profile in the dropdown menu next to where you select \u2018PostgreSQL\u2019 in the connection drop down.", "section": "Module 2: Workflow Orchestration", "question": "MAGE - 2.2.4 executing SELECT 1; results in KeyError" }, { "text": "If you are getting this error. Update your mage io_config.yaml file, and specify a timeout value set to 600 like this.\nMake sure to save your changes.\nMAGE - 2.2.4 Testing BigQuery connection using SQL 404 error:\nNotFound: 404 Not found: Dataset ny-rides-diegogutierrez:None was not found in location northamerica-northeast1\nIf you get this error even with all roles/permissions given to the service account check if you have ticked the box where it says \u201cUse raw SQL\u201d, just like the image below.", "section": "Module 2: Workflow Orchestration", "question": "MAGE -2.2.4 ConnectionError: ('Connection aborted.', TimeoutError('The write operation timed out'))" }, { "text": "Solution: https://stackoverflow.com/questions/48056381/google-client-invalid-jwt-token-must-be-a-short-lived-token", "section": "Module 2: Workflow Orchestration", "question": "Problem: RefreshError: ('invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. 
Check your iat and exp values in the JWT claim.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.'})" }, { "text": "Origin of Solution (Mage Slack-Channel): https://mageai.slack.com/archives/C03HTTWFEKE/p1706543947795599\nProblem: This error can often be seen after solving the error mentioned in 2.2.4. The error can be found in Mage version 0.9.61 and is a side-effect of the update of the code for data-loader blocks.\nNote: Mage 0.9.62 has been released, as of Feb 5 2024. Please recheck; the solution below may be obsolete.\nSolution: Use a \u201cfixed\u201d version of the Docker container.\nPull the updated Docker image from Docker Hub:\ndocker pull mageai/mageai:alpha\nUpdate docker-compose.yaml:\nversion: '3'\nservices:\n  magic:\n    image: mageai/mageai:alpha  <--- instead of the \u201clatest\u201d tag\nThen run docker-compose up.\nThe original error is still present, but the SQL query will return the desired result:\n--------------------------------------------------------------------------------------", "section": "Module 2: Workflow Orchestration", "question": "Mage - 2.2.4 IndexError: list index out of range" }, { "text": "Add\nif not path.parent.is_dir():\n    path.parent.mkdir(parents=True)\npath = Path(path).as_posix()\nsee:\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1675774214591809?thread_ts=1675768839.028879&cid=C01FABYF2RG", "section": "Module 2: Workflow Orchestration", "question": "2.2.6 OSError: Cannot save file into a non-existent directory: '..\\\\..\\\\data\\\\yellow'\\n\")" }, { "text": "The video DE Zoomcamp 2.2.7 is missing the actual deployment of Mage using Terraform to GCP. The steps for the deployment were not covered in the video.\nI successfully deployed it and wanted to share some key points:\nIn variables.tf, set the project_id default value to your GCP project ID.\nEnable the Cloud Filestore API:\nVisit the Google Cloud Console.\nNavigate to \"APIs & Services\" > \"Library.\"\nSearch for \"Cloud Filestore API.\"\nClick on the API and enable it.\nTo perform the deployment:\nterraform init\nterraform apply\nPlease note that during the terraform apply step, Terraform will prompt you to enter the PostgreSQL password. After that, it will ask for confirmation to proceed with the deployment. Review the changes, type 'yes' when prompted, and press Enter.", "section": "Module 2: Workflow Orchestration", "question": "GCP - 2.2.7d Deploying Mage to GCP" }, { "text": "If you want to run multiple Docker containers from different directories, make sure to change the port mappings in the docker-compose.yml file.\nports:\n- 8088:6789\nThe 8088 port in the above case is the host port, where Mage will run on your local machine. You can customize this as long as the port is available. If you are running on a VM, make sure to forward the port too.
You need to keep the container port at 6789 as this is the port where Mage is running.\nGCP - 2.2.7d Deploying Mage to Google Cloud\nWhile terraforming all the resources from inside a VM created in GCP, the following error is shown.\nError log:\nmodule.lb-http.google_compute_backend_service.default[\"default\"]: Creating...\n\u2577\n\u2502 Error: Error creating GlobalAddress: googleapi: Error 403: Request had insufficient authentication scopes.\n\u2502 Details:\n\u2502 [\n\u2502 {\n\u2502 \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n\u2502 \"domain\": \"googleapis.com\",\n\u2502 \"metadatas\": {\n\u2502 \"method\": \"compute.beta.GlobalAddressesService.Insert\",\n\u2502 \"service\": \"compute.googleapis.com\"\n\u2502 },\n\u2502 \"reason\": \"ACCESS_TOKEN_SCOPE_INSUFFICIENT\"\n\u2502 }\n\u2502 ]\n\u2502\n\u2502 More details:\n\u2502 Reason: insufficientPermissions, Message: Insufficient Permission\nThis error might happen when you are using a VM inside GCP. To use the Google APIs from a GCP virtual machine you need to add the cloud platform scope (\"https://www.googleapis.com/auth/cloud-platform\") to your VM when it is created.\nSince ours is already created you can just stop it and change the permissions. You can do it in the console: just go to \"EDIT\", go all the way down until you find \"Cloud API access scopes\". There you can \"Allow full access to all Cloud APIs\". I did this and all went smoothly generating all the resources needed. Hope it helps if you encounter this same error.\nResources: https://stackoverflow.com/questions/35928534/403-request-had-insufficient-authentication-scopes-during-gcloud-container-clu", "section": "Module 2: Workflow Orchestration", "question": "Running Multiple Mage instances in Docker from different directories" }, { "text": "If you are on the free trial account on GCP you will face this issue when trying to deploy the infrastructure with Terraform. This service is not available for this kind of account.\nThe solution I found was to delete the load_balancer.tf file and to comment or delete the rows that differentiate it in the main.tf file. After this just do terraform destroy to delete any infrastructure created on the failed attempts and re-run terraform apply.\nCode on main.tf to comment/delete:\nLine 166, 167, 168", "section": "Module 2: Workflow Orchestration", "question": "GCP - 2.2.7d Load Balancer Problem (Security Policies quota)" }, { "text": "If you get the following error,\nyou have to edit variables.tf in the gcp folder and set your project-id, region and zones properly. Then, run terraform apply again.\nYou can find correct regions/zones here: https://cloud.google.com/compute/docs/regions-zones\nDeploying MAGE to GCP with Terraform via the VM (2.2.7)\nFYI - It can take up to 20 minutes to deploy the MAGE Terraform files if you are using a GCP Virtual Machine. It is normal, so don\u2019t interrupt the process or think it\u2019s taking too long. If you have, make sure you run a terraform destroy before trying again as you will have likely partially created resources which will cause errors next time you run `terraform apply`.\n`terraform destroy` may not completely delete partial resources - go to Google Cloud Console and use the search bar at the top to search for the \u2018app.name\u2019 you declared in your variables.tf file; this will list all resources with that name - make sure you delete them all before running `terraform apply` again.\nWhy are my GCP free credits going so fast?
MAGE .tf files - Terraform Destroy not destroying all Resources\nI checked my GCP billing last night & the MAGE Terraform IaC didn't destroy a GCP resource called Filestore (name starting with \u2018mage-data-prep-\u2019). It has been costing \u00a35.01 of my free credits each day, and I now have \u00a3151 left - Alexey has assured me that this amount WILL BE SUFFICIENT to finish the course. Note to anyone who had issues deploying the MAGE terraform code: check your billing account to see what you're being charged for (main menu - billing) (even if it's your free credits) and run a search for 'mage-data-prep' in the top bar just to be sure that your resources have been destroyed - if any come up delete them.", "section": "Module 2: Workflow Orchestration", "question": "GCP - 2.2.7d Part 2 - Getting error when you run terraform apply" }, { "text": "```\n\u2502 Error: Error creating Connector: googleapi: Error 403: Permission 'vpcaccess.connectors.create' denied on resource '//vpcaccess.googleapis.com/projects/<ommit>/locations/us-west1' (or it may not exist).\n\u2502 Details:\n\u2502 [\n\u2502 {\n\u2502 \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n\u2502 \"domain\": \"vpcaccess.googleapis.com\",\n\u2502 \"metadata\": {\n\u2502 \"permission\": \"vpcaccess.connectors.create\",\n\u2502 \"resource\": \"projects/<ommit>/locations/us-west1\"\n\u2502 },\n\u2502 \"reason\": \"IAM_PERMISSION_DENIED\"\n\u2502 }\n\u2502 ]\n\u2502\n\u2502 with google_vpc_access_connector.connector,\n\u2502 on fs.tf line 19, in resource \"google_vpc_access_connector\" \"connector\":\n\u2502 19: resource \"google_vpc_access_connector\" \"connector\" {\n\u2502\n```\nSolution: Add the Serverless VPC Access Admin role to the Service Account.\nLine 148", "section": "Module 2: Workflow Orchestration", "question": "Question: Permission 'vpcaccess.connectors.create'" }, { "text": "Git won\u2019t push an empty folder to GitHub, so if you put a file in that folder and then push, then you should be good to go.\nOr - in your code - make the folder if it doesn\u2019t exist using Pathlib as shown here: https://stackoverflow.com/a/273227/4590385.\nFor some reason, when using GitHub storage, the relative path for writing locally no longer works. Try using two separate paths: one full path for the local write, and the original relative path for the GCS bucket upload.", "section": "Module 2: Workflow Orchestration", "question": "File Path: Cannot save file into a non-existent directory: 'data/green'" }, { "text": "The green dataset contains lpep_pickup_datetime while the yellow contains tpep_pickup_datetime. Modify the script(s) depending on the dataset as required.", "section": "Module 2: Workflow Orchestration", "question": "No column name lpep_pickup_datetime / tpep_pickup_datetime" }, { "text": "Read the CSV in chunks with pd.read_csv:\ndf_iter = pd.read_csv(dataset_url, iterator=True, chunksize=100000)\nThe data then needs to be appended to the parquet file using the fastparquet engine:\ndf.to_parquet(path, compression=\"gzip\", engine='fastparquet', append=True)", "section": "Module 2: Workflow Orchestration", "question": "Process to download the CSV using Pandas is killed right away" }, { "text": "denied: requested access to the resource is denied\nThis can happen when you\nHaven't logged in properly to Docker Desktop (use docker login -u \"myusername\")\nHave used the wrong username when pushing docker images.
Use the same username you logged in with, both when building and when pushing:\ndocker image build -t <myusername>/<imagename>:<tag>\ndocker image push <myusername>/<imagename>:<tag>", "section": "Module 2: Workflow Orchestration", "question": "Push to docker image failure" }, { "text": "16:21:35.607 | INFO | Flow run 'singing-malkoha' - Executing 'write_bq-b366772c-0' immediately...\nKilled\nSolution: You probably are running out of memory on your VM and need to add more. For example, if you have 8 gigs of RAM on your VM, you may want to expand that to 16 gigs.", "section": "Module 2: Workflow Orchestration", "question": "Flow script fails with \u201ckilled\u201d message:" }, { "text": "After playing around with prefect for a while this can happen.\nSSH into your VM and run sudo du -h --block-size=G | sort -n -r | head -n 30 to see which directories take up the most space.\nMost likely it will be \u2026/.prefect/storage, where your cached flows are stored. You can delete older flows from there. You also have to delete the corresponding flow in the UI, otherwise it will throw an error when you try to run your next flow.\nSSL Certificate Verify: (I got it when trying to run flows on a Mac): urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]\npip install certifi\n/Applications/Python\\ {ver}/Install\\ Certificates.command\nor\nrunning the \u201cInstall Certificate.command\u201d inside of the python{ver} folder", "section": "Module 2: Workflow Orchestration", "question": "GCP VM: Disk Space is full" }, { "text": "It means your container consumed all available RAM allocated to it. It can happen in particular when working on Question #3 in the homework, as the dataset is relatively large and containers eat a lot of memory in general.\nI would recommend restarting your computer and only starting the necessary processes to run the container. If that doesn\u2019t work, allocate more resources to Docker. If that also doesn\u2019t work because your workstation is a potato, you can use an online compute environment service like GitPod, which is free under 50 hours / month of use.", "section": "Module 2: Workflow Orchestration", "question": "Docker: container crashed with status code 137." }, { "text": "In Q3 there was a task to run the ETL script from web to GCS. The problem was, it wasn\u2019t really an ETL straight from web to GCS, but it was actually a web to local storage to local memory to GCS over network ETL.
Yellow data is about 100 MB per month compressed and ~700 MB uncompressed in memory.\nThis leads to a problem where I either got a network-type error because of my not-so-good internet connection, or my WSL2 crashed/hung because of an out-of-memory error and/or a 100% resource usage hang.\nSolution:\nIf you have a lot of time at hand, try compressing it to parquet and writing it to GCS with the timeout argument set to a really high number (the default is 60 seconds).\nThe yellow taxi data for Feb 2019 is about 100 MB as a parquet file.\ngcp_cloud_storage_bucket_block.upload_from_path(\n    from_path=f\"{path}\",\n    to_path=path,\n    timeout=600\n)", "section": "Module 2: Workflow Orchestration", "question": "Timeout due to slow upload internet" }, { "text": "This error occurs when you try to re-run the export block that writes the transformed green_taxi data to PostgreSQL.\nWhat you\u2019ll need to do is to drop the table using SQL in Mage, e.g. DROP TABLE mage.green_taxi (the screenshot is not included here).\nYou should be able to re-run the block successfully after dropping the table.", "section": "Module 2: Workflow Orchestration", "question": "UndefinedColumn: column \"ratecode_id\", \"rate_code_id\" \u201cvendor_id\u201d, \u201cpu_location_id\u201d, \u201cdo_location_id\u201d of relation \"green_taxi\" does not exist - Export transformed green_taxi data to PostgreSQL" }, { "text": "SettingWithCopyWarning:\nA value is trying to be set on a copy of a slice from a DataFrame.\nUse the data.loc[] = value syntax instead of df[] = value to ensure that the new column is being assigned to the original dataframe instead of a copy of a dataframe or a series.", "section": "Module 2: Workflow Orchestration", "question": "Homework - Q3 SettingWithCopyWarning Error:" }, { "text": "The NYC data CSV files are very big, so instead of using the Pandas/Python kernel we can try the PySpark kernel.\nMage documentation for using the PySpark kernel: https://docs.mage.ai/integrations/spark-pyspark", "section": "Module 2: Workflow Orchestration", "question": "Since I was using slow laptop, and we have so big csv files, I used pyspark kernel in mage instead of python, How to do it?" }, { "text": "First delete the connection between the blocks; then you can remove the block.", "section": "Module 2: Workflow Orchestration", "question": "I got an error when I was deleting BLOCK IN A PIPELINE" }, { "text": "While editing the pipeline name, Mage throws a permission denied error.\n(Workaround) Proceed with your work and save; when you revisit it later, it will let you edit the name.", "section": "Module 2: Workflow Orchestration", "question": "Mage UI won\u2019t let you edit the Pipeline name?"
}, { "text": "Solution n\u00b01 if you want to download everything :\n```\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nfrom pyarrow.fs import GcsFileSystem\n\u2026\n@data_loader\ndef load_data(*args, **kwargs):\n bucket_name = YOUR_BUCKET_NAME_HERE'\n blob_prefix = 'PATH / TO / WHERE / THE / PARTITIONS / ARE'\n root_path = f\"{bucket_name}/{blob_prefix}\"\npa_table = pq.read_table(\n source=root_path,\n filesystem=GcsFileSystem(), \n )\n\n return pa_table.to_pandas()\nSolution n\u00b02 if you want to download only some dates :\n@data_loader\ndef load_data(*args, **kwargs):\ngcs = pa.fs.GcsFileSystem()\nbucket_name = 'YOUR_BUCKET_NAME_HERE'\nblob_prefix = ''PATH / TO / WHERE / THE / PARTITIONS / ARE''\nroot_path = f\"{bucket_name}/{blob_prefix}\"\npa_dataset = pq.ParquetDataset(\npath_or_paths=root_path,\nfilesystem=gcs,\nfilters=[('lpep_pickup_date', '>=', '2020-10-01'), ('lpep_pickup_date', '<=', '2020-10-31')]\n)\nreturn pa_dataset.read().to_pandas()\n# More information about the pq.Parquet.Dataset : Encapsulates details of reading a complete Parquet dataset possibly consisting of multiple files and partitions in subdirectories. Documentation here :\nhttps://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset\nERROR: UndefinedColumn: column \"vendor_id\" of relation \"green_taxi\" does not exist\nTwo possible solutions both of them work in the same way.\nOpen up a Data Loader connect using SQL - RUN the command \n`DROP TABLE mage.green_taxi`\nElse, Open up a Data Extractor of SQL - increase the rows to above the number of rows in the dataframe (you can find that in the bottom of the transformer block) change the Write Policy to `Replace` and run the SELECT statement", "section": "Module 2: Workflow Orchestration", "question": "How do I make Mage load the partitioned files that we created on 2.2.4, to load them into BigQuery ?" }, { "text": "All mage files are in your /home/src/folder where you saved your credentials.json so you should be able to access them locally. You will see a folder for \u2018Pipelines\u2019, 'data loaders', 'data transformers' & 'data exporters' - inside these will be the .py or .sql files for the blocks you created in your pipeline.\nRight click & \u2018download\u2019 the pipeline itself to your local machine (which gives you metadata, pycache and other files)\nAs above, download each .py/.sql file that corresponds to each block you created for the pipeline. You'll find these under 'data loaders', 'data transformers' 'data exporters'\nMove the downloaded files to your GitHub repo folder & commit your changes.", "section": "Module 2: Workflow Orchestration", "question": "Git - What Files Should I Submit for Homework 2 & How do I get them out of MAGE:" }, { "text": "Assuming you downloaded the Mage repo in the week 2 folder of the Data Engineering Zoomcamp, you might want to include your mage copy, demo pipelines and homework within your personal copy of the Data Engineering Zoomcamp repo. This will not work by default, because GitHub sees them as two separate repositories, and one does not track the other. 
To add the Mage files to your main DE Zoomcamp repo, you will need to:\nMove the contents of the Mage .gitignore file into your main .gitignore.\nUse the terminal to cd into the Mage folder and:\nrun \u201cgit remote remove origin\u201d to de-couple the Mage repo,\nrun \u201crm -rf .git\u201d to delete the local git files,\nrun \u201cgit add .\u201d to add the current folder as changes to stage, then commit and push.", "section": "Module 2: Workflow Orchestration", "question": "Git - How do I include the files in the Mage repo (including exercise files and homework) in a personal copy of the Data Engineering Zoomcamp repo?" }, { "text": "When trying to add three assertions:\nvendor_id is one of the existing values in the column (currently)\npassenger_count is greater than 0\ntrip_distance is greater than 0\nto test_output, I got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Below is my code:\ndata_filter = (data['passenger_count'] > 0) and (data['trip_distance'] > 0)\nAfter looking for solutions on Stack Overflow, I found a great discussion about it. So I changed my code to:\ndata_filter = (data['passenger_count'] > 0) & (data['trip_distance'] > 0)", "section": "Module 2: Workflow Orchestration", "question": "Got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" }, { "text": "This happened when I had just booted up my PC, continuing from the progress I made yesterday.\nAfter cd-ing into your directory and running docker compose up, the Mage web interface shows, but the files that I had yesterday were gone.\nIf your files are gone, go ahead and close the web interface, and properly shut down the Mage docker compose by pressing Ctrl + C once. Try running it again. This worked for me more than once (yes, the issue persisted with my PC twice).\nAlso, you should check that you\u2019re in the correct repository before doing docker compose up. This was discussed in the Slack #course-data-engineering channel.", "section": "Module 2: Workflow Orchestration", "question": "Mage AI Files are Gone/disappearing" }, { "text": "The above error is due to a \u201c quote at the trailing side; the value needs to be wrapped with ' quotes at both ends instead.\nKrishna Anand", "section": "Module 2: Workflow Orchestration", "question": "Mage - Errors in io_config.yaml file" }, { "text": "Problem: The following error occurs when attempting to export data from Mage to a GCS bucket using pyarrow, suggesting Mage doesn\u2019t have the necessary permissions to access the specified GCP credentials .json file.\nArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetBucketMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: Cannot open credentials file /home/src/...\nSolution: Inside the Mage app:\nCreate a credentials folder (e.g. gcp-creds) within the magic-zoomcamp folder\nIn the credentials folder create a .json key file (e.g. mage-gcp-creds.json)\nCopy/paste the GCP service account credentials into the .json key file and save\nUpdate the code to point to this file.
E.g.\nenviron['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/magic-zoomcamp/gcp-creds/mage-gcp-creds.json'", "section": "Module 2: Workflow Orchestration", "question": "Mage - ArrowException Cannot open credentials file" }, { "text": "Oserror: google::cloud::status(unavailable: retry policy exhausted getbucketmetadata: could not create a OAuth2 access token to authenticate the request. the request was not sent, as such an access token is required to complete the request successfully. learn more about google cloud authentication at https://cloud.google.com/docs/authentication. the underlying error message was: performwork() - curl error [6]=couldn't resolve host name)", "section": "Module 2: Workflow Orchestration", "question": "Mage - OSError" }, { "text": "Problem: The following error occurs when attempting to export data from Mage to a GCS bucket. Assigned service account doesn\u2019t have the necessary permissions access Google Cloud Storage Bucket\nPermissionError: [Errno 13] google::cloud::Status(PERMISSION_DENIED: Permanent error GetBucketMetadata:... .iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist). error_info={reason=forbidden, domain=global, metadata={http_status_code=403}}). Detail: [errno 13] Permission denied\nSolution: Add Cloud Storage Admin role to the service account:\nGo to project in Google Cloud Console>IAM & Admin>IAM\nClick Edit principal (pencil symbol) to the right of the service account you are using\nClick + ADD ANOTHER ROLE\nSelect Cloud Storage>Storage Admin\nClick Save", "section": "Module 2: Workflow Orchestration", "question": "Mage - PermissionError service account does not have storage.buckets.get access to the Google Cloud Storage bucket" }, { "text": "1. Make sure your pyspark script is ready to be send to Dataproc cluster\n2. Create a Dataproc Cluster in GCP Console\n3. Make sure to edit the service account and add new role - Dataproc Editor\n4. Copy the python script ./notebooks/pyspark_script.py and place it under GCS bucket path\n5. Make sure gcloud cli is installed either in Mage manually or via your Dockerfile and docker-compose files. This is needed to let Mage access google Dataproc and the script it needs to execute. Refer - Installing the latest gcloud CLI\n6. Use the Bigquery/Dataproc script mentioned here - https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/code/cloud.md . 
Use Mage to trigger the query", "section": "Module 3: Data Warehousing", "question": "Trigger Dataproc from Mage" }, { "text": "A:\n1) Add the -y flag, so that apt-get automatically agrees to install additional packages.\n2) Use Python's built-in zipfile module, which is included in all modern Python distributions.", "section": "Module 3: Data Warehousing", "question": "Docker-compose takes infinitely long to install zip unzip packages for linux, which are required to unpack datasets" }, { "text": "Make sure to use nullable data types, such as Int64, when applicable.", "section": "Module 3: Data Warehousing", "question": "GCS Bucket - error when writing data from web to GCS:" }, { "text": "Ultimately, when trying to ingest data into a BigQuery table, all files within a given directory must have the same schema.\nWhen dealing for example with the FHV datasets from 2019, however (see image below), one can see that the files for '2019-05' and '2019-06' have the columns \"PUlocationID\" and \"DOlocationID\" as INTEGER, while for the period of '2019-01' through '2019-04' the same columns are defined as FLOAT.\nSo while importing these files as parquet to BigQuery, the first one will be used to define the schema of the table, while all files following that will be used to append data to the existing table. This means they must all follow the very same schema as the file that created the table.\nSo, in order to prevent errors like that, make sure to enforce the data types for the columns on the DataFrame before you serialize/upload them to BigQuery. Like this:\npd.read_csv(\"path_or_url\").astype({\n    \"col1_name\": \"datatype\",\n    \"col2_name\": \"datatype\",\n    ...\n    \"colN_name\": \"datatype\"\n})", "section": "Module 3: Data Warehousing", "question": "GCS Bucket - Failed to create table: Error while reading data, error message: Parquet column 'XYZ' has type INT which does not match the target cpp_type DOUBLE. File: gs://path/to/some/blob.parquet" }, { "text": "If you receive the error gzip.BadGzipFile: Not a gzipped file (b'\\n\\n'), this is because you have specified the wrong URL to the FHV dataset. Make sure to use https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/{dataset_file}.csv.gz\nEmphasising the \u2018/releases/download\u2019 part of the URL.", "section": "Module 3: Data Warehousing", "question": "GCS Bucket - Fix Error when importing FHV data to GCS" }, { "text": "Krishna Anand", "section": "Module 3: Data Warehousing", "question": "GCS Bucket - Load Data From URL list in to GCP Bucket" }, { "text": "Check the schema.\nYou might have wrong formatting.\nTry to upload the CSV.GZ files as downloaded via wget, without reformatting or going through pandas.\nSee this Slack conversation for helpful tips.", "section": "Module 3: Data Warehousing", "question": "GCS Bucket - I query my dataset and get a Bad character (ASCII 0) error?" }, { "text": "Run the following command to check if \u201cBigQuery Command Line Tool\u201d is installed or not: gcloud components list\nYou can also use bq.cmd instead of bq to make it work.", "section": "Module 3: Data Warehousing", "question": "GCP BQ - \u201cbq: command not found\u201d" }, { "text": "Use BigQuery carefully.\nI created my BigQuery dataset on an account where my free trial was exhausted, and got a bill of $80.\nUse BigQuery within your free credits and destroy all the datasets after creation.\nCheck your Billing daily!
Especially if you\u2019ve spun up a VM.", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Caution in using BigQuery" }, { "text": "Be careful when you create your resources on GCP: all of them have to share the same region in order to allow loading data from a GCS bucket into BigQuery. If you forgot this when you created them, you can create a new dataset in BigQuery using the same region you used for your GCS bucket.\nThis error means that your GCS bucket and the BigQuery dataset are placed in different regions. You have to create a new dataset inside BigQuery in the same region as your GCS bucket and store the data in the newly created dataset.", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Cannot read and write in different locations: source: EU, destination: US - Loading data from GCS into BigQuery (different Region):" }, { "text": "Make sure to create the BigQuery dataset in the very same location in which you've created the GCS bucket. For instance, if your GCS bucket was created in `us-central1`, then the BigQuery dataset must be created in the same region (us-central1, in this example).", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Cannot read and write in different locations: source: <REGION_HERE>, destination: <ANOTHER_REGION_HERE>" }, { "text": "By the way, this isn\u2019t a problem/solution, but a useful hint:\nPlease remember to save your progress in the BigQuery SQL Editor.\nI was almost finished with the homework when my Chrome tab froze and I had to reload it. Then I lost my entire SQL script.\nSave your script from time to time. Just click on the button at the top bar. Your saved file will be available on the left panel.\nAlternatively, you can copy-paste your queries into an .sql file in your preferred editor (Notepad++, VS Code, etc.). Using the .sql extension will provide convenient color formatting.", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Remember to save your queries" }, { "text": "Ans: While real-time analytics might not be explicitly mentioned, BigQuery has real-time data streaming capabilities, allowing for potential integration in future project iterations.", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Can I use BigQuery for real-time analytics in this project?" }, { "text": "could not parse 'pickup_datetime' as timestamp for field pickup_datetime (position 2)\nThis error is caused by invalid data in the timestamp column. A way to identify the problem is to define the schema for the external table using the STRING datatype.
This enables the queries to work at which point we can filter out the invalid rows from the import to the materialised table and insert the fields with the timestamp data type.", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Unable to load data from external tables into a materialized table in BigQuery due to an invalid timestamp error that are added while appending data to the file in Google Cloud Storage" }, { "text": "Background:\n`pd.read_parquet`\n`pd.to_datetime`\n`pq.write_to_dataset`\nReference:\nhttps://stackoverflow.com/questions/48314880/are-parquet-file-created-with-pyarrow-vs-pyspark-compatible\nhttps://stackoverflow.com/questions/57798479/editing-parquet-files-with-python-causes-errors-to-datetime-format\nhttps://www.reddit.com/r/bigquery/comments/16aoq0u/parquet_timestamp_to_bq_coming_across_as_int/?share_id=YXqCs5Jl6hQcw-kg6-VgF&utm_content=1&utm_medium=ios_app&utm_name=ioscss&utm_source=share&utm_term=1\nSolution:\nAdd `use_deprecated_int96_timestamps=True` to `pq.write_to_dataset` function, like below\npq.write_to_dataset(\ntable,\nroot_path=root_path,\nfilesystem=gcs,\nuse_deprecated_int96_timestamps=True\n# Write timestamps to INT96 Parquet format\n)", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Error Message in BigQuery: annotated as a valid Timestamp, please annotate it as TimestampType(MICROS) or TimestampType(MILLIS)" }, { "text": "Solution:\nIf you\u2019re using Mage, in the last Data Exporter that writes to Google Cloud Storage use PyArrow to generate the Parquet file with the correct logical type for the datetime columns, otherwise they won't be converted to timestamp when loaded by BigQuery later on.\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport os\nif 'data_exporter' not in globals():\nfrom mage_ai.data_preparation.decorators import data_exporter\n# Replace with the location of your service account key JSON file.\nos.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/personal-gcp.json'\nbucket_name = \"<YOUR_BUCKET_NAME>\"\nobject_key = 'nyc_taxi_data_2022.parquet'\nwhere = f'{bucket_name}/{object_key}'\n@data_exporter\ndef export_data(data, *args, **kwargs):\ntable = pa.Table.from_pandas(data, preserve_index=False)\ngcs = pa.fs.GcsFileSystem()\npq.write_table(\ntable,\nwhere,\n# Convert integer columns in Epoch milliseconds\n# to Timestamp columns in microseconds ('us') so\n# they can be loaded into BigQuery with the right\n# data type\ncoerce_timestamps='us',\nfilesystem=gcs\n)\nSolution 2:\nIf you\u2019re using Mage, in the last Data Exporter that writes to Google Cloud Storage, provide PyArrow with explicit schema to generate the Parquet file with the correct logical type for the datetime columns, otherwise they won't be converted to timestamp when loaded by BigQuery later on.\nschema = pa.schema([\n('vendor_id', pa.int64()),\n('lpep_pickup_datetime', pa.timestamp('ns')),\n('lpep_dropoff_datetime', pa.timestamp('ns')),\n('store_and_fwd_flag', pa.string()),\n('ratecode_id', pa.int64()),\n('pu_location_id', pa.int64()),\n('do_location_id', pa.int64()),\n('passenger_count', pa.int64()),\n('trip_distance', pa.float64()),\n('fare_amount', pa.float64()),\n('extra', pa.float64()),\n('mta_tax', pa.float64()),\n('tip_amount', pa.float64()),\n('tolls_amount', pa.float64()),\n('improvement_surcharge', pa.float64()),\n('total_amount', pa.float64()),\n('payment_type', pa.int64()),\n('trip_type', pa.int64()),\n('congestion_surcharge', pa.float64()),\n('lpep_pickup_month', pa.int64())\n])\ntable = 
pa.Table.from_pandas(data, schema=schema)", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Datetime columns in Parquet files created from Pandas show up as integer columns in BigQuery" }, { "text": "Reference:\nhttps://cloud.google.com/bigquery/docs/external-data-cloud-storage\nSolution:\nfrom google.cloud import bigquery\n# Set table_id to the ID of the table to create\ntable_id = f\"{project_id}.{dataset_name}.{table_name}\"\n# Construct a BigQuery client object\nclient = bigquery.Client()\n# Set the external source format of your table\nexternal_source_format = \"PARQUET\"\n# Set the source_uris to point to your data in Google Cloud\nsource_uris = [ f'gs://{bucket_name}/{object_key}/*']\n# Create ExternalConfig object with external source format\nexternal_config = bigquery.ExternalConfig(external_source_format)\n# Set source_uris that point to your data in Google Cloud\nexternal_config.source_uris = source_uris\nexternal_config.autodetect = True\ntable = bigquery.Table(table_id)\n# Set the external data configuration of the table\ntable.external_data_configuration = external_config\ntable = client.create_table(table) # Make an API request.\nprint(f'Created table with external source: {table_id}')\nprint(f'Format: {table.external_data_configuration.source_format}')", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Create External Table using Python" }, { "text": "Reference:\nhttps://stackoverflow.com/questions/60941726/can-bigquery-api-overwrite-existing-table-view-with-create-table-tables-inser\nSolution:\nCombine with \u201cCreate External Table using Python\u201d, use it before \u201cclient.create_table\u201d function.\ndef tableExists(tableID, client):\n\"\"\"\nCheck if a table already exists using the tableID.\nreturn : (Boolean)\n\"\"\"\ntry:\ntable = client.get_table(tableID)\nreturn True\nexcept Exception as e: # NotFound:\nreturn False", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Check BigQuery Table Exist And Delete" }, { "text": "To avoid this error you can upload data from Google Cloud Storage to BigQuery through BigQuery Cloud Shell using the command:\n$ bq load --autodetect --allow_quoted_newlines --source_format=CSV dataset_name.table_name \"gs://dtc-data-lake-bucketname/fhv/fhv_tripdata_2019-*.csv.gz\"", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Error: Missing close double quote (\") character" }, { "text": "Solution: This problem arises if your gcs and bigquery storage is in different regions.\nOne potential way to solve it:\nGo to your google cloud bucket and check the region in field named \u201cLocation\u201d\nNow in bigquery, click on three dot icon near your project name and select create dataset.\nIn region filed choose the same regions as you saw in your google cloud bucket", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Cannot read and write in different locations: source: asia-south2, destination: US" }, { "text": "There are multiple benefits of using Cloud Functions to automate tasks in Google Cloud.\nUse below Cloud Function python script to load files directly to BigQuery. 
Use your project id, dataset id & table id as defined by you.\nimport tempfile\nimport requests\nimport logging\nfrom google.cloud import bigquery\ndef hello_world(request):\n# table_id = <project_id.dataset_id.table_id>\ntable_id = 'de-zoomcap-project.dezoomcamp.fhv-2019'\n# Create a new BigQuery client\nclient = bigquery.Client()\nfor month in range(4, 13):\n# Define the schema for the data in the CSV.gz files\nurl = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-{:02d}.csv.gz'.format(month)\n# Download the CSV.gz file from Github\nresponse = requests.get(url)\n# Create new table if loading first month data else append\nwrite_disposition_string = \"WRITE_APPEND\" if month > 1 else \"WRITE_TRUNCATE\"\n# Defining LoadJobConfig with schema of table to prevent it from changing with every table\njob_config = bigquery.LoadJobConfig(\nschema=[\nbigquery.SchemaField(\"dispatching_base_num\", \"STRING\"),\nbigquery.SchemaField(\"pickup_datetime\", \"TIMESTAMP\"),\nbigquery.SchemaField(\"dropOff_datetime\", \"TIMESTAMP\"),\nbigquery.SchemaField(\"PUlocationID\", \"STRING\"),\nbigquery.SchemaField(\"DOlocationID\", \"STRING\"),\nbigquery.SchemaField(\"SR_Flag\", \"STRING\"),\nbigquery.SchemaField(\"Affiliated_base_number\", \"STRING\"),\n],\nskip_leading_rows=1,\nwrite_disposition=write_disposition_string,\nautodetect=True,\nsource_format=\"CSV\",\n)\n# Load the data into BigQuery\n# Create a temporary file to prevent the exception- AttributeError: 'bytes' object has no attribute 'tell'\"\nwith tempfile.NamedTemporaryFile() as f:\nf.write(response.content)\nf.seek(0)\njob = client.load_table_from_file(\nf,\ntable_id,\nlocation=\"US\",\njob_config=job_config,\n)\njob.result()\nlogging.info(\"Data for month %d successfully loaded into table %s.\", month, table_id)\nreturn 'Data loaded into table {}.'.format(table_id)", "section": "Module 3: Data Warehousing", "question": "GCP BQ - Tip: Using Cloud Function to read csv.gz files from github directly to BigQuery in Google Cloud:" }, { "text": "You need to uncheck cache preferences in query settings", "section": "Module 3: Data Warehousing", "question": "GCP BQ - When querying two different tables external and materialized you get the same result when count(distinct(*))" }, { "text": "Problem: When you inject data into GCS using Pandas, there is a chance that some dataset has missing values on DOlocationID and PUlocationID. Pandas by default will cast these columns as float data type, causing inconsistent data type between parquet in GCS and schema defined in big query. You will see something like this:\nSolution:\nFix the data type issue in data pipeline\nBefore injecting data into GCS, use astype and Int64 (which is different from int64 and accept both missing value and integer exist in the column) to cast the columns.\nSomething like:\ndf[\"PUlocationID\"] = df.PUlocationID.astype(\"Int64\")\ndf[\"DOlocationID\"] = df.DOlocationID.astype(\"Int64\")\nNOTE: It is best to define the data type of all the columns in the Transformation section of the ETL pipeline before loading to BigQuery", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ - How to handle type error from big query and parquet data?" 
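Following up on the note above about defining the data types of all the columns before loading to BigQuery: here is a minimal, hedged sketch of what that can look like before writing the parquet file that gets loaded. The file name and exact column list are only examples (they follow the FHV files discussed in this FAQ); adjust them to your own dataset.
```python
import pandas as pd

# "Int64" (capital I) is pandas' nullable integer dtype, so missing
# PUlocationID / DOlocationID values no longer force the column to float.
dtypes = {
    "PUlocationID": "Int64",
    "DOlocationID": "Int64",
    "SR_Flag": "string",
    "dispatching_base_num": "string",
}

df = pd.read_csv("fhv_tripdata_2019-01.csv.gz", dtype=dtypes)  # example file name
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["dropOff_datetime"] = pd.to_datetime(df["dropOff_datetime"])

# The parquet schema now has INT64 location IDs and TIMESTAMP datetimes,
# matching what the BigQuery table expects.
df.to_parquet("fhv_tripdata_2019-01.parquet")
```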
}, { "text": "Problem occurs when misplacing content after fro``m clause in BigQuery SQLs.\nCheck to remove any extra apaces or any other symbols, keep in lowercases, digits and dashes only", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ - Invalid project ID . Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project" }, { "text": "No. Based on the documentation for Bigquery, it does not support more than 1 column to be partitioned.\n[source]", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ - Does BigQuery support multiple columns partition?" }, { "text": "Error Message:\nPARTITION BY expression must be DATE(<timestamp_column>), DATE(<datetime_column>), DATETIME_TRUNC(<datetime_column>, DAY/HOUR/MONTH/YEAR), a DATE column, TIMESTAMP_TRUNC(<timestamp_column>, DAY/HOUR/MONTH/YEAR), DATE_TRUNC(<date_column>, MONTH/YEAR), or RANGE_BUCKET(<int64_column>, GENERATE_ARRAY(<int64_value>, <int64_value>[, <int64_value>]))\nSolution:\nConvert the column to datetime first.\ndf[\"pickup_datetime\"] = pd.to_datetime(df[\"pickup_datetime\"])\ndf[\"dropOff_datetime\"] = pd.to_datetime(df[\"dropOff_datetime\"])", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ - DATE() Error in BigQuery" }, { "text": "Native tables are tables where the data is stored in BigQuery. External tables store the data outside BigQuery, with BigQuery storing metadata about that external table.\nResources:\nhttps://cloud.google.com/bigquery/docs/external-tables\nhttps://cloud.google.com/bigquery/docs/tables-intro", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ - Native tables vs External tables in BigQuery?" }, { "text": "Issue: Tried running command to export ML model from BQ to GCS from Week 3\nbq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model\nIt is failing on following error:\nBigQuery error in extract operation: Error processing job Not found: Dataset was not found in location US\nI verified the BQ data set and gcs bucket are in the same region- us-west1. Not sure how it gets location US. I couldn\u2019t find the solution yet.\nSolution: Please enter correct project_id and gcs_bucket folder address. 
My gcs_bucket folder address is\ngs://dtc_data_lake_optimum-airfoil-376815/tip_model", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ ML - Unable to run command (shown in video) to export ML model from BQ to GCS" }, { "text": "To solve this error mention the location = US when creating the dim_zones table\n{{ config(\nmaterialized='table',\nlocation='US'\n) }}\nJust Update this part to solve the issue and run the dim_zones again and then run the fact_trips", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Dim_zones.sql Dataset was not found in location US When Running fact_trips.sql" }, { "text": "Solution: proceed with setting up serving_dir on your computer as in the extract_model.md file. Then instead of\ndocker pull tensorflow/serving\nuse\ndocker pull emacski/tensorflow-serving\nThen\ndocker run -p 8500:8500 -p 8501:8501 --mount type=bind,source=`pwd`/serving_dir/tip_model,target=/models/tip_model -e MODEL_NAME=tip_model -t emacski/tensorflow-serving\nThen run the curl command as written, and you should get a prediction.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "GCP BQ ML - Export ML model to make predictions does not work for MacBook with Apple M1 chip (arm architecture)." }, { "text": "Try deleting data you\u2019ve saved to your VM locally during ETLs\nKill processes related to deleted files\nDownload ncdu and look for large files (pay particular attention to files related to Prefect)\nIf you delete any files related to Prefect, eliminate caching from your flow code", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "VMs - What do I do if my VM runs out of space?" }, { "text": "Ans: What they mean is that they don't want you to do anything more than that. You should load the files into the bucket and create an external table based on those files (but nothing like cleaning the data and putting it in parquet format)", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Homework - What does it mean \u201cStop with loading the files into a bucket.' Stop with loading the files into a bucket?\u201d" }, { "text": "If for whatever reason you try to read parquets directly from nyc.gov\u2019s cloudfront into pandas, you might run into this error:\npyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds\nCause:\nthere is one errant data record where the dropOff_datetime was set to year 3019 instead of 2019.\npandas uses \u201ctimestamp[ns]\u201d (as noted above), and int64 only allows a ~580 year range, centered on 2000. 
See `pd.Timestamp.max` and `pd.Timestamp.min`.\nThis becomes out of bounds when pandas tries to read it, because 3019 > 2262 (the year of pd.Timestamp.max).\nFix:\nUse pyarrow to read it:\nimport pyarrow.parquet as pq\ndf = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)\nHowever, this results in weird timestamps for the offending record.\nAlternatively, read the datetime columns separately using pq.read_table:\n\ntable = pq.read_table('taxi.parquet')\ndatetimes = ['list of datetime column names']\ndf_dts = pd.DataFrame()\nfor col in datetimes:\n    df_dts[col] = pd.to_datetime(table.column(col), errors='coerce')\n\nThe `errors='coerce'` parameter will convert the out-of-bounds timestamps into either the max or the min.\nOr use pyarrow.compute to filter out the offending rows:\n\nimport pyarrow as pa\nimport pyarrow.compute as pc\ntable = pq.read_table('taxi.parquet')\ndf = table.filter(\n    pc.less_equal(table[\"dropOff_datetime\"], pa.scalar(pd.Timestamp.max))\n).to_pandas()", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Homework - Reading parquets from nyc.gov directly into pandas returns Out of bounds error" }, { "text": "Answer: The 2022 NYC taxi data parquet files are available for each month separately. Therefore, you need to add all 12 files to your GCS bucket and then refer to them using the URIs option when creating an external table in BigQuery. You can use the wildcard \"*\" to refer to all 12 files using a single string.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Question: for homework 3, we need all 12 parquet files for green taxi 2022, right?" }, { "text": "This can help avoid schema issues in the homework.\nDownload the files locally and use the \u2018upload files\u2019 button in GCS at the desired path. You can upload many files at once. You can also choose to upload a folder.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Homework - Uploading files to GCS via GUI" }, { "text": "Ans: Take a careful look at the format of the dates in the question.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Homework - Qn 5: The partitioned/clustered table isn\u2019t giving me the prediction I expected" }, { "text": "Many people aren\u2019t getting an exact match, but are very close to one of the options. As per Alexey, choose the closest option.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Homework - Qn 6: Did anyone get an exact match for one of the options given in Module 3 homework Q6?"
}, { "text": "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 41721: invalid start byte\nSolution:\nStep 1: When reading the data from the web into the pandas dataframe mention the encoding as follows:\npd.read_csv(dataset_url, low_memory=False, encoding='latin1')\nStep 2: When writing the dataframe from the local system to GCS as a csv mention the encoding as follows:\ndf.to_csv(path_on_gsc, compression=\"gzip\", encoding='utf-8')\nAlternative: use pd.read_parquet(url)", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Python - invalid start byte Error Message" }, { "text": "A generator is a function in python that returns an iterator using the yield keyword.\nA generator is a special type of iterable, similar to a list or a tuple, but with a crucial difference. Instead of creating and storing all the values in memory at once, a generator generates values on-the-fly as you iterate over it. This makes generators memory-efficient, particularly when dealing with large datasets.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Python - Generators in python" }, { "text": "The read_parquet function supports a list of files as an argument. The list of files will be merged into a single result table.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Python - Easiest way to read multiple files at the same time?" }, { "text": "Incorrect:\ndf['DOlocationID'] = pd.to_numeric(df['DOlocationID'], downcast=integer) or\ndf['DOlocationID'] = df['DOlocationID'].astype(int)\nCorrect:\ndf['DOlocationID'] = df['DOlocationID'].astype('Int64')", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Python - These won't work. You need to make sure you use Int64:" }, { "text": "ValueError: Path /Users/kt/.prefect/storage/44ccce0813ed4f24ab2d3783de7a9c3a does not exist.\nRemove ```cache_key_fn=task_input_hash ``` as it\u2019s in argument in your function & run your flow again.\nNote: catche key is beneficial if you happen to run the code multiple times, it won't repeat the process which you have finished running in the previous run. That means, if you have this ```cache_key``` in your initial run, this might cause the error.", "section": "error: Error while reading table: trips_data_all.external_fhv_tripdata, error message: Parquet column 'DOlocationID' has type INT64 which does not match the target cpp_type DOUBLE.", "question": "Prefect - Error on Running Prefect Flow to Load data to GCS" }, { "text": "@task\ndef download_file(url: str, file_path: str):\nresponse = requests.get(url)\nopen(file_path, \"wb\").write(response.content)\nreturn file_path\n@flow\ndef extract_from_web() -> None:\nfile_path = download_file(url=f'{url-filename}.csv.gz',file_path=f'{filename}.csv.gz')", "section": "Module 4: analytics engineering with dbt", "question": "Prefect - Tip: Downloading csv.gz from a url in a prefect environment (sample snippet)." 
}, { "text": "Update the seed column types in the dbt_project.yaml file\nfor using double : float\nfor using int : numeric\nDBT Cloud production error: prod dataset not available in location EU\nProblem: I am trying to deploy my DBT models to production, using DBT Cloud. The data should live in BigQuery. The dataset location is EU. However, when I am running the model in production, a prod dataset is being create in BigQuery with a location US and the dbt invoke build is failing giving me \"ERROR 404: porject.dataset:prod not available in location EU\". I tried different ways to fix this. I am not sure if there is a more simple solution then creating my project or buckets in location US. Hope anyone can help here.\nNote: Everything is working fine in development mode, the issue is just happening when scheduling and running job in production\nSolution: I created the prod dataset manually in BQ and specified EU, then I ran the job.", "section": "Module 4: analytics engineering with dbt", "question": "If you are getting not found in location us error." }, { "text": "Error: This project does not have a development environment configured. Please create a development environment and configure your development credentials to use the dbt IDE.\nThe error itself tells us how to solve this issue, the guide is here. And from videos @1:42 and also slack chat", "section": "Module 4: analytics engineering with dbt", "question": "Setup - No development environment" }, { "text": "Runtime Error\ndbt was unable to connect to the specified database.\nThe database returned the following error:\n>Database Error\nAccess Denied: Project <project_name>: User does not have bigquery.jobs.create permission in project <project_name>.\nCheck your database credentials and try again. For more information, visit:\nhttps://docs.getdbt.com/docs/configure-your-profile\nSteps to resolve error in Google Cloud:\n1. Navigate to IAM & Admin and select IAM\n2. Click Grant Access if your newly created dbt service account isn't listed\n3. In New principals field, add your service account\n4. Select a Role and search for BigQuery Job User to add\n5. Go back to dbt cloud project setup and Test your connection\n6. Note: Also add BigQuery Data Owner, Storage Object Admin, & Storage Admin to prevent permission issues later in the course", "section": "Module 4: analytics engineering with dbt", "question": "Setup - Connecting dbt Cloud with BigQuery Error" }, { "text": "error: This dbt Cloud run was cancelled because a valid dbt project was not found. Please check that the repository contains a proper dbt_project.yml config file. If your dbt project is located in a subdirectory of the connected repository, be sure to specify its location on the Project settings page in dbt Cloud", "section": "Module 4: analytics engineering with dbt", "question": "Dbt build error" }, { "text": "Error: Failed to clone repository.\ngit clone git@github.com:DataTalksClub/data-engineering-zoomcamp.git /usr/src/develop/\u2026\nCloning into '/usr/src/develop/...\nWarning: Permanently added 'github.com,140.82.114.4' (ECDSA) to the list of known hosts.\ngit@github.com: Permission denied (publickey).\nfatal: Could not read from remote repository.\nIssue: You don\u2019t have permissions to write to DataTalksClub/data-engineering-zoomcamp.git\nSolution 1: Clone the repository and use this forked repo, which contains your github username. 
Then, proceed to specify the path, as in:\n[your github username]/data-engineering-zoomcamp.git\nSolution 2: create a fresh repo for dbt-lessons. We\u2019d need to do branching and PRs in this lesson, so it might be a good idea to also not mess up your whole other repo. Then you don\u2019t have to create a subfolder for the dbt project files.\nSolution 3: Use the https link", "section": "Module 4: analytics engineering with dbt", "question": "Setup - Failed to clone repository." }, { "text": "Solution:\nCheck if you\u2019re on the Developer Plan. As per the prerequisites, you'll need to be enrolled in the Team Plan or Enterprise Plan to set up a CI Job in dbt Cloud.\nSo if you're on the Developer Plan, you'll need to upgrade to utilise CI Jobs.\nNote from another user: I\u2019m on the Team Plan (trial period) but the option is still disabled. What worked for me instead was this. It works for the Developer (free) plan.", "section": "Module 4: analytics engineering with dbt", "question": "dbt job - Triggered by pull requests is disabled when I try to create a new Continuous Integration job in dbt cloud." }, { "text": "Issue: The dbt Cloud IDE loads indefinitely and then gives you this error.\nSolution: check the dbt_cloud_setup.md file, create an SSH key, use git clone to import the repo into the dbt project, and copy and paste the deploy key back into your repo settings.", "section": "Module 4: analytics engineering with dbt", "question": "Setup - Your IDE session was unable to start. Please contact support." }, { "text": "Issue: If you don\u2019t define the column format while converting from csv to parquet, Python will \u201cchoose\u201d based on the first rows.\n\u2705Solution: Define the schema while running the web_to_gcs.py pipeline.\nSebastian adapted the script:\nhttps://github.com/sebastian2296/data-engineering-zoomcamp/blob/main/week_4_analytics_engineering/web_to_gcs.py\nA quick change is needed to make the file work with gz files; add the following lines (and don\u2019t forget to delete the file at the end of each iteration of the loop to avoid any disk space problems):\nfile_name_gz = f\"{service}_tripdata_{year}-{month}.csv.gz\"\nopen(file_name_gz, 'wb').write(r.content)\nos.system(f\"gzip -d {file_name_gz}\")\nos.system(f\"rm {file_name_init}.*\")\nSame ERROR - When running dbt run for fact_trips.sql, the task failed with error:\n\u201cParquet column 'ehail_fee' has type DOUBLE which does not match the target cpp_type INT64\u201d\nReason: Parquet files have their own schema. Some parquet files for green data have records with decimals in the ehail_fee column.\nThere are some possible fixes:\nDrop the ehail_fee column since it is not really used.
For instance when creating a partitioned table from the external table in BigQuery\nSELECT * EXCEPT (ehail_fee) FROM\u2026\nModify stg_green_tripdata.sql model using this line cast(0 as numeric) as ehail_fee.\nModify Airflow dag to make the conversion and avoid the error.\npv.read_csv(src_file, convert_options=pv.ConvertOptions(column_types = {'ehail_fee': 'float64'}))\nSame type of ERROR - parquet files with different data types - Fix it with pandas\nHere is another possibility that could be interesting:\nYou can specify the dtypes when importing the file from csv to a dataframe with pandas\npd.from_csv(..., dtype=type_dict)\nOne obstacle is that the regular int64 pandas use (I think this is from the numpy library) does not accept null values (NaN, not a number). But you can use the pandas Int64 instead, notice capital \u2018I\u2019. The type_dict is a python dictionary mapping the column names to the dtypes.\nSources:\nhttps://pandas.pydata.org/docs/reference/api/pandas.read_csv.html\nNullable integer data type \u2014 pandas 1.5.3 documentation", "section": "Module 4: analytics engineering with dbt", "question": "DBT - I am having problems with columns datatype while running DBT/BigQuery" }, { "text": "If the provided URL isn\u2019t working for you (https://nyc-tlc.s3.amazonaws.com/trip+data/):\nWe can use the GitHub CLI to easily download the needed trip data from https://github.com/DataTalksClub/nyc-tlc-data, and manually upload to a GCS bucket.\nInstructions on how to download the CLI here: https://github.com/cli/cli\nCommands to use:\ngh auth login\ngh release list -R DataTalksClub/nyc-tlc-data\ngh release download yellow -R DataTalksClub/nyc-tlc-data\ngh release download green -R DataTalksClub/nyc-tlc-data\netc.\nNow you can upload the files to a GCS bucket using the GUI.", "section": "Module 4: analytics engineering with dbt", "question": "Ingestion: When attempting to use the provided quick script to load trip data into GCS, you receive error Access Denied from the S3 bucket" }, { "text": "R: This conversion is needed for the question 3 of homework, in order to process files for fhv data. 
The error is:\npyarrow.lib.ArrowInvalid: CSV parse error: Expected 7 columns, got 1: B02765\nCause: Some random line breaks in this particular file.\nFixed by opening a bash in the container executing the dag and manually running the following command that deletes all \\n not preceded by \\r.\nperl -i -pe 's/(?<!\\r)\\n/\\1/g' fhv_tripdata_2020-01.csv\nAfter that, clear the failed task in Airflow to force re-execution.", "section": "Module 4: analytics engineering with dbt", "question": "Ingestion - Error thrown by format_to_parquet_task when converting fhv_tripdata_2020-01.csv using Airflow" }, { "text": "I initially followed data-engineering-zoomcamp/03-data-warehouse/extras/web_to_gcs.py at main \u00b7 DataTalksClub/data-engineering-bootcamp (github.com)\nBut it was taking forever for the yellow trip data and when I tried to download and upload the parquet files directly to GCS, that works fine but when creating the Bigquery table, there was a schema inconsistency issue\nThen I found another hack shared in the slack which was suggested by Victoria.\n[Optional] Hack for loading data to BigQuery for Week 4 - YouTube\nPlease watch until the end as there is few schema changes required to be done", "section": "Module 4: analytics engineering with dbt", "question": "Hack to load yellow and green trip data for 2019 and 2020" }, { "text": "\u201cgs\\storage_link\\*.parquet\u201d need to be added in destination folder", "section": "Module 4: analytics engineering with dbt", "question": "Move many files (more than one) from Google cloud storage bucket to Big query" }, { "text": "One common cause experienced is lack of space after running prefect several times. When running prefect, check the folder \u2018.prefect/storage\u2019 and delete the logs now and then to avoid the problem.", "section": "Module 4: analytics engineering with dbt", "question": "GCP VM - All of sudden ssh stopped working for my VM after my last restart" }, { "text": "You can try to do this steps:", "section": "Module 4: analytics engineering with dbt", "question": "GCP VM - If you have lost SSH access to your machine due to lack of space. Permission denied (publickey)" }, { "text": "R: Go to BigQuery, and check the location of BOTH\nThe source dataset (trips_data_all), and\nThe schema you\u2019re trying to write to (name should be \tdbt_<first initial><last name> (if you didn\u2019t change the default settings at the end when setting up your project))\nLikely, your source data will be in your region, but the write location will be a multi-regional location (US in this example). Delete these datasets, and recreate them with your specified region and the correct naming format.\nAlternatively, instead of removing datasets, you can specify the single-region location you are using. E.g. instead of \u2018location: US\u2019, specify the region, so \u2018location: US-east1\u2019. See this Github comment for more detail. Additionally please see this post of Sandy\nIn DBT cloud you can actually specify the location using the following steps:\nGPo to your profile page (top right drop-down --> profile)\nThen go to under Credentials --> Analytics (you may have customised this name)\nClick on Bigquery >\nHit Edit\nUpdate your location, you may need to re-upload your service account JSON to re-fetch your private key, and save. 
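If you want to double-check the locations from code rather than from the BigQuery console, here is a minimal sketch using the google-cloud-bigquery client (the dataset names below are placeholders):\nfrom google.cloud import bigquery\n# assumes GOOGLE_APPLICATION_CREDENTIALS points at your service account key\nclient = bigquery.Client()\nprint(client.get_dataset('trips_data_all').location)\nprint(client.get_dataset('dbt_jdoe').location)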
(NOTE: be sure to exactly copy the region BigQuery specifies your dataset is in.)", "section": "Module 4: analytics engineering with dbt", "question": "404 Not found: Dataset eighth-zenith-372015:trip_data_all was not found in location us-west1" }, { "text": "Error: `dbt_utils.surrogate_key` has been replaced by `dbt_utils.generate_surrogate_key`\nFix:\nReplace dbt_utils.surrogate_key with dbt_utils.generate_surrogate_key in stg_green_tripdata.sql\nWhen executing dbt run after fact_trips.sql has been created, the task failed with error:\nR: \u201cAccess Denied: BigQuery BigQuery: Permission denied while globbing file pattern.\u201d\n1. Fixed by adding the Storage Object Viewer role to the service account in use in BigQuery.\n2. Add the related roles to the service account in use in GCS.", "section": "Module 4: analytics engineering with dbt", "question": "When executing dbt run after installing dbt-utils latest version i.e., 1.0.0 warning has generated" }, { "text": "You need to create packages.yml file in main project directory and add packages\u2019 meta data:\npackages:\n- package: dbt-labs/dbt_utils\nversion: 0.8.0\nAfter creating file run:\nAnd hit enter.", "section": "Module 4: analytics engineering with dbt", "question": "When You are getting error dbt_utils not found" }, { "text": "Ensure you properly format your yml file. Check the build logs if the run was completed successfully. You can expand the command history console (where you type the --vars '{'is_test_run': 'false'}') and click on any stage\u2019s logs to expand and read errors messages or warnings.", "section": "Module 4: analytics engineering with dbt", "question": "Lineage is currently unavailable. Check that your project does not contain compilation errors or contact support if this error persists." }, { "text": "Make sure you use:\ndbt run --var \u2018is_test_run: false\u2019 or\ndbt build --var \u2018is_test_run: false\u2019\n(watch out for formatted text from this document: re-type the single quotes). If that does not work, use --vars '{'is_test_run': 'false'}' with each phrase separately quoted.", "section": "Module 4: analytics engineering with dbt", "question": "Build - Why do my Fact_trips only contain a few days of data?" }, { "text": "Check if you specified if_exists argument correctly when writing data from GCS to BigQuery. When I wrote my automated flow for each month of the years 2019 and 2020 for green and yellow data I had specified if_exists=\"replace\" while I was experimenting with the flow setup. Once you want to run the flow for all months in 2019 and 2020 make sure to set if_exists=\"append\"\nif_exists=\"replace\" will replace the whole table with only the month data that you are writing into BigQuery in that one iteration -> you end up with only one month in BigQuery (the last one you inserted)\nif_exists=\"append\" will append the new monthly data -> you end up with data from all months", "section": "Module 4: analytics engineering with dbt", "question": "Build - Why do my fact_trips only contain one month of data?" }, { "text": "R: After the second SELECT, change this line:\ndate_trunc('month', pickup_datetime) as revenue_month,\nTo this line:\ndate_trunc(pickup_datetime, month) as revenue_month,\nMake sure that \u201cmonth\u201d isn\u2019t surrounded by quotes!", "section": "Module 4: analytics engineering with dbt", "question": "BigQuery returns an error when I try to run the dm_monthly_zone_revenue.sql model." 
}, { "text": "For this instead:\n{{ dbt_utils.generate_surrogate_key([ \n field_a, \n field_b, \n field_c,\n \u2026,\n field_z\n]) }}", "section": "Module 4: analytics engineering with dbt", "question": "Replace: \n{{ dbt_utils.surrogate_key([ \n field_a, \n field_b, \n field_c,\n \u2026,\n field_z \n]) }}" }, { "text": "Remove the dataset from BigQuery which was created by dbt and run dbt run again so that it will recreate the dataset in BigQuery with the correct location", "section": "Module 4: analytics engineering with dbt", "question": "I changed location in dbt, but dbt run still gives me an error" }, { "text": "Remove the dataset from BigQuery created by dbt and run again (with test disabled) to ensure the dataset created has all the rows.\nDBT - Why am I getting a new dataset after running my CI/CD Job? / What is this new dbt dataset in BigQuery?\nAnswer: when you create the CI/CD job, under \u2018Compare Changes against an environment (Deferral) make sure that you select \u2018 No; do not defer to another environment\u2019 - otherwise dbt won\u2019t merge your dev models into production models; it will create a new environment called \u2018dbt_cloud_pr_number of pull request\u2019", "section": "Module 4: analytics engineering with dbt", "question": "I ran dbt run without specifying variable which gave me a table of 100 rows. I ran again with the variable value specified but my table still has 100 rows in BQ." }, { "text": "Vic created three different datasets in the videos.. dbt_<name> was used for development and you used a production dataset for the production environment. What was the use for the staging dataset?\nR: Staging, as the name suggests, is like an intermediate between the raw datasets and the fact and dim tables, which are the finished product, so to speak. You'll notice that the datasets in staging are materialised as views and not tables.\nVic didn't use it for the project, you just need to create production and dbt_name + trips_data_all that you had already.", "section": "Module 4: analytics engineering with dbt", "question": "Why do we need the Staging dataset?" }, { "text": "Try removing the \u201cnetwork: host\u201d line in docker-compose.", "section": "Module 4: analytics engineering with dbt", "question": "DBT Docs Served but Not Accessible via Browser" }, { "text": "Go to Account settings >> Project >> Analytics >> Click on your connection >> go all the way down to Location and type in the GCP location just as displayed in GCP (e.g. europe-west6). You might need to reupload your GCP key.\nDelete your dataset in GBQ\nRebuild project: dbt build\nNewly built dataset should be in the correct location", "section": "Module 4: analytics engineering with dbt", "question": "BigQuery adapter: 404 Not found: Dataset was not found in location europe-west6" }, { "text": "Create a new branch to edit. More on this can be found here in the dbt docs.", "section": "Module 4: analytics engineering with dbt", "question": "Dbt+git - Main branch is \u201cread-only\u201d" }, { "text": "Create a new branch for development, then you can merge it to the main branch\nCreate a new branch and switch to this branch. It allows you to make changes. Then you can commit and push the changes to the \u201cmain\u201d branch.", "section": "Module 4: analytics engineering with dbt", "question": "Dbt+git - It appears that I can't edit the files because I'm in read-only mode. Does anyone know how I can change that?" 
}, { "text": "Error:\nTriggered by pull requests\nThis feature is only available for dbt repositories connected through dbt Cloud's native integration with Github, Gitlab, or Azure DevOps\nSolution: Contrary to the guide on DTC repo, don\u2019t use the Git Clone option. Use the Github one instead. Step-by-step guide to UN-LINK Git Clone and RE-LINK with Github in the next entry below", "section": "Module 4: analytics engineering with dbt", "question": "Dbt deploy + Git CI - cannot create CI checks job for deployment to Production. See more discussion in slack chat" }, { "text": "If you\u2019re trying to configure CI with Github and on the job\u2019s options you can\u2019t see Run on Pull Requests? on triggers, you have to reconnect with Github using native connection instead clone by SSH. Follow these steps:\nOn Profile Settings > Linked Accounts connect your Github account with dbt project allowing the permissions asked. More info at https://docs.getdbt.com/docs/collaborate/git/connect-gith\nDisconnect your current Github\u2019s configuration from Account Settings > Projects (analytics) > Github connection. At the bottom left appears the button Disconnect, press it.\nOnce we have confirmed the change, we can configure it again. This time, choose Github and it will appear in all repositories which you have allowed to work with dbt. Select your repository and it\u2019s ready.\nGo to the Deploy > job configuration\u2019s page and go down until Triggers and now you can see the option Run on Pull Requests:", "section": "Module 4: analytics engineering with dbt", "question": "Dbt deploy + Git CI - Unable to configure Continuous Integration (CI) with Github" }, { "text": "If you're following video DE Zoomcamp 4.3.1 - Building the First DBT Models, you may have encountered an issue at 14:25 where the Lineage graph isn't displayed and a Compilation Error occurs, as shown in the attached image. Don't worry - a quick fix for this is to simply save your schema.yml file. Once you've done this, you should be able to view your Lineage graph without any further issues.", "section": "Module 4: analytics engineering with dbt", "question": "Compilation Error (Model 'model.my_new_project.stg_green_tripdata' (models/staging/stg_green_tripdata.sql) depends on a source named 'staging.green_trip_external' which was not found)" }, { "text": "> in macro test_accepted_values (tests/generic/builtin.sql)\n> called by test accepted_values_stg_green_tripdata_Payment_type__False___var_payment_type_values_ (models/staging/schema.yml)\nRemember that you have to add to dbt_project.yml the vars:\nvars:\npayment_type_values: [1, 2, 3, 4, 5, 6]", "section": "Module 4: analytics engineering with dbt", "question": "'NoneType' object is not iterable" }, { "text": "You will face this issue if you copied and pasted the exact macro directly from data-engineering-zoomcamp repo.\nBigQuery adapter: Retry attempt 1 of 1 after error: BadRequest('No matching signature for operator CASE for argument types: STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, NULL at [35:5]; reason: invalidQuery, location: query, message: No matching signature for operator CASE for argument types: STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, INT64, STRING, NULL at [35:5]')\nWhat you\u2019d have to do is to change the data type of the numbers (1, 2, 3 etc.) 
to text by inserting \u2018\u2019, as the initial \u2018payment_type\u2019 data type should be string (Note: I extracted and loaded the green trips data using Google BQ Marketplace)\n{#\nThis macro returns the description of the payment_type\n#}\n{% macro get_payment_type_description(payment_type) -%}\ncase {{ payment_type }}\nwhen '1' then 'Credit card'\nwhen '2' then 'Cash'\nwhen '3' then 'No charge'\nwhen '4' then 'Dispute'\nwhen '5' then 'Unknown'\nwhen '6' then 'Voided trip'\nend\n{%- endmacro %}", "section": "Module 4: analytics engineering with dbt", "question": "dbt macro errors with get_payment_type_description(payment_type)" }, { "text": "The dbt error log contains a link to BigQuery. When you follow it you will see your query and the problematic line will be highlighted.", "section": "Module 4: analytics engineering with dbt", "question": "Troubleshooting in dbt:" }, { "text": "It is the default behaviour of dbt to append the custom schema to the initial schema. To override this behaviour simply create a macro named \u201cgenerate_schema_name.sql\u201d:\n{% macro generate_schema_name(custom_schema_name, node) -%}\n{%- set default_schema = target.schema -%}\n{%- if custom_schema_name is none -%}\n{{ default_schema }}\n{%- else -%}\n{{ custom_schema_name | trim }}\n{%- endif -%}\n{%- endmacro %}\nNow you can override the default custom schema in \u201cdbt_project.yml\u201d:", "section": "Module 4: analytics engineering with dbt", "question": "Why changing the target schema to \u201cmarts\u201d actually creates a schema named \u201cdbt_marts\u201d instead?" }, { "text": "There is a project setting which allows you to set `Project subdirectory` in dbt cloud:", "section": "Module 4: analytics engineering with dbt", "question": "How to set subdirectory of the github repository as the dbt project root" }, { "text": "Remember that you should modify your .sql models accordingly, to read from existing table names in the BigQuery/postgres db\nExample: select * from {{ source('staging', '<your table name in the database>') }}", "section": "Module 4: analytics engineering with dbt", "question": "Compilation Error : Model 'model.XXX' (models/<model_path>/XXX.sql) depends on a source named '<a table name>' which was not found" }, { "text": "Make sure that you create a pull request from your Development branch to the Production branch (main by default). After that, check in your \u2018seeds\u2019 folder if the seed file is inside it.\nAnother thing to check is your .gitignore file. Make sure that the .csv extension is not included.", "section": "Module 4: analytics engineering with dbt", "question": "Compilation Error : Model '<model_name>' (<model_path>) depends on a node named '<seed_name>' which was not found (Production Environment)" }, { "text": "1. Go to your dbt cloud service account\n2. Add the [Storage Object Admin, Storage Admin] roles in addition to BigQuery Admin.", "section": "Module 4: analytics engineering with dbt", "question": "When executing dbt run after using fhv_tripdata as an external table: you get \u201cAccess Denied: BigQuery BigQuery: Permission denied\u201d" }, { "text": "Problem: when ingesting data to BigQuery, you may face a type error. This is because pandas by default will parse integer columns with missing values as float type.\nSolution:\nOne way to solve this problem is to specify/cast the Int64 data type during the data transformation stage (a sketch of this explicit approach is shown below).\nHowever, you may be lazy to type all the int columns. 
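A minimal sketch of the explicit approach (the column names and file name here are only examples, adjust them to your dataset):\nimport pandas as pd\n# 'Int64' (capital I) is the pandas nullable integer dtype, so missing values are preserved instead of failing\ntype_dict = {'passenger_count': 'Int64', 'payment_type': 'Int64', 'trip_type': 'Int64'}\ndf = pd.read_csv('green_tripdata_2019-01.csv.gz', dtype=type_dict)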
If that is the case, you can simply use convert_dtypes to infer the data type\n# Make pandas to infer correct data type (as pandas parse int with missing as float)\ndf.fillna(-999999, inplace=True)\ndf = df.convert_dtypes()\ndf = df.replace(-999999, None)", "section": "Module 4: analytics engineering with dbt", "question": "How to automatically infer the column data type (pandas missing value issues)?" }, { "text": "Seed files loaded from directory with name \u2018seed\u2019, that\u2019s why you should rename dir with name \u2018data\u2019 to \u2018seed\u2019", "section": "Module 4: analytics engineering with dbt", "question": "When loading github repo raise exception that \u2018taxi_zone_lookup\u2019 not found" }, { "text": "Check the .gitignore file and make sure you don\u2019t have *.csv in it\n\nDbt error 404 was not found in location\nMy specific error:\nRuntime Error in rpc request (from remote system.sql) 404 Not found: Table dtc-de-0315:trips_data_all.green_tripdata_partitioned was not found in location europe-west6 Location: europe-west6 Job ID: 168ee9bd-07cd-4ca4-9ee0-4f6b0f33897c\nMake sure all of your datasets have the correct region and not a generalised region:\nEurope-west6 as opposed to EU\n\nMatch this in dbt settings:\ndbt -> projects -> optional settings -> manually set location to match", "section": "Module 4: analytics engineering with dbt", "question": "\u2018taxi_zone_lookup\u2019 not found" }, { "text": "The easiest way to avoid these errors is by ingesting the relevant data in a .csv.gz file type. Then, do:\nCREATE OR REPLACE EXTERNAL TABLE `dtc-de.trips_data_all.fhv_tripdata`\nOPTIONS (\nformat = 'CSV',\nuris = ['gs://dtc_data_lake_dtc-de-updated/data/fhv_all/fhv_tripdata_2019-*.csv.gz']\n);\nAs an example. You should no longer have any data type issues for week 4.", "section": "Module 4: analytics engineering with dbt", "question": "Data type errors when ingesting with parquet files" }, { "text": "This is due to the way the deduplication is done in the two staging files.\nSolution: add order by in the partition by part of both staging files. Keep adding columns to order by until the number of rows in the fact_trips table is consistent when re-running the fact_trips model.\nExplanation (a bit convoluted, feel free to clarify, correct etc.)\nWe partition by vendor id and pickup_datetime and choose the first row (rn=1) from all these partitions. These partitions are not ordered, so every time we run this, the first row might be a different one. Since the first row is different between runs, it might or might not contain an unknown borough. 
Then, in the fact_trips model we will discard a different number of rows when we discard all values with an unknown borough.", "section": "Module 4: analytics engineering with dbt", "question": "Inconsistent number of rows when re-running fact_trips model" }, { "text": "If you encounter data type error on trip_type column, it may due to some nan values that isn\u2019t null in bigquery.\nSolution: try casting it to FLOAT datatype instead of NUMERIC", "section": "Module 4: analytics engineering with dbt", "question": "Data Type Error when running fact table" }, { "text": "This error could result if you are using some select * query without mentioning the name of table for ex:\nwith dim_zones as (\nselect * from `engaged-cosine-374921`.`dbt_victoria_mola`.`dim_zones`\nwhere borough != 'Unknown'\n),\nfhv as (\nselect * from `engaged-cosine-374921`.`dbt_victoria_mola`.`stg_fhv_tripdata`\n)\nselect * from fhv\ninner join dim_zones as pickup_zone\non fhv.PUlocationID = pickup_zone.locationid\ninner join dim_zones as dropoff_zone\non fhv.DOlocationID = dropoff_zone.locationid\n);\nTo resolve just replace use : select fhv.* from fhv", "section": "Module 4: analytics engineering with dbt", "question": "CREATE TABLE has columns with duplicate name locationid." }, { "text": "Some ehail fees are null and casting them to integer gives Bad int64 value: 0.0 error,\nSolution:\nUsing safe_cast returns NULL instead of throwing an error. So use safe_cast from dbt_utils function in the jinja code for casting into integer as follows:\n{{ dbt_utils.safe_cast('ehail_fee', api.Column.translate_type(\"integer\"))}} as ehail_fee,\nCan also just use safe_cast(ehail_fee as integer) without relying on dbt_utils.", "section": "Module 4: analytics engineering with dbt", "question": "Bad int64 value: 0.0 error" }, { "text": "You might encounter this when building the fact_trips.sql model. The issue may be with the payment_type_description field.\nUsing safe_cast as above, would cause the entire field to become null. A better approach is to drop the offending decimal place, then cast to integer.\ncast(replace({{ payment_type }},'.0','') as integer)\nBad int64 value: 1.0 error (again)\n\nI found that there are more columns causing the bad INT64: ratecodeid and trip_type on Green_tripdata table.\nYou can use the queries below to address them:\nCAST(\nREGEXP_REPLACE(CAST(rate_code AS STRING), r'\\.0', '') AS INT64\n) AS ratecodeid,\nCAST(\nCASE\nWHEN REGEXP_CONTAINS(CAST(trip_type AS STRING), r'\\.\\d+') THEN NULL\nELSE CAST(trip_type AS INT64)\nEND AS INT64\n) AS trip_type,", "section": "Module 4: analytics engineering with dbt", "question": "Bad int64 value: 2.0/1.0 error" }, { "text": "The two solution above don\u2019t work for me - I used the line below in `stg_green_trips.sql` to replace the original ehail_fee line:\n`{{ dbt.safe_cast('ehail_fee', api.Column.translate_type(\"numeric\"))}} as ehail_fee,`", "section": "Module 4: analytics engineering with dbt", "question": "DBT - Error on building fact_trips.sql: Parquet column 'ehail_fee' has type DOUBLE which does not match the target cpp_type INT64. File: gs://<gcs bucket>/<table>/green_taxi_2019-01.parquet\")" }, { "text": "Remember to add a space between the variable and the value. 
Otherwise, it won't be interpreted as a dictionary.\nIt should be:\ndbt run --var 'is_test_run: false'", "section": "Module 4: analytics engineering with dbt", "question": "The - vars argument must be a YAML dictionary, but was of type str" }, { "text": "You don't need to change the environment type. If you are following the videos, you are creating a Production Deployment, so the only available option is the correct one.'", "section": "Module 4: analytics engineering with dbt", "question": "Not able to change Environment Type as it is greyed out and inaccessible" }, { "text": "Database Error in model stg_yellow_tripdata (models/staging/stg_yellow_tripdata.sql)\nAccess Denied: Table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata: User does not have permission to query table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata, or perhaps it does not exist in location US.\ncompiled Code at target/run/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql\nIn my case, I was set up in a different branch, so always check the branch you are working on. Change the 04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml file in the\nsources:\n- name: staging\ndatabase: your_database_name\nIf this error will continue when running dbt job, As for changing the branch for your job, you can use the \u2018Custom Branch\u2019 settings in your dbt Cloud environment. This allows you to run your job on a different branch than the default one (usually main). To do this, you need to:\nGo to an environment and select Settings to edit it\nSelect Only run on a custom branch in General settings\nEnter the name of your custom branch (e.g. HW)\nClick Save\nCould not parse the dbt project. please check that the repository contains a valid dbt project\nRunning the Environment on the master branch causes this error, you must activate \u201cOnly run on a custom branch\u201d checkbox and specify the branch you are working when Environment is setup.", "section": "Module 4: analytics engineering with dbt", "question": "Access Denied: Table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata: User does not have permission to query table taxi-rides-ny-339813-412521:trips_data_all.yellow_tripdata, or perhaps it does not exist in location US." }, { "text": "Change to main branch, make a pull request from the development branch.\nNote: this will take you to github.\nApprove the merging and rerun you job, it would work as planned now", "section": "Module 4: analytics engineering with dbt", "question": "Made change to your modelling files and commit the your development branch, but Job still runs on old file?" }, { "text": "Before you can develop some data model on dbt, you should create development environment and set some parameter on it. After the model being developed, we should also create deployment environment to create and run some jobs.", "section": "Module 4: analytics engineering with dbt", "question": "Setup - I\u2019ve set Github and Bigquery to dbt successfully. Why nothing showed in my Develop tab?" }, { "text": "Error Message:\nInvestigate Sentry error: ProtocolError \"Invalid input ConnectionInputs.SEND_HEADERS in state ConnectionState.CLOSED\"\nSolution:\nreference\nRun it again because it happens sometimes. 
Or wait a few minutes, it will continue.", "section": "Module 4: analytics engineering with dbt", "question": "Prefect Agent retrieving runs from queue sometimes fails with httpx.LocalProtocolError" }, { "text": "My taxi data was loaded into gcs with etl_web_to_gcs.py script that converts csv data into parquet. Then I placed raw data trips into external tables and when I executed dbt run I got an error message: Parquet column 'passenger_count' has type INT64 which does not match the target cpp_type DOUBLE. It is because several columns in files have different formats of data.\nWhen I added df[col] = df[col].astype('Int64') transformation to the columns: passenger_count, payment_type, RatecodeID, VendorID, trip_type it went ok. Several people also faced this error and more about it you can read on the slack channel.", "section": "Module 4: analytics engineering with dbt", "question": "BigQuery returns an error when i try to run \u2018dbt run\u2019:" }, { "text": "Use the syntax below instead if the code in the tutorial is not working.\ndbt run --select stg_green_tripdata --vars '{\"is_test_run\": false}'", "section": "Module 4: analytics engineering with dbt", "question": "Running dbt run --models stg_green_tripdata --var 'is_test_run: false' is not returning anything:" }, { "text": "Following dbt with BigQuery on Docker readme.md, after `docker-compose build` and `docker-compose run dbt-bq-dtc init`, encountered error `ModuleNotFoundError: No module named 'pytz'`\nSolution:\nAdd `RUN python -m pip install --no-cache pytz` in the Dockerfile under `FROM --platform=$build_for python:3.9.9-slim-bullseye as base`", "section": "Module 4: analytics engineering with dbt", "question": "DBT - Error: No module named 'pytz' while setting up dbt with docker" }, { "text": "If you have problems editing dbt_project.yml when using Docker after \u2018docker-compose run dbt-bq-dtc init\u2019, to change profile \u2018taxi_rides_ny\u2019 to 'bq-dbt-workshop\u2019, just run:\nsudo chown -R username path\nDBT - Internal Error: Profile should not be None if loading is completed\nWhen running dbt debug, change the directory to the newly created subdirectory (e.g: the newly created `taxi_rides_ny` directory, which contains the dbt project).", "section": "Module 4: analytics engineering with dbt", "question": "\u200b\u200bVS Code: NoPermissions (FileSystemError): Error: EACCES: permission denied (linux)" }, { "text": "When running a query on BigQuery sometimes could appear a this table is not on the specified location error.\nFor this problem there is not a straightforward solution, you need to dig a little, but the problem could be one of these:\nCheck the locations of your bucket, datasets and tables. 
Make sure they are all on the same one.\nChange the query settings to the location you are in: on the query window select more -> query settings -> select the location\nCheck if all the paths you are using in your query to your tables are correct: you can click on the table -> details -> and copy the path.", "section": "Module 4: analytics engineering with dbt", "question": "Google Cloud BigQuery Location Problems" }, { "text": "This happens because we have moved the dbt project to another directory on our repo.\nOr might be that you\u2019re on a different branch than is expected to be merged from / to.\nSolution:\nGo to the projects window on dbt cloud -> settings -> edit -> and add directory (the extra path to the dbt project)\nFor example:\n/week5/taxi_rides_ny\nMake sure your file explorer path and this Project settings path matches and there\u2019s no files waiting to be committed to github if you\u2019re running the job to deploy to PROD.\nAnd that you had setup the PROD environment to check in the main branch, or whichever you specified.\nIn the picture below, I had set it to ella2024 to be checked as \u201cproduction-ready\u201d by the \u201cfreshness\u201d check mark at the PROD environment settings. So each time I merge a branch from something else into ella2024 and then trigger the PR, the CI check job would kick-in. But we still do need to Merge and close the PR manually, I believe, that part is not automated.\nYou set up the PROD custom branch (if not default main) in the Environment setup screen.", "section": "Module 4: analytics engineering with dbt", "question": "DBT Deploy - This dbt Cloud run was cancelled because a valid dbt project was not found." }, { "text": "When you are creating the pull request and running the CI, dbt is creating a new schema on BIgQuery. By default that new schema will be created on \u2018US\u2019 location, if you have your dataset, schemas and tables on \u2018EU\u2019 that will generate an error and the pull request will not be accepted. To change that location to \u2018EU\u2019 on the connection to BigQuery from dbt we need to add the location \u2018EU\u2019 on the connection optional settings:\nDbt -> project -> settings -> connection BIgQuery -> OPtional Settings -> Location -> EU", "section": "Module 4: analytics engineering with dbt", "question": "DBT Deploy + CI - Location Problems on BigQuery" }, { "text": "When running trying to run the dbt project on prod there is some things you need to do and check on your own:\nFirst Make the pull request and Merge the branch into the main.\nMake sure you have the latest version, if you made changes to the repo in another place.\nCheck if the dbt_project.yml file is accessible to the project, if not check this solution (Dbt: This dbt Cloud run was cancelled because a valid dbt project was not found.).\nCheck if the name you gave to the dataset on BigQuery is the same you put on the dataset spot on the production environment created on dbt cloud.", "section": "Module 4: analytics engineering with dbt", "question": "DBT Deploy - Error When trying to run the dbt project on Prod" }, { "text": "In the step in this video (DE Zoomcamp 4.3.1 - Build the First dbt Models), after creating `stg_green_tripdata.sql` and clicking `build`, I encountered an error saying dataset not found in location EU. 
The default location for dbt Bigquery is the US, so when generating the new Bigquery schema for dbt, unless specified, the schema locates in the US.\nSolution:\nTurns out I forgot to specify Location to be `EU` when adding connection details.\nDevelop -> Configure Cloud CLI -> Projects -> taxi_rides_ny -> (connection) Bigquery -> Edit -> Location (Optional) -> type `EU` -> Save", "section": "Module 4: analytics engineering with dbt", "question": "DBT - Error: \u201c404 Not found: Dataset <dataset_name>:<dbt_schema_name> was not found in location EU\u201d after building from stg_green_tripdata.sql" }, { "text": "Issue: If you\u2019re having problems loading the FHV_20?? data from the github repo into GCS and then into BQ (input file not of type parquet), you need to do two things. First, append the URL Template link with \u2018?raw=true\u2019 like so:\nURL_TEMPLATE = URL_PREFIX + \"/fhv_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet?raw=true\"\nSecond, update make sure the URL_PREFIX is set to the following value:\n\nURL_PREFIX = \"https://github.com/alexeygrigorev/datasets/blob/master/nyc-tlc/fhv\"\nIt is critical that you use this link with the keyword blob. If your link has \u2018tree\u2019 here, replace it. Everything else can stay the same, including the curl -sSLf command. \u2018", "section": "Module 4: analytics engineering with dbt", "question": "Homework - Ingesting FHV_20?? data" }, { "text": "I found out that the easies way to upload datasets form github for the homework is utilising this script git_csv_to_gcs.py. Thank you Lidia!!\nIt is similar to a script that Alexey provided us in 03-data-warehouse/extras/web_to_gcs.py", "section": "Module 4: analytics engineering with dbt", "question": "Homework - Ingesting NYC TLC Data" }, { "text": "If you have to securely put your credentials for a project and, probably, push it to a git repository then the best option is to use an environment variable\nFor example for web_to_gcs.py or git_csv_to_gcs.py we have to set these variables:\nGOOGLE_APPLICATION_CREDENTIALS\nGCP_GCS_BUCKET\nThe easises option to do it is to use .env (dotenv).\nInstall it and add a few lines of code that inject these variables for your project\npip install python-dotenv\nfrom dotenv import load_dotenv\nimport os\n# Load environment variables from .env file\nload_dotenv()\n# Now you can access environment variables like GCP_GCS_BUCKET and GOOGLE_APPLICATION_CREDENTIALS\ncredentials_path = os.getenv(\"GOOGLE_APPLICATION_CREDENTIALS\")\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\")", "section": "Module 4: analytics engineering with dbt", "question": "How to set environment variable easily for any credentials" }, { "text": "If you uploaded manually the fvh 2019 csv files, you may face errors regarding date types. 
Try to create the external table in BigQuery but define the pickup_datetime and dropoff_datetime columns as strings:\nCREATE OR REPLACE EXTERNAL TABLE `gcp_project.trips_data_all.fhv_tripdata` (\ndispatching_base_num STRING,\npickup_datetime STRING,\ndropoff_datetime STRING,\nPUlocationID STRING,\nDOlocationID STRING,\nSR_Flag STRING,\nAffiliated_base_number STRING\n)\nOPTIONS (\nformat = 'csv',\nuris = ['gs://bucket/*.csv']\n);\nThen when creating the fhv core model in dbt, use TIMESTAMP(CAST(... AS STRING)) to ensure the value is first parsed as a string and then converted to a timestamp.\nwith fhv_tripdata as (\nselect * from {{ ref('stg_fhv_tripdata') }}\n),\ndim_zones as (\nselect * from {{ ref('dim_zones') }}\nwhere borough != 'Unknown'\n)\nselect fhv_tripdata.dispatching_base_num,\nTIMESTAMP(CAST(fhv_tripdata.pickup_datetime AS STRING)) AS pickup_datetime,\nTIMESTAMP(CAST(fhv_tripdata.dropoff_datetime AS STRING)) AS dropoff_datetime,", "section": "Module 4: analytics engineering with dbt", "question": "Invalid date types after Ingesting FHV data through CSV files: Could not parse 'pickup_datetime' as a timestamp" }, { "text": "If you manually uploaded the fhv 2019 parquet files after downloading them from https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-*.parquet you may face errors regarding date types while loading the data into a landing table (say fhv_tripdata). Try to create the external table with the schema defined as follows and load each month in a loop.\n-----Correct load with schema definition----will not throw error----------------------\nCREATE OR REPLACE EXTERNAL TABLE `dw-bigquery-week-3.trips_data_all.external_tlc_fhv_trips_2019` (\ndispatching_base_num STRING,\npickup_datetime TIMESTAMP,\ndropoff_datetime TIMESTAMP,\nPUlocationID FLOAT64,\nDOlocationID FLOAT64,\nSR_Flag FLOAT64,\nAffiliated_base_number STRING\n)\nOPTIONS (\nformat = 'PARQUET',\nuris = ['gs://project id/fhv_2019_8.parquet']\n);\nYou can also use uris = ['gs://project id/fhv_2019_*.parquet'], which removes the need for the loop and loads all months in a single run.", "section": "Module 4: analytics engineering with dbt", "question": "Invalid data types after Ingesting FHV data through parquet files: Could not parse SR_Flag as Float64,Couldn\u2019t parse datetime column as timestamp,couldn\u2019t handle NULL values in PULocationID,DOLocationID" }, { "text": "When accessing Looker Studio through the Google Cloud Project console, you may be prompted to subscribe to the Pro version and receive the following errors:\nInstead, navigate to https://lookerstudio.google.com/navigation/reporting which will take you to the free version.", "section": "Module 4: analytics engineering with dbt", "question": "Google Looker Studio - you have used up your 30-day trial" }, { "text": "Ans: Dbt provides a mechanism called \"ref\" to manage dependencies between models. By referencing other models using the \"ref\" keyword in SQL, dbt automatically understands the dependencies and ensures the correct execution order.\nLoading FHV data stalls when using Mage?\nTry loading the data using jupyter notebooks in a local environment. 
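For example, a minimal local-notebook sketch of the approach described below (the file URL and bucket name are only examples, not the course's exact script; it needs pandas, pyarrow and google-cloud-storage installed):\nimport pandas as pd\nfrom google.cloud import storage\nurl = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-01.csv.gz'\ndf = pd.read_csv(url)\n# ... make any necessary transformations here ...\ndf.to_parquet('fhv_tripdata_2019-01.parquet')\nclient = storage.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set\nbucket = client.bucket('your-gcs-bucket')  # example bucket name\nbucket.blob('fhv/fhv_tripdata_2019-01.parquet').upload_from_filename('fhv_tripdata_2019-01.parquet')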
There might be bandwidth issues with Mage.\nLoad the data into a pandas dataframe using the urls, make necessary transformations, upload the gcp bucket / alternatively download the parquet/csv files locally and then upload to GCP manually.\nRegion Mismatch in DBT and BigQuery\nIf you are using the datasets copied into BigQuery from BigQuery public datasets, the region will be set as US by default and hence it is much easier to set your dbt profile location as US while transforming the tables and views. \nYou can change the location as follows:", "section": "Module 4: analytics engineering with dbt", "question": "How does dbt handle dependencies between models?" }, { "text": "Use the PostgreSQL COPY FROM feature that is compatible with csv files\nCOPY table_name [ ( column_name [, ...] ) ]\nFROM { 'filename' | PROGRAM 'command' | STDIN }\n[ [ WITH ] ( option [, ...] ) ]\n[ WHERE condition ]", "section": "Module 4: analytics engineering with dbt", "question": "What is the fastest way to upload taxi data to dbt-postgres?" }, { "text": "Update the line:\nWith:", "section": "Module 5: pyspark", "question": "When configuring the profiles.yml file for dbt-postgres with jinja templates with environment variables, I'm getting \"Credentials in profile \"PROFILE_NAME\", target: 'dev', invalid: '5432'is not of type 'integer'" }, { "text": "Install SDKMAN:\ncurl -s \"https://get.sdkman.io\" | bash\nsource \"$HOME/.sdkman/bin/sdkman-init.sh\"\nUsing SDKMAN, install Java 11 and Spark 3.3.2:\nsdk install java 11.0.22-tem\nsdk install spark 3.3.2\nOpen a new terminal or run the following in the same shell:\nsource \"$HOME/.sdkman/bin/sdkman-init.sh\"\nVerify the locations and versions of Java and Spark that were installed:\necho $JAVA_HOME\njava -version\necho $SPARK_HOME\nspark-submit --version", "section": "Module 5: pyspark", "question": "Setting up Java and Spark (with PySpark) on Linux (Alternative option using SDKMAN)" }, { "text": "If you\u2019re seriously struggling to set things up \"locally\" (here locally meaning non/partly-managed environment like own laptop, a VM or Codespaces) you can use the following guide to use Spark in Google Colab:\nhttps://medium.com/gitconnected/launch-spark-on-google-colab-and-connect-to-sparkui-342cad19b304\nStarter notebook:\nhttps://github.com/aaalexlit/medium_articles/blob/main/Spark_in_Colab.ipynb\nIt\u2019s advisable to spend some time setting things up locally rather than jumping right into this solution.", "section": "Module 5: pyspark", "question": "PySpark - Setting Spark up in Google Colab" }, { "text": "If after installing Java (either jdk or openjdk), Hadoop and Spark, and setting the corresponding environment variables you find the following error when spark-shell is run at CMD:\njava.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3c947bc5) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed\nmodule @0x3c947bc5\nSolution: Java 17 or 19 is not supported by Spark. Spark 3.x: requires Java 8/11/16. Install Java 11 from the website provided in the windows.md setup file.", "section": "Module 5: pyspark", "question": "Spark-shell: unable to load native-hadoop library for platform - Windows" }, { "text": "I found this error while executing the user defined function in Spark (crazy_stuff_udf). I am working on Windows and using conda. 
After following the setup instructions, I found that the PYSPARK_PYTHON environment variable was not set correctly, given that conda has different python paths for each environment.\nSolution:\npip install findspark on the command line inside proper environment\nAdd to the top of the script\nimport findspark\nfindspark.init()", "section": "Module 5: pyspark", "question": "PySpark - Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases." }, { "text": "This is because Python 3.11 has some inconsistencies with such an old version of Spark. The solution is a downgrade in the Python version. Python 3.9 using a conda environment takes care of it. Or install newer PySpark >= 3.5.1 works for me (Ella) [source].", "section": "Module 5: pyspark", "question": "PySpark - TypeError: code() argument 13 must be str, not int , while executing `import pyspark` (Windows/ Spark 3.0.3 - Python 3.11)" }, { "text": "If anyone is a Pythonista or becoming one (which you will essentially be one along this journey), and desires to have all python dependencies under same virtual environment (e.g. conda) as done with prefect and previous exercises, simply follow these steps\nInstall OpenJDK 11,\non MacOS: $ brew install java11\nAdd export PATH=\"/opt/homebrew/opt/openjdk@11/bin:$PATH\"\nto ~/.bashrc or ~/zshrc\nActivate working environment (by pipenv / poetry / conda)\nRun $ pip install pyspark\nWork with exercises as normal\nAll default commands of spark will be also available at shell session under activated enviroment.\nHope this can help!\nP.s. you won\u2019t need findspark to firstly initialize.\nPy4J - Py4JJavaError: An error occurred while calling (...) java.net.ConnectException: Connection refused: no further information;\nIf you're getting `Py4JavaError` with a generic root cause, such as the described above (Connection refused: no further information). You're most likely using incompatible versions of the JDK or Python with Spark.\nAs of the current latest Spark version (3.5.0), it supports JDK 8 / 11 / 17. All of which can be easily installed with SDKMan! on macOS or Linux environments\n\n$ sdk install java 17.0.10-librca\n$ sdk install spark 3.5.0\n$ sdk install hadoop 3.3.5\nAs PySpark 3.5.0 supports Python 3.8+ make sure you're setting up your virtualenv with either 3.8 / 3.9 / 3.10 / 3.11 (Most importantly avoid using 3.12 for now as not all libs in the data-science/engineering ecosystem are fully package for that)\n\n\n$ conda create -n ENV_NAME python=3.11\n$ conda activate ENV_NAME\n$ pip install pyspark==3.5.0\nThis setup makes installing `findspark` and the likes of it unnecessary. Happy coding.\nPy4J - Py4JJavaError: An error occurred while calling o54.parquet. 
Or any kind of Py4JJavaError that show up after run df.write.parquet('zones')(On window)\nThis assume you already correctly set up the PATH in the nano ~/.bashrc\nHere my\nexport JAVA_HOME=\"/c/tools/jdk-11.0.21\"\nexport PATH=\"${JAVA_HOME}/bin:${PATH}\"\nexport HADOOP_HOME=\"/c/tools/hadoop-3.2.0\"\nexport PATH=\"${HADOOP_HOME}/bin:${PATH}\"\nexport SPARK_HOME=\"/c/tools/spark-3.3.2-bin-hadoop3\"\nexport PATH=\"${SPARK_HOME}/bin:${PATH}\"\nexport PYTHONPATH=\"${SPARK_HOME}/python/:$PYTHONPATH\"\nexport PYTHONPATH=\"${SPARK_HOME}spark-3.5.1-bin-hadoop3py4j-0.10.9.5-src.zip:$PYTHONPATH\"\nYou also need to add environment variables correctly which paths to java jdk, spark and hadoop through\nGo to Stephenlaye2/winutils3.3.0: winutils.exe hadoop.dll and hdfs.dll binaries for hadoop windows (github.com), download the right winutils for hadoop-3.2.0. Then create a new folder,bin and put every thing in side to make a /c/tools/hadoop-3.2.0/bin(You might not need to do this, but after testing it without the /bin I could not make it to work)\nThen follow the solution in this video: How To Resolve Issue with Writing DataFrame to Local File | winutils | msvcp100.dll (youtube.com)\nRemember to restart IDE and computer, After the error An error occurred while calling o54.parquet. is fixed but new errors like o31.parquet. Or o35.parquet. appear.", "section": "Module 5: pyspark", "question": "Java+Spark - Easy setup with miniconda env (worked on MacOS)" }, { "text": "After installing all including pyspark (and it is successfully imported), but then running this script on the jupyter notebook\nimport pyspark\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder \\\n.master(\"local[*]\") \\\n.appName('test') \\\n.getOrCreate()\ndf = spark.read \\\n.option(\"header\", \"true\") \\\n.csv('taxi+_zone_lookup.csv')\ndf.show()\nit gives the error:\nRuntimeError: Java gateway process exited before sending its port number\n\u2705The solution (for me) was:\npip install findspark on the command line and then\nAdd\nimport findspark\nfindspark.init()\nto the top of the script.\nAnother possible solution is:\nCheck that pyspark is pointing to the correct location.\nRun pyspark.__file__. It should be list /home/<your user name>/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/__init__.py if you followed the videos.\nIf it is pointing to your python site-packages remove the pyspark directory there and check that you have added the correct exports to you .bashrc file and that there are not any other exports which might supersede the ones provided in the course content.\nTo add to the solution above, if the errors persist in regards to setting the correct path for spark, an alternative solution for permanent path setting solve the error is to set environment variables on system and user environment variables following this tutorial: Install Apache PySpark on Windows PC | Apache Spark Installation Guide\nOnce everything is installed, skip to 7:14 to set up environment variables. 
This allows for the environment variables to be set permanently.", "section": "Module 5: pyspark", "question": "RuntimeError: Java gateway process exited before sending its port number" }, { "text": "Even after installing pyspark correctly on a linux machine (VM) as per the course instructions, I faced a module not found error in jupyter notebook.\nThe solution which worked for me (use the following in a jupyter notebook):\n!pip install findspark\nimport findspark\nfindspark.init()\nThereafter, import pyspark and create the spark context as usual.\nNone of the solutions above worked for me till I ran !pip3 install pyspark instead of !pip install pyspark.\nFilter based on conditions on multiple columns:\nfrom pyspark.sql.functions import col\nnew_final.filter((new_final.a_zone==\"Murray Hill\") & (new_final.b_zone==\"Midwood\")).show()\nKrishna Anand", "section": "Module 5: pyspark", "question": "Module Not Found Error in Jupyter Notebook ." }, { "text": "You need to look for the Py4J file and note the version in the filename. Once you know the version, you can update the export command accordingly; this is how you check yours:\n` ls ${SPARK_HOME}/python/lib/ ` and then you add it in the export command, mine was:\nexport PYTHONPATH=\"${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:${PYTHONPATH}\"\nMake sure that the version under `${SPARK_HOME}/python/lib/` matches the filename of py4j or you will encounter `ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`.\nFor instance, if the file under `${SPARK_HOME}/python/lib/` was `py4j-0.10.9.3-src.zip`, then the export PYTHONPATH statement above should be changed to `export PYTHONPATH=\"${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH\"` appropriately.\nAdditionally, you can check the version of \u2018py4j\u2019 of the spark you\u2019re using from here and update as mentioned above.\n~ Abhijit Chakraborty: Sometimes, even adding the correct version of py4j might not solve the problem. Simply run pip install py4j and the problem should be resolved.", "section": "Module 5: pyspark", "question": "Py4JJavaError - ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`" }, { "text": "If the below does not work, then download the latest available py4j version with\nconda install -c conda-forge py4j\nTake care to use the latest version number from the website and replace it appropriately.\nNow add\nexport PYTHONPATH=\"${SPARK_HOME}/python/:$PYTHONPATH\"\nexport PYTHONPATH=\"${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH\"\nin your .bashrc file.", "section": "Module 5: pyspark", "question": "Py4J Error - ModuleNotFoundError: No module named 'py4j' (Solve with latest version)" }, { "text": "Even after we have exported our paths correctly you may find that, even though Jupyter is installed, you might not have Jupyter Notebook for one reason or another. Full instructions are found here (for my walkthrough) or here (where I got the original instructions from) but are included below. 
These instructions include setting up a virtual environment (handy if you are on your own machine doing this and not a VM):\nFull steps:\nUpdate and upgrade packages:\nsudo apt update && sudo apt -y upgrade\nInstall Python:\nsudo apt install python3-pip python3-dev\nInstall Python virtualenv:\nsudo -H pip3 install --upgrade pip\nsudo -H pip3 install virtualenv\nCreate a Python Virtual Environment:\nmkdir notebook\ncd notebook\nvirtualenv jupyterenv\nsource jupyterenv/bin/activate\nInstall Jupyter Notebook:\npip install jupyter\nRun Jupyter Notebook:\njupyter notebook", "section": "Module 5: pyspark", "question": "Exception: Jupyter command `jupyter-notebook` not found." }, { "text": "Code executed:\ndf = spark.read.parquet(pq_path)\n\u2026 some operations on df \u2026\ndf.write.parquet(pq_path, mode=\"overwrite\")\njava.io.FileNotFoundException: File file:/home/xxx/code/data/pq/fhvhv/2021/02/part-00021-523f9ad5-14af-4332-9434-bdcb0831f2b7-c000.snappy.parquet does not exist\nThe problem is that Sparks performs lazy transformations, so the actual action that trigger the job is df.write, which does delete the parquet files that is trying to read (mode=\u201doverwrite\u201d)\n\u2705Solution: Write to a different directorydf\ndf.write.parquet(pq_path_temp, mode=\"overwrite\")", "section": "Module 5: pyspark", "question": "Error java.io.FileNotFoundException" }, { "text": "You need to create the Hadoop /bin directory manually and add the downloaded files in there, since the shell script provided for Windows installation just puts them in /c/tools/hadoop-3.2.0/ .", "section": "Module 5: pyspark", "question": "Hadoop - FileNotFoundException: Hadoop bin directory does not exist , when trying to write (Windows)" }, { "text": "Actually Spark SQL is one independent \u201ctype\u201d of SQL - Spark SQL.\nThe several SQL providers are very similar:\nSELECT [attributes]\nFROM [table]\nWHERE [filter]\nGROUP BY [grouping attributes]\nHAVING [filtering the groups]\nORDER BY [attribute to order]\n(INNER/FULL/LEFT/RIGHT) JOIN [table2]\nON [attributes table joining table2] (...)\nWhat differs the most between several SQL providers are built-in functions.\nFor Built-in Spark SQL function check this link: https://spark.apache.org/docs/latest/api/sql/index.html\nExtra information on SPARK SQL :\nhttps://databricks.com/glossary/what-is-spark-sql#:~:text=Spark%20SQL%20is%20a%20Spark,on%20existing%20deployments%20and%20data.", "section": "Module 5: pyspark", "question": "Which type of SQL is used in Spark? Postgres? MySQL? SQL Server?" }, { "text": "\u2705Solution: I had two notebooks running, and the one I wanted to look at had opened a port on localhost:4041.\nIf a port is in use, then Spark uses the next available port number. It can be even 4044. 
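If you prefer to pin the UI to a known port instead of letting Spark probe 4040, 4041 and so on, you can set it explicitly when building the session (a minimal sketch; the port number is arbitrary):\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder \\\n.master('local[*]') \\\n.appName('test') \\\n.config('spark.ui.port', '4050') \\\n.getOrCreate()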
Clean up after yourself when a port does not work or a container does not run.\nYou can run spark.sparkContext.uiWebUrl\nand result will be some like\n'http://172.19.10.61:4041'", "section": "Module 5: pyspark", "question": "The spark viewer on localhost:4040 was not showing the current run" }, { "text": "\u2705Solution: replace Java Developer Kit 11 with Java Developer Kit 8.\nJava - RuntimeError: Java gateway process exited before sending its port number\nShows java_home is not set on the notebook log\nhttps://sparkbyexamples.com/pyspark/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-port-number/\nhttps://twitter.com/drkrishnaanand/status/1765423415878463839", "section": "Module 5: pyspark", "question": "Java - java.lang.NoSuchMethodError: sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner Error during repartition call (conda pyspark installation)" }, { "text": "\u2705I got it working using `gcs-connector-hadoop-2.2.5-shaded.jar` and Spark 3.1\nI also added the google_credentials.json and .p12 to auth with gcs. These files are downloadable from GCP Service account.\nTo create the SparkSession:\nspark = SparkSession.builder.master('local[*]') \\\n.appName('spark-read-from-bigquery') \\\n.config('BigQueryProjectId','razor-project-xxxxxxx) \\\n.config('BigQueryDatasetLocation','de_final_data') \\\n.config('parentProject','razor-project-xxxxxxx) \\\n.config(\"google.cloud.auth.service.account.enable\", \"true\") \\\n.config(\"credentialsFile\", \"google_credentials.json\") \\\n.config(\"GcpJsonKeyFile\", \"google_credentials.json\") \\\n.config(\"spark.driver.memory\", \"4g\") \\\n.config(\"spark.executor.memory\", \"2g\") \\\n.config(\"spark.memory.offHeap.enabled\",True) \\\n.config(\"spark.memory.offHeap.size\",\"5g\") \\\n.config('google.cloud.auth.service.account.json.keyfile', \"google_credentials.json\") \\\n.config(\"fs.gs.project.id\", \"razor-project-xxxxxxx\") \\\n.config(\"fs.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\") \\\n.config(\"fs.AbstractFileSystem.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\") \\\n.getOrCreate()", "section": "Module 5: pyspark", "question": "Spark fails when reading from BigQuery and using `.show()` on `SELECT` queries" }, { "text": "While creating a SparkSession using the config spark.jars.packages as com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2\nspark = SparkSession.builder.master('local').appName('bq').config(\"spark.jars.packages\", \"com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2\").getOrCreate()\nautomatically downloads the required dependency jars and configures the connector, removing the need to manage this dependency. More details available here", "section": "Module 5: pyspark", "question": "Spark BigQuery connector Automatic configuration" }, { "text": "Link to Slack Thread : has anyone figured out how to read from GCP data lake instead of downloading all the taxi data again?\nThere\u2019s a few extra steps to go into reading from GCS with PySpark\n1.) IMPORTANT: Download the Cloud Storage connector for Hadoop here: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#clusters\nAs the name implies, this .jar file is what essentially connects PySpark with your GCS\n2.) Move the .jar file to your Spark file directory. I installed Spark using homebrew on my MacOS machine and I had to create a /jars directory under \"/opt/homebrew/Cellar/apache-spark/3.2.1/ (where my spark dir is located)\n3.) 
In your Python script, there are a few extra classes you\u2019ll have to import:\nimport pyspark\nfrom pyspark.sql import SparkSession\nfrom pyspark.conf import SparkConf\nfrom pyspark.context import SparkContext\n4.) You must set up your configurations before building your SparkSession. Here\u2019s my code snippet:\nconf = SparkConf() \\\n.setMaster('local[*]') \\\n.setAppName('test') \\\n.set(\"spark.jars\", \"/opt/homebrew/Cellar/apache-spark/3.2.1/jars/gcs-connector-hadoop3-latest.jar\") \\\n.set(\"spark.hadoop.google.cloud.auth.service.account.enable\", \"true\") \\\n.set(\"spark.hadoop.google.cloud.auth.service.account.json.keyfile\", \"path/to/google_credentials.json\")\nsc = SparkContext(conf=conf)\nsc._jsc.hadoopConfiguration().set(\"fs.AbstractFileSystem.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\")\nsc._jsc.hadoopConfiguration().set(\"fs.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\")\nsc._jsc.hadoopConfiguration().set(\"fs.gs.auth.service.account.json.keyfile\", \"path/to/google_credentials.json\")\nsc._jsc.hadoopConfiguration().set(\"fs.gs.auth.service.account.enable\", \"true\")\n5.) Once you run that, build your SparkSession with the new parameters we\u2019d just instantiated in the previous step:\nspark = SparkSession.builder \\\n.config(conf=sc.getConf()) \\\n.getOrCreate()\n6.) Finally, you\u2019re able to read your files straight from GCS!\ndf_green = spark.read.parquet(\"gs://{BUCKET}/green/202*/\")", "section": "Module 5: pyspark", "question": "Spark Cloud Storage connector" }, { "text": "from pyarrow.parquet import ParquetFile\npf = ParquetFile('fhvhv_tripdata_2021-01.parquet')\n# pyarrow builds tables, not dataframes\ntbl_small = next(pf.iter_batches(batch_size = 1000))\n# this converts the table to a pandas dataframe of manageable size\ndf = tbl_small.to_pandas()\nAlternatively, without PyArrow:\ndf = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')\ndf1 = df.sort('DOLocationID').limit(1000)\npdf = df1.select(\"*\").toPandas()", "section": "Module 5: pyspark", "question": "How can I read a small number of rows from the parquet file directly?" }, { "text": "Probably you\u2019ll encounter this if you followed the video \u20185.3.1 - First Look at Spark/PySpark\u2019 and used the parquet file from the TLC website (csv was used in the video).\nWhen defining the schema, the PULocationID and DOLocationID columns are defined as IntegerType. This will cause an error because the Parquet column is INT64, and you\u2019ll get an error like:\nParquet column cannot be converted in file [...] Column [...] Expected: int, Found: INT64\nChange the schema definition from IntegerType to LongType and it should work.", "section": "Module 5: pyspark", "question": "DataType error when creating Spark DataFrame with a specified schema?" }, { "text": "df_finalx = df_finalw.select([col(x).alias(x.replace(\" \", \"\")) for x in df_finalw.columns])\nKrishna Anand", "section": "Module 5: pyspark", "question": "Remove white spaces from column names in Pyspark" }, { "text": "This error comes up on the Spark video 5.3.1 - First Look at Spark/PySpark,\nbecause at the time the video was created, 2021 data was the most recent and it used CSV files, but now it is Parquet.\nSo when you run the command spark.createDataFrame(df1_pandas).show(),\nyou get the AttributeError. 
This is caused by pandas version 2.0.0, which is incompatible with Spark 3.3.2, so to fix it you have to downgrade pandas to 1.5.3 using the command pip install -U pandas==1.5.3\nAnother option, if you do not want to downgrade the pandas version, is adding the following after importing pandas (source):\npd.DataFrame.iteritems = pd.DataFrame.items\nNote that this problem is fixed in Spark versions from 3.4.1 onwards.", "section": "Module 5: pyspark", "question": "AttributeError: 'DataFrame' object has no attribute 'iteritems'" }, { "text": "Another alternative is to install pandas 2.0.1 (it worked well at the time of writing), which is compatible with PySpark 3.5.1. Make sure to add or edit your environment variables like this:\nexport SPARK_HOME=\"${HOME}/spark/spark-3.5.1-bin-hadoop3\"\nexport PATH=\"${SPARK_HOME}/bin:${PATH}\"", "section": "Module 5: pyspark", "question": "AttributeError: 'DataFrame' object has no attribute 'iteritems'" }, { "text": "Open a CMD terminal in administrator mode\ncd %SPARK_HOME%\nStart a master node: bin\\spark-class org.apache.spark.deploy.master.Master\nStart a worker node: bin\\spark-class org.apache.spark.deploy.worker.Worker spark://<master_ip>:<port> --host <IP_ADDR>\nFor example: bin\\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 --host <IP_ADDR>\nspark://<master_ip>:<port>: copy the address from the previous command; in my case it was spark://localhost:7077\nUse --host <IP_ADDR> if you want to run the worker on a different machine. For now leave it empty.\nNow you can access the Spark UI through localhost:8080\nHomework for Module 5:\nDo not refer to the homework file located under /05-batch/code/. The correct file is located under\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2024/05-batch/homework.md", "section": "Module 5: pyspark", "question": "Spark Standalone Mode on Windows" }, { "text": "You can either type the export command every time you start a new session, add it to the .bashrc file in your home directory, or run this command at the beginning of your notebook:\nimport findspark\nfindspark.init()", "section": "Module 5: pyspark", "question": "Export PYTHONPATH command in linux is temporary" }, { "text": "I solved this issue by unzipping the file before creating head.csv", "section": "Module 5: pyspark", "question": "Compressed file ended before the end-of-stream marker was reached" }, { "text": "In the code along from Video 5.3.3, Alexey downloads the CSV files from the NYC taxi website and gzips them in the bash script. If we now (2023) follow along but download the data from the GitHub course repo, it will already be zipped as csv.gz files. Therefore we would zip it a second time if we follow the code from the video exactly. This then leads to gibberish output when we try to cat the contents or count the lines with zcat, because the file is zipped twice and zcat only unzips it once.\n\u2705solution: do not gzip the files downloaded from the course repo. Just wget them and save them as they are as csv.gz files. 
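A quick way to sanity-check a downloaded file (the file name here is just an example) is to look at the first few lines; they should be readable CSV rather than binary noise:\nzcat green_tripdata_2020-01.csv.gz | head -n 5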
Then the zcat command and the showSchema command will also work.\nURL=\"${URL_PREFIX}/${TAXI_TYPE}/${TAXI_TYPE}_tripdata_${YEAR}-${FMONTH}.csv.gz\"\nLOCAL_PREFIX=\"data/raw/${TAXI_TYPE}/${YEAR}/${FMONTH}\"\nLOCAL_FILE=\"${TAXI_TYPE}_tripdata_${YEAR}_${FMONTH}.csv.gz\"\nLOCAL_PATH=\"${LOCAL_PREFIX}/${LOCAL_FILE}\"\necho \"downloading ${URL} to ${LOCAL_PATH}\"\nmkdir -p ${LOCAL_PREFIX}\nwget ${URL} -O ${LOCAL_PATH}\necho \"compressing ${LOCAL_PATH}\"\n# gzip ${LOCAL_PATH} <- keep this line commented out", "section": "Module 5: pyspark", "question": "Compression Error: zcat output is gibberish, seems like still compressed" }, { "text": "Occurred while running: spark.createDataFrame(df_pandas).show()\nThis error is usually due to the Python version: as of 2 March 2023, Spark doesn\u2019t support Python 3.11. Try creating a new env with Python 3.8 and then run the command again.\nOn the virtual machine, you can create a conda environment (here called myenv) with python 3.10 installed:\nconda create -n myenv python=3.10 anaconda\nThen you must run conda activate myenv to run python 3.10. Otherwise you\u2019ll still be running version 3.11. You can deactivate by typing conda deactivate.", "section": "Module 5: pyspark", "question": "PicklingError: Could not serialise object: IndexError: tuple index out of range." }, { "text": "Make sure your GCP credentials are on your VM at the location defined in the script.", "section": "Module 5: pyspark", "question": "Connecting from local Spark to GCS - Spark does not find my google credentials as shown in the video?" }, { "text": "To run Spark in a Docker setup:\n1. Build the Bitnami Spark Docker image\na. Clone the Bitnami repo using the command\ngit clone https://github.com/bitnami/containers.git\n(tested on commit 9cef8b892d29c04f8a271a644341c8222790c992)\nb. Edit the file `bitnami/spark/3.3/debian-11/Dockerfile` and update the Python and Java versions as follows\n\"python-3.10.10-2-linux-${OS_ARCH}-debian-11\" \\\n\"java-17.0.5-8-3-linux-${OS_ARCH}-debian-11\" \\\nreference: https://github.com/bitnami/containers/issues/13409\nc. Build the Docker image by navigating to the above directory and running the docker build command:\ncd bitnami/spark/3.3/debian-11/\ndocker build -t spark:3.3-java-17 .\n2. 
Run docker compose using the following file:\n```yaml docker-compose.yml\nversion: '2'\nservices:\n  spark:\n    image: spark:3.3-java-17\n    environment:\n      - SPARK_MODE=master\n      - SPARK_RPC_AUTHENTICATION_ENABLED=no\n      - SPARK_RPC_ENCRYPTION_ENABLED=no\n      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no\n      - SPARK_SSL_ENABLED=no\n    volumes:\n      - \"./:/home/jovyan/work:rw\"\n    ports:\n      - '8080:8080'\n      - '7077:7077'\n  spark-worker:\n    image: spark:3.3-java-17\n    environment:\n      - SPARK_MODE=worker\n      - SPARK_MASTER_URL=spark://spark:7077\n      - SPARK_WORKER_MEMORY=1G\n      - SPARK_WORKER_CORES=1\n      - SPARK_RPC_AUTHENTICATION_ENABLED=no\n      - SPARK_RPC_ENCRYPTION_ENABLED=no\n      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no\n      - SPARK_SSL_ENABLED=no\n    volumes:\n      - \"./:/home/jovyan/work:rw\"\n    ports:\n      - '8081:8081'\n  spark-nb:\n    image: jupyter/pyspark-notebook:java-17.0.5\n    environment:\n      - SPARK_MASTER_URL=spark://spark:7077\n    volumes:\n      - \"./:/home/jovyan/work:rw\"\n    ports:\n      - '8888:8888'\n      - '4040:4040'\n```\nRun this command to deploy the docker compose setup:\ndocker-compose up\nAccess the Jupyter notebook using the link logged in the docker compose logs.\nThe Spark master URL is spark://spark:7077", "section": "Module 5: pyspark", "question": "Spark docker-compose setup" }, { "text": "To do this:\npip install gcsfs\nThereafter copy the URI path to the file and use\ndf = pandas.read_csv(\"gs://path\")", "section": "Module 5: pyspark", "question": "How do you read data stored in gcs on pandas with your local computer?" }, { "text": "Error:\nspark.createDataFrame(df_pandas).schema\nTypeError: field Affiliated_base_number: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>\nSolution:\nAffiliated_base_number is a mix of letters and numbers (you can check this with a preview of the table), so it cannot be set to DoubleType (which is only for double-precision numbers). The suitable type would be StringType. Spark's inferSchema is more accurate than the pandas infer-type method in this case. You can set it to true while reading the csv, so you don\u2019t have to take out any data from your dataset. Something like this can help:\ndf = spark.read \\\n.options(header=\"true\", inferSchema=\"true\") \\\n.csv('path/to/your/csv/file/')\nSolution B:\nThis is because some rows in the Affiliated_base_number column are null, so the column is assigned the datatype String, which cannot be converted to type Double. So if you really want to convert this pandas df to a PySpark df, only take the rows from the pandas df that are not null in the 'Affiliated_base_number' column. Then you will be able to apply the PySpark function createDataFrame.\n# Only take rows that have no null values\npandas_df = pandas_df[pandas_df.notnull().all(1)]", "section": "Module 5: pyspark", "question": "TypeError when using spark.createDataFrame function on a pandas df" }, { "text": "Default executor memory is 1 GB. This error appeared when working with the homework dataset.\nError: MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory\nScaling row group sizes to 95.00% for 8 writers\nSolution:\nIncrease the memory of the executor when creating the Spark session like this:
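A minimal sketch, assuming a local[*] session as in the course videos (the 4g values are just examples; adjust them to what your machine allows):\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder \\\n.master(\"local[*]\") \\\n.appName('test') \\\n.config(\"spark.driver.memory\", \"4g\") \\\n.config(\"spark.executor.memory\", \"4g\") \\\n.getOrCreate()\nRemember to restart the Jupyter session (i.e. 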
close the Spark session) or the config won\u2019t take effect.", "section": "Module 5: pyspark", "question": "MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory" }, { "text": "Change the working directory to the spark directory:\nif you have set up your SPARK_HOME variable, use the following:\ncd %SPARK_HOME%\nif not, use the following:\ncd <path to spark installation>\nCreating a Local Spark Cluster\nTo start Spark Master:\nbin\\spark-class org.apache.spark.deploy.master.Master --host localhost\nStarting up a cluster:\nbin\\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 --host localhost", "section": "Module 5: pyspark", "question": "How to run a Spark standalone cluster on Windows OS" }, { "text": "I added PYTHONPATH, JAVA_HOME and SPARK_HOME to ~/.bashrc; import pyspark worked OK in IPython in the terminal, but couldn\u2019t be found in a .ipynb opened in VS Code.\nAfter adding new lines to ~/.bashrc, you need to restart the shell to activate them; run either\nsource ~/.bashrc\nor\nexec bash\nInstead of configuring paths in ~/.bashrc, I created a .env file in the root of my workspace:", "section": "Module 5: pyspark", "question": "Env variables set in ~/.bashrc are not loaded to Jupyter in VS Code" }, { "text": "I don\u2019t use VS Code, so I did it the old-fashioned way: ssh -L 8888:localhost:8888 <my user>@<VM IP> (replace user and IP with the ones used by the GCP VM, e.g.: ssh -L 8888:localhost:8888 myuser@34.140.188.1)", "section": "Module 5: pyspark", "question": "How to port forward outside VS Code" }, { "text": "If you are doing wc -l fhvhv_tripdata_2021-01.csv.gz with the gzip file as the file argument, you will get a different result, obviously, since the file is compressed.\nUnzip the file and then do wc -l fhvhv_tripdata_2021-01.csv to get the right results.", "section": "Module 5: pyspark", "question": "\u201cwc -l\u201d is giving a different result than shown in the video" }, { "text": "When trying to run:\nURL=\"spark://$HOSTNAME:7077\"\nspark-submit \\\n--master=\"{$URL}\" \\\n06_spark_sql.py \\\n--input_green=data/pq/green/2021/*/ \\\n--input_yellow=data/pq/yellow/2021/*/ \\\n--output=data/report-2021\nand you get errors like the following (SUMMARIZED):\nWARN Utils: Your hostname, <HOSTNAME> resolves to a loopback address..\nWARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to \"WARN\".\nException in thread \"main\" org.apache.spark.SparkException: Master must either be yarn or start with spark, mesos, k8s, or local at \u2026\nTry replacing --master=\"{$URL}\"\nwith --master=$URL (edited)\nExtra edit for spark version 3.4.2 - if encountering:\n`Error: Unrecognized option: --master=`\n\u2192 Replace `--master=\"{$URL}\"` with `--master \"${URL}\"`", "section": "Module 5: pyspark", "question": "`spark-submit` errors" }, { "text": "If you are seeing this (or similar) error when attempting to write to parquet, it is likely an issue with your path variables.\nFor Windows, create a new User Variable \u201cHADOOP_HOME\u201d that points to your Hadoop directory. 
Then add \u201c%HADOOP_HOME%\\bin\u201d to the PATH variable.\nAdditional tips can be found here: https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io", "section": "Module 5: pyspark", "question": "Hadoop - Exception in thread \"main\" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z" }, { "text": "Change the Hadoop version to 3.0.1. Replace all the files in the local hadoop bin folder with the files in this repo: winutils/hadoop-3.0.1/bin at master \u00b7 cdarlint/winutils (github.com)\nIf this does not work, try other versions found in this repository.\nFor more information, please see this link: This version of %1 is not compatible with the version of Windows you're running \u00b7 Issue #20 \u00b7 cdarlint/winutils (github.com)", "section": "Module 5: pyspark", "question": "Java.io.IOException. Cannot run program \u201cC:\\hadoop\\bin\\winutils.exe\u201d. CreateProcess error=216, This version of %1 is not compatible with the version of Windows you are using." }, { "text": "The fix is to set the flag as the error states. Get your project ID from your dashboard and set it like so:\ngcloud dataproc jobs submit pyspark \\\n--cluster=my_cluster \\\n--region=us-central1 \\\n--project=my-dtc-project-1010101 \\\ngs://my-dtc-bucket-id/code/06_spark_sql.py\n-- \\\n\u2026", "section": "Module 5: pyspark", "question": "Dataproc - ERROR: (gcloud.dataproc.jobs.submit.pyspark) The required property [project] is not currently set. It can be set on a per-command basis by re-running your command with the [--project] flag." }, { "text": "Go to %SPARK_HOME%\\bin\nRun spark-class org.apache.spark.deploy.master.Master to run the master. This will give you a URL of the form spark://ip:port\nRun spark-class org.apache.spark.deploy.worker.Worker spark://ip:port to run the worker. Make sure you use the URL you obtained in step 2.\nCreate a new Jupyter notebook:\nspark = SparkSession.builder \\\n.master(\"spark://{ip}:7077\") \\\n.appName('test') \\\n.getOrCreate()\nCheck the master, worker and app on the Spark UI.", "section": "Module 5: pyspark", "question": "Run Local Cluster Spark in Windows 10 with CMD" }, { "text": "This occurs because you are not logged in (via \u201cgcloud auth login\u201d) and maybe the project id is not set. Type this in a terminal:\ngcloud auth login\nThis will open a tab in the browser; accept the terms, and after that you can close the tab. Then set the project id like this:\ngcloud config set project <YOUR PROJECT_ID>\nThen you can run the command to upload the pq dir to a GCS Bucket:\ngsutil -m cp -r pq/ <YOUR URI from gsutil>/pq", "section": "Module 5: pyspark", "question": "ServiceException: 401 Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)." }, { "text": "When submitting a job, it might throw an error about Java in the log panel within Dataproc. 
I changed the image version (Versioning Control) when I created the cluster: I deleted the cluster and created a new one, and instead of choosing Debian-Hadoop-Spark, I switched to Ubuntu 20.04-Hadoop 3.3-Spark 3.3 for the Versioning Control feature. The main reason to choose this is that I have the same Ubuntu version on my laptop. I tried to find documentation to support this but unfortunately I couldn't; nevertheless, it works for me.", "section": "Module 5: pyspark", "question": "py4j.protocol.Py4JJavaError GCP" }, { "text": "Use both repartition and coalesce, like so:\ndf = df.repartition(6)\ndf = df.coalesce(6)\ndf.write.parquet('fhv/2019/10', mode='overwrite')", "section": "Module 5: pyspark", "question": "Repartition the Dataframe to 6 partitions using df.repartition(6) - got 8 partitions instead" }, { "text": "Possible solution: try to forward the port using the ssh CLI instead of VS Code.\nRun > \u201cssh -L <local port>:<VM host/ip>:<VM port> <ssh hostname>\u201d\n<ssh hostname> is the name you specified in the ~/.ssh/config file.\nIn the case of Jupyter Notebook, run\n\u201cssh -L 8888:localhost:8888 gcp-vm\u201d\nfrom your local machine\u2019s CLI.\nNOTE: If you log out from the session, the connection will break. Also, while creating the Spark session, check the log output of that block, because sometimes it fails to run at 4040 and then switches to 4041.\n~Abhijit Chakraborty: If you are having trouble accessing localhost ports from the GCP VM, consider adding the forwarding instructions to the .ssh/config file as follows:\n```\nHost <hostname>\nHostname <external-gcp-ip>\nUser xxxx\nIdentityFile yyyy\nLocalForward 8888 localhost:8888\nLocalForward 8080 localhost:8080\nLocalForward 5432 localhost:5432\nLocalForward 4040 localhost:4040\n```\nThis should automatically forward all ports and will enable accessing localhost ports.", "section": "Module 5: pyspark", "question": "Jupyter Notebook or SparkUI not loading properly at localhost after port forwarding from VS code?" }, { "text": "~ Abhijit Chakraborty\n`sdk list java` to check for available Java SDK versions.\n`sdk install java 11.0.22-amzn` as java-11.0.22-amzn was available for my codespace.\nClick Y if prompted to change the default java version.\nCheck the java version using `java -version`.\nIf it works fine, great; else run `sdk default java 11.0.22-amzn` (or whatever version you have installed).", "section": "Module 5: pyspark", "question": "Installing Java 11 on codespaces" }, { "text": "Sometimes while creating a dataproc cluster on GCP, the following error is encountered.\nSolution: As mentioned here, sometimes there might not be enough resources in the given region to allocate the request. Usually, it gets freed up in a bit and one can create a cluster. \u2013 abhirup ghosh\nSolution 2: Changing the type of boot disk from PD-Balanced to PD-Standard, in Terraform, helped solve the problem. - Sundara Kumar Padmanabhan", "section": "Module 5: pyspark", "question": "Error: Insufficient 'SSD_TOTAL_GB' quota. Requested 500.0, available 470.0." }, { "text": "PySpark converts the difference of two TimestampType values to Python's native datetime.timedelta object. The timedelta object only stores the duration in terms of days, seconds, and microseconds. Each of the three units of time must be manually converted into hours in order to express the total duration between the two timestamps using only hours.\nAnother way of achieving this is using the datediff SQL function. It receives these parameters:\nUpper Date: the later timestamp, for example dropoff_datetime\nLower Date: the earlier timestamp, for example pickup_datetime\nThe result is returned in terms of days, so you can multiply it by 24 in order to get the hours.
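A minimal PySpark sketch of the cast-to-epoch-seconds approach (assuming df is the trips DataFrame; the column and output names just follow the taxi dataset and are examples):\nfrom pyspark.sql import functions as F\ndf = df.withColumn(\"duration_hours\", (F.col(\"dropoff_datetime\").cast(\"long\") - F.col(\"pickup_datetime\").cast(\"long\")) / 3600)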
", "section": "Module 5: pyspark", "question": "Homework - how to convert the time difference of two timestamps to hours" }, { "text": "This version combination worked for me:\nPySpark = 3.3.2\nPandas = 1.5.3\n\nIf it still has an error,", "section": "Module 5: pyspark", "question": "PicklingError: Could not serialize object: IndexError: tuple index out of range" }, { "text": "Run this before creating the SparkSession:\nimport os\nimport sys\nos.environ['PYSPARK_PYTHON'] = sys.executable\nos.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable", "section": "Module 5: pyspark", "question": "Py4JJavaError: An error occurred while calling o180.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 6) (host.docker.internal executor driver): org.apache.spark.SparkException: Python worker failed to connect back." }, { "text": "import os\nimport sys\nos.environ['PYSPARK_PYTHON'] = sys.executable\nos.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable\nDataproc Pricing: https://cloud.google.com/dataproc/pricing#on_gke_pricing", "section": "Module 5: pyspark", "question": "RuntimeError: Python in worker has different version 3.11 than that in driver 3.10, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set." }, { "text": "Ans: No, you can submit a job to Dataproc from your local computer by installing gsutil (https://cloud.google.com/storage/docs/gsutil_install) and configuring it. Then, you can execute the following command from your local computer.\ngcloud dataproc jobs submit pyspark \\\n--cluster=de-zoomcamp-cluster \\\n--region=europe-west6 \\\ngs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \\\n-- \\\n--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \\\n--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \\\n--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020 (edited)", "section": "Module 5: pyspark", "question": "Dataproc Qn: Is it essential to have a VM on GCP for running Dataproc and submitting jobs?" }, { "text": "AttributeError: 'DataFrame' object has no attribute 'iteritems'\nThis is because the method inside PySpark refers to a pandas method that has already been deprecated\n(https://stackoverflow.com/questions/76404811/attributeerror-dataframe-object-has-no-attribute-iteritems)\nYou can use the code below, which is mentioned in the Stack Overflow link above:\nQ: DE Zoomcamp 5.6.3 - Setting up a Dataproc Cluster: I cannot create a cluster and get this message. I tried many times as the FAQ said, but it didn't work. What can I do?\nError\nInsufficient 'SSD_TOTAL_GB' quota. Requested 500.0, available 250.0.\nRequest ID: 17942272465025572271\nA: The master and worker nodes are allocated a maximum of 250 GB of memory combined. 
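If you create the cluster from the gcloud CLI instead of the Console, roughly equivalent flags are sketched here (the cluster name, region and sizes are examples, not values prescribed by the course):\ngcloud dataproc clusters create my-cluster \\\n--region=europe-west6 \\\n--master-machine-type=n2-standard-2 \\\n--master-boot-disk-size=85GB \\\n--num-workers=2 \\\n--worker-machine-type=n2-standard-2 \\\n--worker-boot-disk-size=80GB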
In the configuration section, adhere to the following specifications:\nMaster Node:\nMachine type: n2-standard-2\nPrimary disk size: 85 GB\nWorker Node:\nNumber of worker nodes: 2\nMachine type: n2-standard-2\nPrimary disk size: 80 GB\nYou can allocate up to 82.5 GB memory for worker nodes, keeping in mind that the total memory allocated across all nodes cannot exceed 250 GB.", "section": "Module 5: pyspark", "question": "In module 5.3.1, trying to run spark.createDataFrame(df_pandas).show() returns error" }, { "text": "The MacOS setup instruction (https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/macos.md#installing-java) for setting the JAVA_HOME environment variable is for Intel-based Macs which have a default install location at /usr/local/. If you have an Apple Silicon mac, you will have to set JAVA_HOME to /opt/homebrew/, specifically in your .bashrc or .zshrc:\nexport JAVA_HOME=\"/opt/homebrew/opt/openjdk/bin\"\nexport PATH=\"$JAVA_HOME:$PATH\"\nConfirm that your path was correctly set by running the command: which java\nYou should expect to see the output:\n/opt/homebrew/opt/openjdk/bin/java\nReference: https://docs.brew.sh/Installation", "section": "Module 6: streaming with kafka", "question": "Setting JAVA_HOME with Homebrew on Apple Silicon" }, { "text": "Check Docker Compose File:\nEnsure that your docker-compose.yaml file is correctly configured with the necessary details for the \"control-center\" service. Check the service name, image name, ports, volumes, environment variables, and any other configurations required for the container to start.\nOn Mac OSX 12.2.1 (Monterey) I could not start the kafka control center. I opened Docker Desktop and saw docker images still running from week 4, which I did not see when I typed \u201cdocker ps.\u201d I deleted them in docker desktop and then had no problem starting up the kafka environment.", "section": "Module 6: streaming with kafka", "question": "Could not start docker image \u201ccontrol-center\u201d from the docker-compose.yaml file." }, { "text": "Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\nTo create a virtual env and install packages (run only once)\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\nTo activate it (you'll need to run it every time you need the virtual env):\nsource env/bin/activate\nTo deactivate it:\ndeactivate\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it's env/Scripts/activate)\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.", "section": "Module 6: streaming with kafka", "question": "Module \u201ckafka\u201d not found when trying to run producer.py" }, { "text": "ImportError: DLL load failed while importing cimpl: The specified module could not be found\nVerify Python Version:\nMake sure you are using a compatible version of Python with the Avro library. Check the Python version and compatibility requirements specified by the Avro library documentation.\n... you may have to load librdkafka-5d2e2910.dll in the code. 
Add this before importing avro:\nfrom ctypes import CDLL\nCDLL(\"C:\\\\Users\\\\YOUR_USER_NAME\\\\anaconda3\\\\envs\\\\dtcde\\\\Lib\\\\site-packages\\\\confluent_kafka.libs\\\\librdkafka-5d2e2910.dll\")\nIt seems that the error may occur depending on the OS and Python version installed.\nALTERNATIVE:\nImportError: DLL load failed while importing cimpl\n\u2705SOLUTION: $env:CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1 in PowerShell.\nYou need to set this variable manually in the Conda env.\nSource: https://githubhot.com/repo/confluentinc/confluent-kafka-python/issues/1186?page=2", "section": "Module 6: streaming with kafka", "question": "Error importing cimpl dll when running avro examples" }, { "text": "\u2705SOLUTION: pip install confluent-kafka[avro].\nFor some reason, Conda also doesn't include this when installing confluent-kafka via pip.\nMore sources on Anaconda and confluent-kafka issues:\nhttps://github.com/confluentinc/confluent-kafka-python/issues/590\nhttps://github.com/confluentinc/confluent-kafka-python/issues/1221\nhttps://stackoverflow.com/questions/69085157/cannot-import-producer-from-confluent-kafka", "section": "Module 6: streaming with kafka", "question": "ModuleNotFoundError: No module named 'avro'" }, { "text": "If you get an error while running the command python3 stream.py worker:\nRun pip uninstall kafka-python\nThen run pip install kafka-python==1.4.6\nWhat is the use of Redpanda?\nRedpanda: Redpanda is built on top of the Raft consensus algorithm and is designed as a high-performance, low-latency alternative to Kafka. It uses a log-centric architecture similar to Kafka but with different underlying principles.\nRedpanda is a powerful, yet simple, and cost-efficient streaming data platform that is compatible with Kafka\u00ae APIs while eliminating Kafka complexity.", "section": "Module 6: streaming with kafka", "question": "Error while running python3 stream.py worker" }, { "text": "Got this error because the docker container memory was exhausted. The dta file was up to 800MB, but my docker container did not have enough memory to handle that.\nThe solution was to load the file in chunks with pandas, then create multiple parquet files for each dta file I was processing. This worked smoothly and the issue was resolved.", "section": "Module 6: streaming with kafka", "question": "Negsignal:SIGKILL while converting dta files to parquet format" }, { "text": "Copy the file found in the Java example: data-engineering-zoomcamp/week_6_stream_processing/java/kafka_examples/src/main/resources/rides.csv", "section": "Module 6: streaming with kafka", "question": "data-engineering-zoomcamp/week_6_stream_processing/python/resources/rides.csv is missing" }, { "text": "Tip: as the videos have low audio, I downloaded them and used VLC media player with the volume set to the maximum of 200%, and the audio became quite good. Alternatively, use the auto-generated captions on YouTube directly.\nKafka Python Videos - Rides.csv\nThere is no clear explanation of the rides.csv data that the producer.py Python programs use. 
You can find that here: https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/2bd33e89906181e424f7b12a299b70b19b7cfcd5/week_6_stream_processing/python/resources/rides.csv.", "section": "Module 6: streaming with kafka", "question": "Kafka - Python videos have low audio and are hard to follow" }, { "text": "If you have this error, it is most likely that your Kafka broker docker container is not running.\nUse docker ps to confirm.\nThen, in the folder with the docker compose yaml file, run docker compose up -d to start all the instances.", "section": "Module 6: streaming with kafka", "question": "kafka.errors.NoBrokersAvailable: NoBrokersAvailable" }, { "text": "Ankush said we can focus on the horizontal scaling option.\n\u201cthink of scaling in terms of scaling from consumer end. Or consuming message via horizontal scaling\u201d", "section": "Module 6: streaming with kafka", "question": "Kafka homework Q3: there are options that support the scaling concept more than the others" }, { "text": "If you get this error, it means you have not built your spark and jupyter images. These images aren\u2019t readily available on Docker Hub.\nIn the spark folder, run ./build.sh from a bash CLI to build all images before running docker compose.", "section": "Module 6: streaming with kafka", "question": "How to fix docker compose error: Error response from daemon: pull access denied for spark-3.3.1, repository does not exist or may require 'docker login': denied: requested access to the resource is denied" }, { "text": "Run this command in a terminal in the same directory (/docker/spark):\nchmod +x build.sh", "section": "Module 6: streaming with kafka", "question": "Python Kafka: ./build.sh: Permission denied Error" }, { "text": "Restarting all services worked for me:\ndocker-compose down\ndocker-compose up", "section": "Module 6: streaming with kafka", "question": "Python Kafka: \u2018KafkaTimeoutError: Failed to update metadata after 60.0 secs.\u2019 when running stream-example/producer.py" }, { "text": "While following tutorial 13.2, when running ./spark-submit.sh streaming.py, I encountered the following error:\n\u2026\n24/03/11 09:48:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...\n24/03/11 09:48:36 INFO TransportClientFactory: Successfully created connection to localhost/127.0.0.1:7077 after 10 ms (0 ms spent in bootstraps)\n24/03/11 09:48:54 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors\n24/03/11 09:48:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077\u2026\n24/03/11 09:49:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...\n24/03/11 09:49:36 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.\n24/03/11 09:49:36 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.\n\u2026\npy4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.SparkSession.\n: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.\n\u2026\nSolution:\nDowngrade your local PySpark to 3.3.1 (same as the Dockerfile).\nThe reason for the failed connection in my case was a mismatch of PySpark versions. 
You can see that from the logs of spark-master in the docker container.\nSolution 2:\nCheck what Spark version your local machine has\npyspark \u2013version\nspark-submit \u2013version\nAdd your version to SPARK_VERSION in build.sh", "section": "Module 6: streaming with kafka", "question": "Python Kafka: ./spark-submit.sh streaming.py - ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up." }, { "text": "Start a new terminal\nRun: docker ps\nCopy the CONTAINER ID of the spark-master container\nRun: docker exec -it <spark_master_container_id> bash\nRun: cat logs/spark-master.out\nCheck for the log when the error happened\nGoogle the error message from there", "section": "Module 6: streaming with kafka", "question": "Python Kafka: ./spark-submit.sh streaming.py - How to check why Spark master connection fails" }, { "text": "Make sure your java version is 11 or 8.\nCheck your version by:\njava --version\nCheck all your versions by:\n/usr/libexec/java_home -V\nIf you already have got java 11 but just not selected as default, select the specific version by:\nexport JAVA_HOME=$(/usr/libexec/java_home -v 11.0.22)\n(or other version of 11)", "section": "Module 6: streaming with kafka", "question": "Python Kafka: ./spark-submit.sh streaming.py Error: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext." }, { "text": "In my set up, all of the dependencies listed in gradle.build were not installed in <project_name>-1.0-SNAPSHOT.jar.\nSolution:\nIn build.gradle file, I added the following at the end:\nshadowJar {\narchiveBaseName = \"java-kafka-rides\"\narchiveClassifier = ''\n}\nAnd then in the command line ran \u2018gradle shadowjar\u2019, and run the script from java-kafka-rides-1.0-SNAPSHOT.jar created by the shadowjar", "section": "Module 6: streaming with kafka", "question": "Java Kafka: <project_name>-1.0-SNAPSHOT.jar errors: package xxx does not exist even after gradle build" }, { "text": "confluent-kafka: `pip install confluent-kafka` or `conda install conda-forge::python-confluent-kafka`\nfastavro: pip install fastavro\nAbhirup Ghosh\nCan install Faust Library for Module 6 Python Version due to dependency conflicts?\nThe Faust repository and library is no longer maintained - https://github.com/robinhood/faust\nIf you do not know Java, you now have the option to follow the Python Videos 6.13 & 6.14 here https://www.youtube.com/watch?v=BgAlVknDFlQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=80 and follow the RedPanda Python version here https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/06-streaming/python/redpanda_example - NOTE: I highly recommend watching the Java videos to understand the concept of streaming but you can skip the coding parts - all will become clear when you get to the Python videos and RedPanda files.", "section": "Module 6: streaming with kafka", "question": "Python Kafka: Installing dependencies for python3 06-streaming/python/avro_example/producer.py" }, { "text": "In the project directory, run:\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java", "section": "Module 6: streaming with kafka", "question": "Java Kafka: How to run producer/consumer/kstreams/etc in terminal" }, { "text": "For example, when running JsonConsumer.java, got:\nConsuming form kafka started\nRESULTS:::0\nRESULTS:::0\nRESULTS:::0\nOr when running JsonProducer.java, got:\nException in thread \"main\" 
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.SaslAuthenticationException: Authentication failed\nSolution:\nMake sure in the scripts in src/main/java/org/example/ that you are running (e.g. JsonConsumer.java, JsonProducer.java), the StreamsConfig.BOOTSTRAP_SERVERS_CONFIG is the correct server url (e.g. europe-west3 from example vs europe-west2)\nMake sure cluster key and secrets are updated in src/main/java/org/example/Secrets.java (KAFKA_CLUSTER_KEY and KAFKA_CLUSTER_SECRET)", "section": "Module 6: streaming with kafka", "question": "Java Kafka: When running the producer/consumer/etc java scripts, no results retrieved or no message sent" }, { "text": "Situation: in VS Code, usually there will be a triangle icon next to each test. I couldn\u2019t see it at first and had to do some fixes.\nSolution:\n(Source)\nVS Code\n\u2192 Explorer (first icon on the left navigation bar)\n\u2192 JAVA PROJECTS (bottom collapsable)\n\u2192 icon next in the rightmost position to JAVA PROJECTS\n\u2192 clean Workspace\n\u2192 Confirm by clicking Reload and Delete\nNow you will be able to see the triangle icon next to each test like what you normally see in python tests.\nE.g.:\nYou can also add classes and packages in this window instead of creating files in the project directory", "section": "Module 6: streaming with kafka", "question": "Java Kafka: Tests are not picked up in VSCode" }, { "text": "In Confluent Cloud:\nEnvironment \u2192 default (or whatever you named your environment as) \u2192 The right navigation bar \u2192 \u201cStream Governance API\u201d \u2192 The URL under \u201cEndpoint\u201d\nAnd create credentials from Credentials section below it", "section": "Module 6: streaming with kafka", "question": "Confluent Kafka: Where can I find schema registry URL?" }, { "text": "You can check the version of your local spark using spark-submit --version. In the build.sh file of the Python folder, make sure that SPARK_VERSION matches your local version. Similarly, make sure the pyspark you pip installed also matches this version.", "section": "Module 6: streaming with kafka", "question": "How do I check compatibility of local and container Spark versions?" }, { "text": "According to https://github.com/dpkp/kafka-python/\n\u201cDUE TO ISSUES WITH RELEASES, IT IS SUGGESTED TO USE https://github.com/wbarnha/kafka-python-ng FOR THE TIME BEING\u201d\nUse pip install kafka-python-ng instead", "section": "Project", "question": "How to fix the error \"ModuleNotFoundError: No module named 'kafka.vendor.six.moves'\"?" }, { "text": "Each submitted project will be evaluated by 3 (three) randomly assigned students that have also submitted the project.\nYou will also be responsible for grading the projects from 3 fellow students yourself. Please be aware that: not complying to this rule also implies you failing to achieve the Certificate at the end of the course.\nThe final grade you get will be the median score of the grades you get from the peer reviewers.\nAnd of course, the peer review criteria for evaluating or being evaluated must follow the guidelines defined here.", "section": "Project", "question": "How is my capstone project going to be evaluated?" }, { "text": "There is only ONE project for this Zoomcamp. You do not need to submit or create two projects. There are simply TWO chances to pass the course. You can use the Second Attempt if you a) fail the first attempt b) do not have the time due to other engagements such as holiday or sickness etc. 
to enter your project into the first attempt.", "section": "Project", "question": "Project 1 & Project 2" }, { "text": "See a list of datasets here: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_7_project/datasets.md", "section": "Project", "question": "Does anyone know nice and relatively large datasets?" }, { "text": "You need to redefine the python environment variable to that of your user account", "section": "Project", "question": "How to run python as start up script?" }, { "text": "Initiate a Spark Session\nspark = (SparkSession\n.builder\n.appName(app_name)\n.master(master=master)\n.getOrCreate())\nspark.streams.resetTerminated()\nquery1 = spark\n.readStream\n\u2026\n\u2026\n.load()\nquery2 = spark\n.readStream\n\u2026\n\u2026\n.load()\nquery3 = spark\n.readStream\n\u2026\n\u2026\n.load()\nquery1.start()\nquery2.start()\nquery3.start()\nspark.streams.awaitAnyTermination() #waits for any one of the query to receive kill signal or error failure. This is asynchronous\n# On the contrary query3.start().awaitTermination() is a blocking ex call. Works well when we are reading only from one topic.", "section": "Project", "question": "Spark Streaming - How do I read from multiple topics in the same Spark Session" }, { "text": "Transformed data can be moved in to azure blob storage and then it can be moved in to azure SQL DB, instead of moving directly from databricks to Azure SQL DB.", "section": "Project", "question": "Data Transformation from Databricks to Azure SQL DB" }, { "text": "The trial dbt account provides access to dbt API. Job will still be needed to be added manually. Airflow will run the job using a python operator calling the API. You will need to provide api key, job id, etc. (be careful not committing it to Github).\nDetailed explanation here: https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment\nSource code example here: https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py", "section": "Project", "question": "Orchestrating dbt with Airflow" }, { "text": "https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index.html\nhttps://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/operators/dataproc.html\nGive the following roles to you service account:\nDataProc Administrator\nService Account User (explanation here)\nUse DataprocSubmitPySparkJobOperator, DataprocDeleteClusterOperator and DataprocCreateClusterOperator.\nWhen using DataprocSubmitPySparkJobOperator, do not forget to add:\ndataproc_jars = [\"gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.24.0.jar\"]\nBecause DataProc does not already have the BigQuery Connector.", "section": "Project", "question": "Orchestrating DataProc with Airflow" }, { "text": "You can trigger your dbt job in Mage pipeline. For this get your dbt cloud api key under settings/Api tokens/personal tokens. 
Add it safely to your .env file.\nFor example:\ndbt_api_trigger=dbt_**\nNavigate to the job page and find the API trigger link.\nThen create a custom Mage Python block with a simple HTTP request like this:\nimport os\nimport requests\nfrom dotenv import load_dotenv\nfrom pathlib import Path\ndotenv_path = Path('/home/src/.env')\nload_dotenv(dotenv_path=dotenv_path)\ndbt_api_trigger = os.getenv('dbt_api_trigger')\nurl = f\"https://cloud.getdbt.com/api/v2/accounts/{dbt_account_id}/jobs/<job_id>/run/\"\nheaders = {\n \"Authorization\": f\"Token {dbt_api_trigger}\",\n \"Content-Type\": \"application/json\" }\nbody = {\n \"cause\": \"Triggered via API\"\n }\nresponse = requests.post(url, headers=headers, json=body)\nVoila! You triggered a dbt job from your Mage pipeline.", "section": "Project", "question": "Orchestrating dbt cloud with Mage" }, { "text": "The Slack thread: https://datatalks-club.slack.com/archives/C01FABYF2RG/p1677678161866999\nThe question: sometimes, even if you take plenty of effort to document every single step, we can't be sure the person doing the peer review will be able to follow it, so how will this criterion be evaluated?\nAlex clarifies: \u201cIdeally yes, you should try to re-run everything. But I understand that not everyone has time to do it, so if you check the code by looking at it and try to spot errors, places with missing instructions and so on - then it's already great\u201d", "section": "Project", "question": "Project evaluation - Reproducibility" }, { "text": "The Key Vault in the Azure cloud is used to store credentials, passwords or secrets of the different tech stacks used in Azure. For example, if you do not want to expose the password of a SQL database, you can save the password under a given name and use it in other Azure services.", "section": "Project", "question": "Key Vault in Azure cloud stack" }, { "text": "You can get the version of py4j from inside docker using this command:\ndocker exec -it --user airflow airflow-airflow-scheduler-1 bash -c \"ls /opt/spark/python/lib\"", "section": "Project", "question": "Spark docker - `ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`" }, { "text": "Either use conda or pip for managing the venv; using both of them together will cause incompatibilities.\nIf you\u2019re using conda, install psycopg2 using the conda-forge channel, which may handle the architecture compatibility automatically:\nconda install -c conda-forge psycopg2\nIf using pip, do the normal install:\npip install psycopg2", "section": "Project", "question": "psycopg2 complains of incompatible environment e.g x86 instead of amd" }, { "text": "This is not a FAQ but more of a piece of advice if you want to set up dbt locally; I did it in the following way:\nI had the postgres instance from week 2 (year 2024) up (the docker-compose)\nmkdir dbt\nvi dbt/profiles.yml\nAnd here I attached this content (only the required fields) and replaced it with the proper values (for instance, mine were in the .env file of the week 2 docker folder)\ncd dbt && git clone https://github.com/dbt-labs/dbt-starter-project\nmkdir project && cd project && mv dbt-starter-project/* .\nMake sure that you align the profile name in profiles.yml with the dbt_project.yml file\nAdd this line anywhere in the dbt_project.yml file:\nconfig-version: 2\ndocker run --network=mage-zoomcamp_default --mount type=bind,source=/<your-path>/dbt/project,target=/usr/app --mount type=bind,source=/<your-path>/profiles.yml,target=/root/.dbt/profiles.yml ghcr.io/dbt-labs/dbt-postgres ls\nIf you have trouble run\ndocker 
run --network=mage-zoomcamp_default --mount type=bind,source=/<your-path>/dbt/project,target=/usr/app --mount type=bind,source=/<your-path>/profiles.yml,target=/root/.dbt/profiles.yml ghcr.io/dbt-labs/dbt-postgres debug", "section": "Project", "question": "Setting up dbt locally with Docker and Postgres" }, { "text": "The following line should be included in pyspark configuration\n# Example initialization of SparkSession variable\nspark = (SparkSession.builder\n.master(...)\n.appName(...)\n# Add the following configuration\n.config(\"spark.jars.packages\", \"com.google.cloud.spark:spark-3.5-bigquery:0.37.0\")\n)", "section": "Project", "question": "How to connect Pyspark with BigQuery?" }, { "text": "Install the astronomer-cosmos package as a dependency. (see Terraform example).\nMake a new folder, dbt/, inside the dags/ folder of your Composer GCP bucket and copy paste your dbt-core project there. (see example)\nEnsure your profiles.yml is configured to authenticate with a service account key. (see BigQuery example)\nCreate a new DAG using the DbtTaskGroup class and a ProfileConfig specifying a profiles_yml_filepath that points to the location of your JSON key file. (see example)\nYour dbt lineage graph should now appear as tasks inside a task group like this:", "section": "Course Management Form for Homeworks", "question": "How to run a dbt-core project as an Airflow Task Group on Google Cloud Composer using a service account JSON key" }, { "text": "The display name listed on the leaderboard is an auto-generated randomized name. You can edit it to be a nickname, or your real name, if you prefer. Your entry on the Leaderboard is the one highlighted in teal(?) / light green (?).\nThe Certificate name should be your actual name that you want to appear on your certificate after completing the course.\nThe \"Display on Leaderboard\" option indicates whether you want your name to be listed on the course leaderboard.\nQuestion: Is it possible to create external tables in BigQuery using URLs, such as those from the NY Taxi data website?\nAnswer: Not really, only Bigtable, Cloud Storage, and Google Drive are supported data stores.", "section": "Workshop 1 - dlthub", "question": "Edit Course Profile." }, { "text": "Answer: To run the provided code, ensure that the 'dlt[duckdb]' package is installed. You can do this by executing the provided installation command: !pip install dlt[duckdb]. If you\u2019re doing it locally, be sure to also have duckdb pip installed (even before the duckdb package is loaded).", "section": "Workshop 1 - dlthub", "question": "How do I install the necessary dependencies to run the code?" }, { "text": "If you are running Jupyter Notebook on a fresh new Codespace or in local machine with a new Virtual Environment, you will need this package to run the starter Jupyter Notebook offered by the teacher. Execute this:\npip install jupyter", "section": "Workshop 1 - dlthub", "question": "Other packages needed but not listed" }, { "text": "Alternatively, you can switch to in-file storage with:", "section": "Workshop 1 - dlthub", "question": "How can I use DuckDB In-Memory database with dlt ?" }, { "text": "After loading, you should have a total of 8 records, and ID 3 should have age 33\nQuestion: Calculate the sum of ages of all the people loaded as described above\nThe sum of all eight records' respective ages is too big to be in the choices. 
You need to first filter out the people whose occupation is equal to None in order to get an answer that is close to or present in the given choices. \ud83d\ude03\n----------------------------------------------------------------------------------------\nFIXED = use a raw string and keep the file:/// at the start of your file path\nI'm having an issue with the dlt workshop notebook. The 'Load to Parquet file' section specifically. No matter what I change the file path to, it's still saving the dlt files directly to my C drive.\n# Set the bucket_url. We can also use a local folder\nos.environ['DESTINATION__FILESYSTEM__BUCKET_URL'] = r'file:///content/.dlt/my_folder'\nurl = \"https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl\"\n# Define your pipeline\npipeline = dlt.pipeline(\npipeline_name='my_pipeline',\ndestination='filesystem',\ndataset_name='mydata'\n)\n# Run the pipeline with the generator we created earlier.\nload_info = pipeline.run(stream_download_jsonl(url), table_name=\"users\", loader_file_format=\"parquet\")\nprint(load_info)\n# Get a list of all Parquet files in the specified folder\nparquet_files = glob.glob('/content/.dlt/my_folder/mydata/users/*.parquet')\n# show parquet files\nfor file in parquet_files:\nprint(file)", "section": "Workshop 2 - RisingWave", "question": "Homework - dlt Exercise 3 - Merge a generator concerns" }, { "text": "Check the contents of the repository with ls - the command.sh file should be in the root folder\nIf it is not, verify that you had cloned the correct repository - https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04", "section": "Workshop 2 - RisingWave", "question": "command.sh Error - source: no such file or directory: command.sh" }, { "text": "psql is a command line tool that is installed alongside PostgreSQL DB, but since we've always been running PostgreSQL in a container, you've only got `pgcli`, which lacks the feature to run a sql script into the DB. Besides, having a command line for each database flavor you'll have to deal with as a Data Professional is far from ideal.\nSo, instead, you can use usql. Check the docs for details on how to install for your OS. On macOS, it supports `homebrew`, and on Windows, it supports scoop.\nSo, to run the taxi_trips.sql script with usql:", "section": "Workshop 2 - RisingWave", "question": "psql - command not found: psql (alternative install)" }, { "text": "If you encounter this error and are certain that you have docker compose installed, but typically run it as docker compose without the hyphen, then consider editing command.sh file by removing the hyphen from \u2018docker-compose\u2019. Example:\nstart-cluster() {\ndocker compose -f docker/docker-compose.yml up -d\n}", "section": "Workshop 2 - RisingWave", "question": "Setup - source command.sh - error: \u201cdocker-compose\u201d not found" }, { "text": "ERROR: The Compose file './docker/docker-compose.yml' is invalid because:\nInvalid top-level property \"x-image\". Valid top-level sections for this Compose file are: version, services, networks, volumes, secrets, configs, and extensions starting with \"x-\".\nYou might be seeing this error because you're using the wrong Compose file version. 
Either specify a supported version (e.g \"2.2\" or \"3.3\") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.\nFor more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/\nIf you encounter the above error and have docker-compose installed, try updating your version of docker-compose. At the time of reporting this issue (March 17 2024), Ubuntu does not seem to support a docker-compose version high enough to run the required docker images. If you have this error and are on an Ubuntu machine, consider starting a VM with a Debian image, or look for an alternative way to install the latest version of docker-compose on your machine.", "section": "Workshop 2 - RisingWave", "question": "Setup - start-cluster error: Invalid top-level property x-image" }, { "text": "Ans: [source] Yes, it is so that we can observe the changes as we\u2019re working on the queries in real time. The script is changing the date timestamp to the current time, so our queries with the now() filter would work. Open another terminal tab to copy+paste the queries while the stream-kafka script is running in the background.\nNoel: I have recently increased this up to 100 at a time, you may pull the latest changes from the repository.", "section": "Workshop 2 - RisingWave", "question": "stream-kafka Qn: Is it expected that the records are being ingested 10 at a time?" }, { "text": "Ans: No, it is not.", "section": "Workshop 2 - RisingWave", "question": "Setup - Qn: Is kafka install required for the RisingWave workshop? [source]" }, { "text": "Ans: about 7 GB free for all the containers to be provisioned, and then psql still needs to run and ingest the taxi data, so maybe 10 GB in total?", "section": "Workshop 2 - RisingWave", "question": "Setup - Qn: How much free disk space should we have? [source]" }, { "text": "Replace psycopg2==2.9.9 with psycopg2-binary in the requirements.txt file [source] [another]\nWhen you open another terminal to run psql, remember to do the source command.sh step in each terminal session\n---------------------------------------------------------------------------------------------", "section": "Workshop 2 - RisingWave", "question": "Psycopg2 - issues when running stream-kafka script" }, { "text": "If you\u2019re using an Anaconda installation:\ncd ~\nconda install gcc\nSource back into your RisingWave venv: source .venv/bin/activate\npip install psycopg2-binary\npip install -r requirements.txt\nFor some reason this worked: the Conda base doesn\u2019t have GCC (GNU Compiler Collection) installed, a compiler system that supports various programming languages. Without it, pip fails to install pyproject.toml-based projects.\n\u201cIt's possible that in your specific environment, the gcc installation was required at the system level rather than within the virtual environment. This can happen if the build process for psycopg2 tries to access system-level dependencies during installation.\nInstalling gcc in your main Python installation (Conda) would make it available system-wide, allowing any Python environment to access it when necessary for building packages.\u201d\ngcc stands for GNU Compiler Collection. 
It is a compiler system developed by the GNU Project that supports various programming languages, including C, C++, Objective-C, and Fortran.\nGCC is widely used for compiling source code written in these languages into executable programs or libraries. It's a key tool in the software development process, particularly in the compilation stage where source code is translated into machine code that can be executed by a computer's processor.\nIn addition to compiling source code, GCC also provides various optimization options, debugging support, and extensive documentation, making it a powerful and versatile tool for developers across different platforms and architectures.\n\u2014-----------------------------------------------------------------------------------", "section": "Workshop 2 - RisingWave", "question": "Psycopg2 - `Could not build wheels for psycopg2, which is required to install pyproject.toml-based projects`" }, { "text": "Below I have listed some steps I took to rectify this and potentially other minor errors, in Windows:\nUse the git bash terminal in windows.\nActivate python venv from git bash: source .venv/Scripts/activate\nModify the seed_kafka.py file: in the first line, replace python3 with python.\nNow from git bash, run the seed-kafka cmd. It should work now.\nAdditional Notes:\nYou can connect to the RisingWave cluster from Powershell with the command psql -h localhost -p 4566 -d dev -U root , otherwise it asks for a password.\nThe equivalent of source commands.sh in Powershell is . .\\commands.sh from the workshop directory.\nHope this can save you from some trouble in case you're doing this workshop on Windows like I am.\n\u2014--------------------------------------------------------------------------------------", "section": "Workshop 2 - RisingWave", "question": "Psycopg2 InternalError: Failed to run the query - when running the seed-kafka command after initial setup." }, { "text": "In case the script gets stuck on\n%3|1709652240.100|FAIL|rdkafka#producer-2| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT)gre\nafter trying to load the trip data, check the logs of the message_queue container in docker. If it keeps restarting with Could not initialize seastar: std::runtime_error (insufficient physical memory: needed 4294967296 available 4067422208) as the last message, then go to the docker-compose file in the docker folder of the project and change the \u2018memory\u2019 command for the message_queue service for some lower value.\nSolution: lower the memory allocation of the service \u201cmessage_queue\u201d in your docker-compose file from 4GB. 
If you have the \u201cinsufficient physical memory\u201d error message (try 3GB)\nIssue: Running psql -f risingwave-sql/table/trip_data.sql after starting services with \u2018default\u2019 values using docker-compose up gives the error \u201cpsql:risingwave-sql/table/trip_data.sql:61: ERROR: syntax error at or near \".\" LINE 60: properties.bootstrap.server='message_queue:29092'\u201d\nSolution: Make sure you have run source commands.sh in each terminal window", "section": "Workshop 2 - RisingWave", "question": "Running stream-kafka script gets stuck on a loop with Connection Refused" }, { "text": "Use seed-kafka instead of stream-kafka to get a static set of results.", "section": "Workshop 2 - RisingWave", "question": "For the homework questions is there a specific number of records that have to be processed to obtain the final answer?" }, { "text": "It is best to use the order by and limit clause on the query to the materialized view instead of the materialized view creation in order to guarantee consistent results\nHomework - The answers in the homework do not match the provided options: You must follow the following steps: 1. clean-cluster 2. docker volume prune and use seed-kafka instead of stream-kafka. Ensure that the number of records is 100K.", "section": "Workshop 2 - RisingWave", "question": "Homework - Materialized view does not guarantee order by warning" }, { "text": "For this workshop, and if you are following the view from Noel (2024) this requires you to install postgres to use it on your terminal. Found this steps (commands) to get it done [source]:\nwget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -\nsudo sh -c 'echo \"deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main\" >> /etc/apt/sources.list.d/pgdg.list'\nsudo apt update\napt install postgresql postgresql-contrib\n(comment): now let\u2019s check the service for postgresql\nservice postgresql status\n(comment) If down: use the next command\nservice postgresql start\n(comment) And your are done", "section": "Workshop 2 - RisingWave", "question": "How to install postgress on Linux like OS" }, { "text": "Refer to the solution given in the first solution here:\nhttps://stackoverflow.com/questions/24683221/xdg-open-no-method-available-even-after-installing-xdg-utils\nInstead of w3m use any other browser of your choice.\nIt is just trying to open the index.html file. Which you can do from your File Explorer/Finder. If you\u2019re on wsl try using explorer.exe index.html", "section": "Workshop 2 - RisingWave", "question": "Unable to Open Dashboard as xdg-open doesn\u2019t open any browser" }, { "text": "Example Error:\nWhen attempting to execute a Python script named seed-kafka.py or server.py with the following shebang line specifying Python 3 as the interpreter:\nUsers may encounter the following error in a Unix-like environment:\nThis error indicates that there is a problem with the Python interpreter path specified in the shebang line. The presence of the \\r character suggests that the script was edited or created in a Windows environment, causing the interpreter path to be incorrect when executed in Unix-like environments.\n2 Solutions:\nEither one or the other\nUpdate Shebang Line:\nVerify Python Interpreter Path: Use the which python3 command to determine the path to the Python 3 interpreter available in the current environment.\nUpdate Shebang Line: Open the script file in a text editor. 
Modify the shebang line to point to the correct Python interpreter path found in the previous step. Ensure that the shebang line is consistent with the Python interpreter path in the execution environment.\nExample Shebang Line:\nReplace /usr/bin/env python3 with the correct Python interpreter path found using which python3.\nConvert Line Endings:\nUse the dos2unix command-line tool to convert the line endings of the script from Windows-style to Unix-style.\nThis removes the extraneous carriage return characters (\\r), resolving issues related to unexpected tokens and ensuring compatibility with Unix-like environments.\nExample Command:", "section": "Workshop 2 - RisingWave", "question": "Resolving Python Interpreter Path Inconsistencies in Unix-like Environments" }, { "text": "Ans : Windowing in streaming SQL involves defining a time-based or row-based boundary for data processing. It allows you to analyze and aggregate data over specific time intervals or based on the number of events received, providing a way to manage and organize streaming data for analysis.", "section": "Workshop 2 - RisingWave", "question": "How does windowing work in Sql?" }, { "text": "Python 3.12.1, is not compatible with kafka-python-2.0.2. Therefore, instead of running \"pip install kafka-python\", you can resolve the issue by using \"pip install git+https://github.com/dpkp/kafka-python.git\". If you have already installed kafka-python, you need to run \"pip uninstall kafka-python\" before executing \"pip install git+https://github.com/dpkp/kafka-python.git\" to resolve the compatibility issue.\nQ:In the Mage pipeline, individual blocks run successfully. However, when executing the pipeline as a whole, some blocks fail.\nA: I have the following key-value pair in io_config.yaml file configured but still Mage blocks failed to generate OAuth and authenticate with GCP: GOOGLE_SERVICE_ACC_KEY_FILEPATH: \"{{ env_var('GCP_CREDENTIALS') }}\". The GCP_CREDENTIALS variable holds the full path to the service account key's JSON file. Adding the following line within the failed code block resolved the issue: os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.environ.get('GCP_CREDENTIALS').\nThis occurs because the path to profiles.yml is not correctly specified. You can rectify this by:\n\u201cexport DBT_PROFILES_DBT=path/to/profiles.yml\u201d\nEg., /home/src/magic-zoomcamp/dbt/project_name/\nDo the similar for DBT_PROJECT_DIR if getting similar issue with dbt_project.yml.\nOnce DIRs are set,:\n\u201cdbt debug \u2013config-dir\u201d\nThis would update your paths. To maintain same path across sessions, use the path variables in your .env file.\nTo add triggers in mage pipelines via CLI, you can create a trigger of type API, and copy the API links.\nEg. 
link: http://localhost:6789/api/pipeline_schedules/10/pipeline_runs/f3a1a4228fc64cfd85295b668c93f3b2\nThen create a trigger.py as such:\nimport os\nimport requests\nclass MageTrigger:\nOPTIONS = {\n\"<pipeline_name>\": {\n\"trigger_id\": 10,\n\"key\": \"f3a1a4228fc64cfd85295b668c93f3b2\"\n}\n}\n@staticmethod\ndef trigger_pipeline(pipeline_name, variables=None):\ntrigger_id = MageTrigger.OPTIONS[pipeline_name][\"trigger_id\"]\nkey = MageTrigger.OPTIONS[pipeline_name][\"key\"]\nendpoint = f\"http://localhost:6789/api/pipeline_schedules/{trigger_id}/pipeline_runs/{key}\"\nheaders = {'Content-Type': 'application/json'}\npayload = {}\nif variables is not None:\npayload['pipeline_run'] = {'variables': variables}\nresponse = requests.post(endpoint, headers=headers, json=payload)\nreturn response\nMageTrigger.trigger_pipeline(\"<pipeline_name>\")\nFinally, after the mage server is up an running, simply this command:\npython trigger.py from mage directory in terminal.\nCan I do data partitioning & clustering run by dbt pipeline, or I would need to do this manually in BigQuery afterwards?\nYou can use this configuration in your DBT model:\n{\n\"field\": \"<field name>\",\n\"data_type\": \"<timestamp | date | datetime | int64>\",\n\"granularity\": \"<hour | day | month | year>\"\n# Only required if data_type is \"int64\"\n\"range\": {\n\"start\": <int>,\n\"end\": <int>,\n\"interval\": <int>\n}\n}\nand for clustering\n{{\nconfig(\nmaterialized = \"table\",\ncluster_by = \"order_id\",\n)\n}}\nmore details in: https://docs.getdbt.com/reference/resource-configs/bigquery-configs", "section": "Triggers in Mage via CLI", "question": "Encountering the error \"ModuleNotFoundError: No module named 'kafka.vendor.six.moves'\" when running \"from kafka import KafkaProducer\" in Jupyter Notebook for Module 6 Homework?" }, { "text": "Docker Commands\n# Create a Docker Image from a base image\nDocker run -it ubuntu bash\n#List docker images\nDocker images list\n#List Running containers\nDocker ps -a\n#List with full container ids\nDocker ps -a --no-trunc\n#Add onto existing image to create new image\nDocker commit -a <User_Name> -m \"Message\" container_id New_Image_Name\n# Create a Docker Image with an entrypoint from a base image\nDocker run -it --entry_point=bash python:3.11\n#Attach to a stopped container\nDocker start -ai <Container_Name>\n#Attach to a running container\ndocker exec -it <Container_ID> bash\n#copying from host to container\nDocker cp <SRC_PATH/file> <containerid>:<dest_path>\n#copying from container to host\nDocker cp <containerid>:<Srct_path> <Dest Path on host/file>\n#Create an image from a docker file\nDocker build -t <Image_Name> <Location of Dockerfile>\n#DockerFile Options and best practices\nhttps://devopscube.com/build-docker-image/\n#Docker delete all images forcefully\ndocker rmi -f $(docker images -aq)\n#Docker delete all containers forcefully\ndocker rm -f $(docker ps -qa)\n#docker compose creation\nhttps://www.composerize.com/\nGCP Commands\n1. Create SSH Keys\n2. Added to the Settings of Compute Engine VM Instance\n3. SSH-ed into the VM Instance with a config similar to following\nHost my-website.com\nHostName my-website.com\nUser my-user\nIdentityFile ~/.ssh/id_rsa\n4. Installed Anaconda by installing the sh file through bash <Anaconda.sh>\n5. Install Docker after\na. Sudo apt-get update\nb. Sudo apt-get docker\n6. To run Docker without SUDO permissions\na. https://github.com/sindresorhus/guides/blob/main/docker-without-sudo.md\n7. Google cloud remote copy\na. 
gcloud compute scp LOCAL_FILE_PATHVM_NAME:REMOTE_DIR\nInstall GCP Cloud SDK on Docker Machine\nhttps://stackoverflow.com/questions/23247943/trouble-installing-google-cloud-sdk-in-ubuntu\nsudo apt-get install apt-transport-https ca-certificates gnupg && echo \"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main\"| sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list&& curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && sudo apt-get update && sudo apt-get install google-cloud-sdk && sudo apt-get install google-cloud-sdk-app-engine-java && sudo apt-get install google-cloud-sdk-app-engine-python && gcloud init\nAnaconda Commands\n#Activate environment\nConda Activate <environment_name>\n#DeActivate environment\nConda DeActivate <environment_name>\n#Start iterm without conda environment\nconda config --set auto_activate_base false\n# Using Conda forge as default (Community driven packaging recipes and solutions)\nhttps://conda-forge.org/docs/user/introduction.html\nconda --version\nconda update conda\nconda config --add channels conda-forge\nconda config --set channel_priority strict\n#Using Libmamba as Solver\nconda install pgcli --solver=libmamba\nLinux/MAC Commands\nStarting and Stopping Services on Linux\n\u25cf \tsudo systemctl start postgresql\n\u25cf \tsudo systemctl stop postgresql\nStarting and Stopping Services on MAC\n\u25cf launchctl start postgresql\n\u25cf launchctl stop postgresql\nIdentifying processes listening to a Port across MAC/Linux\nsudo lsof -i -P -n | grep LISTEN\n$ sudo netstat -tulpn | grep LISTEN\n$ sudo ss -tulpn | grep LISTEN\n$ sudo lsof -i:22 ## see a specific port such as 22 ##\n$ sudo nmap -sTU -O IP-address-Here\nInstalling a package on Debian\nsudo apt install <packagename>\nListing all package on Debian\nDpkg -l | grep <packagename>\nUnInstalling a package on Debian\nSudo apt remove <packagename>\nSudo apt autoclean && sudo apt autoremove\nList all Processes on Debian/Ubuntu\nPs -aux\napt-get update && apt-get install procps\napt-get install iproute2 for ss -tulpn\n#Postgres Install\nsudo sh -c 'echo \"deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main\" > /etc/apt/sources.list.d/pgdg.list'\nwget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -\nsudo apt-get update\nsudo apt-get -y install postgresql\n#Changing Postgresql port to 5432\n- sudo service postgresql stop - sed -e 's/^port.*/port = 5432/' /etc/postgresql/10/main/postgresql.conf > postgresql.conf\n- sudo chown postgres postgresql.conf\n- sudo mv postgresql.conf /etc/postgresql/10/main\n- sudo systemctl restart postgresql", "section": "Triggers in Mage via CLI", "question": "Basic Commands" } ] }, { "course": "machine-learning-zoomcamp", "documents": [ { "text": "Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there\u2019s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork", "section": "General course-related questions", "question": "How do I sign up?" 
}, { "text": "The course videos are pre-recorded, you can start watching the course right now.\nWe will also occasionally have office hours - live sessions where we will answer your questions. The office hours sessions are recorded too.\nYou can see the office hours as well as the pre-recorded course videos in the course playlist on YouTube.", "section": "General course-related questions", "question": "Is it going to be live? When?" }, { "text": "Everything is recorded, so you won\u2019t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.", "section": "General course-related questions", "question": "What if I miss a session?" }, { "text": "The bare minimum. The focus is more on practice, and we'll cover the theory only on the intuitive level.: https://mlbookcamp.com/article/python\nFor example, we won't derive the gradient update rule for logistic regression (there are other great courses for that), but we'll cover how to use logistic regression and make sense of the results.", "section": "General course-related questions", "question": "How much theory will you cover?" }, { "text": "Yes! We'll cover some linear algebra in the course, but in general, there will be very few formulas, mostly code.\nHere are some interesting videos covering linear algebra that you can already watch: ML Zoomcamp 1.8 - Linear Algebra Refresher from Alexey Grigorev or the excellent playlist from 3Blue1Brown Vectors | Chapter 1, Essence of linear algebra. Never hesitate to ask the community for help if you have any question.\n(M\u00e9lanie Fouesnard)", "section": "General course-related questions", "question": "I don't know math. Can I take the course?" }, { "text": "The process is automated now, so you should receive the email eventually. If you haven\u2019t, check your promotions tab in Gmail as well as spam.\nIf you unsubscribed from our newsletter, you won't get course related updates too.\nBut don't worry, it\u2019s not a problem. To make sure you don\u2019t miss anything, join the #course-ml-zoomcamp channel in Slack and our telegram channel with announcements. This is enough to follow the course.", "section": "General course-related questions", "question": "I filled the form, but haven't received a confirmation email. Is it normal?" }, { "text": "Approximately 4 months, but may take more if you want to do some extra activities (an extra project, an article, etc)", "section": "General course-related questions", "question": "How long is the course?" }, { "text": "Around ~10 hours per week. Timur Kamaliev did a detailed analysis of how much time students of the previous cohort needed to spend on different modules and projects. Full article", "section": "General course-related questions", "question": "How much time do I need for this course?" }, { "text": "Yes, if you finish at least 2 out of 3 projects and review 3 peers\u2019 Projects by the deadline, you will get a certificate. This is what it looks like: link. There\u2019s also a version without a robot: link.", "section": "General course-related questions", "question": "Will I get a certificate?" }, { "text": "Yes, it's possible. See the previous answer.", "section": "General course-related questions", "question": "Will I get a certificate if I missed the midterm project?" }, { "text": "Check this article. If you know everything in this article, you know enough. 
If you don\u2019t, read the article and join the course too :)\nIntroduction to Python \u2013 Machine Learning Bookcamp\nYou can follow this English course from the OpenClassrooms e-learning platform, which is free and covers the python basics for data analysis: Learn Python Basics for Data Analysis - OpenClassrooms. It is important to know some basics such as: how to run a Jupyter notebook, how to import libraries (and what libraries are), how to declare a variable (and what variables are) and some important operations regarding data analysis.\n(M\u00e9lanie Fouesnard)", "section": "General course-related questions", "question": "How much Python should I know?" }, { "text": "For the Machine Learning part, all you need is a working laptop with an internet connection. The Deep Learning part is more resource intensive, but for that you can use a cloud (we use Saturn cloud but can be anything else).\n(Rileen Sinha; based on response by Alexey on Slack)", "section": "General course-related questions", "question": "Any particular hardware requirements for the course or everything is mostly cloud? TIA! Couldn't really find this in the FAQ." }, { "text": "Here is an article that worked for me: https://knowmledge.com/2023/12/07/ml-zoomcamp-2023-project/", "section": "General course-related questions", "question": "How to setup TensorFlow with GPU support on Ubuntu?" }, { "text": "Here\u2019s how you join a channel in Slack: https://slack.com/help/articles/205239967-Join-a-channel\nClick \u201cAll channels\u201d at the top of your left sidebar. If you don't see this option, click \u201cMore\u201d to find it.\nBrowse the list of public channels in your workspace, or use the search bar to search by channel name or description.\nSelect a channel from the list to view it.\nClick Join Channel.\nDo we need to provide the GitHub link to only our code corresponding to the homework questions?\nYes. You are required to provide the URL to your repo in order to receive a grade", "section": "General course-related questions", "question": "I\u2019m new to Slack and can\u2019t find the course channel. Where is it?" }, { "text": "Yes, you can. You won\u2019t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers\u2019 Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.", "section": "General course-related questions", "question": "The course has already started. Can I still join it?" }, { "text": "The course is available in the self-paced mode too, so you can go through the materials at any time. But if you want to do it as a cohort with other students, the next iterations will happen in September 2023, September 2024 (and potentially other Septembers as well).", "section": "General course-related questions", "question": "When does the next iteration start?" }, { "text": "No, it\u2019s not possible. The form is closed after the due date. But don\u2019t worry, homework is not mandatory for finishing the course.", "section": "General course-related questions", "question": "Can I submit the homework after the due date?" }, { "text": "Welcome to the course! Go to the course page (http://mlzoomcamp.com/), scroll down and start going through the course materials. 
Then read everything in the cohort folder for your cohort\u2019s year.\nClick on the links and start watching the videos. Also watch office hours from previous cohorts. Go to DTC youtube channel and click on Playlists and search for {course yyyy}. ML Zoomcamp was first launched in 2021.\nOr you can just use this link: http://mlzoomcamp.com/#syllabus", "section": "General course-related questions", "question": "I just joined. What should I do next? How can I access course materials?" }, { "text": "For the 2023 cohort, you can see the deadlines here (it\u2019s taken from the 2023 cohort page)", "section": "General course-related questions", "question": "What are the deadlines in this course?" }, { "text": "There\u2019s not much difference. There was one special module (BentoML) in the previous iteration of the course, but the rest of the modules are the same as in 2022. The homework this year is different.", "section": "General course-related questions", "question": "What\u2019s the difference between the previous iteration of the course (2022) and this one (2023)?" }, { "text": "We won\u2019t re-record the course videos. The focus of the course and the skills we want to teach remained the same, and the videos are still up-to-date.\nIf you haven\u2019t taken part in the previous iteration, you can start watching the videos. It\u2019ll be useful for you and you will learn new things. However, we recommend using Python 3.10 now instead of Python 3.8.", "section": "General course-related questions", "question": "The course videos are from the previous iteration. Will you release new ones or we\u2019ll use the videos from 2021?" }, { "text": "When you post about what you learned from the course on your social media pages, use the tag #mlzoomcamp. When you submit your homework, there\u2019s a section in the form for putting the links there. Separate multiple links by any whitespace character (linebreak, space, tab, etc).\nFor posting the learning in public links, you get extra scores. But the number of scores is limited to 7 points: if you put more than 7 links in your homework form, you\u2019ll get only 7 points.\nThe same content can be posted to 7 different social sites and still earn you 7 points if you add 7 URLs per week, see Alexey\u2019s reply. (~ ellacharmed)\nFor midterms/capstones, the awarded points are doubled as the duration is longer. 
So for projects the points are capped at 14 for 14 URLs.", "section": "General course-related questions", "question": "Submitting learning in public links" }, { "text": "You can create your own github repository for the course with your notes, homework, projects, etc.\nThen fork the original course repo and add a link under the 'Community Notes' section to the notes that are in your own repo.\nAfter that's done, create a pull request to sync your fork with the original course repo.\n(By Wesley Barreto)", "section": "General course-related questions", "question": "Adding community notes" }, { "text": "Leaderboard Links:\n2023 - https://docs.google.com/spreadsheets/d/e/2PACX-1vSNK_yGtELX1RJK1SSRl4xiUbD0XZMYS6uwHnybc7Mql-WMnMgO7hHSu59w-1cE7FeFZjkopbh684UE/pubhtml\n2022 - https://docs.google.com/spreadsheets/d/e/2PACX-1vQzLGpva63gb2rIilFnpZMRSb-buyr5oGh8jmDtIb8DANo4n6hDalra_WRCl4EZwO1JvaC4UIS62n5h/pubhtml\nPython Code:\nfrom hashlib import sha1\ndef compute_hash(email):\nreturn sha1(email.lower().encode('utf-8')).hexdigest()\nYou need to call the function as follows:\nprint(compute_hash('YOUR_EMAIL_HERE'))\nThe quotes are required to denote that your email is a string.\n(By Wesley Barreto)\nYou can also use this website directly by entering your email: http://www.sha1-online.com. Then, you just have to copy and paste your hashed email in the \u201cresearch\u201d bar of the leaderboard to get your scores.\n(M\u00e9lanie Fouesnard)", "section": "1. Introduction to Machine Learning", "question": "Computing the hash for the leaderboard and project review" }, { "text": "If you get \u201cwget is not recognized as an internal or external command\u201d, you need to install it.\nOn Ubuntu, run\nsudo apt-get install wget\nOn Windows, the easiest way to install wget is to use Chocolatey:\nchoco install wget\nOr you can download a binary from here and put it to any location in your PATH (e.g. C:/tools/)\nOn Mac, the easiest way to install wget is to use brew.\nBrew install wget\nAlternatively, you can use a Python wget library, but instead of simply using \u201cwget\u201d you\u2019ll need eeeto use\npython -m wget\nYou need to install it with pip first:\npip install wget\nAnd then in your python code, for example in your jupyter notebook, use:\nimport wget\nwget.download(\"URL\")\nThis should download whatever is at the URL in the same directory as your code.\n(Memoona Tahira)\nAlternatively, you can read a CSV file from a URL directly with pandas:\nurl = \"https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\"\ndf = pd.read_csv(url)\nValid URL schemes include http, ftp, s3, gs, and file.\nIn some cases you might need to bypass https checks:\nimport ssl\nssl._create_default_https_context = ssl._create_unverified_context\nOr you can use the built-in Python functionality for downloading the files:\nimport urllib.request\nurl = \"https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\"\nurllib.request.urlretrieve(url, \"housing.csv\")\nUrllib.request.urlretrieve() is a standard Python library function available on all devices and platforms. URL requests and URL data retrieval are done with the urllib.request module.\nThe urlretrieve() function allows you to download files from URLs and save them locally. Python programs use it to download files from the internet.\nOn any Python-enabled device or platform, you can use the urllib.request.urlretrieve() function to download the file.\n(Mohammad Emad Sharifi)", "section": "1. 
Introduction to Machine Learning", "question": "wget is not recognized as an internal or external command" }, { "text": "You can use\n!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\nTo download the data too. The exclamation mark !, lets you execute shell commands inside your notebooks. This works generally for shell commands such as ls, cp, mkdir, mv etc . . .\nFor instance, if you then want to move your data into a data directory alongside your notebook-containing directory, you could execute the following:\n!mkdir -p ../data/\n!mv housing.csv ../data/", "section": "1. Introduction to Machine Learning", "question": "Retrieving csv inside notebook" }, { "text": "(Tyler Simpson)", "section": "1. Introduction to Machine Learning", "question": "Windows WSL and VS Code\nIf you have a Windows 11 device and would like to use the built in WSL to access linux you can use the Microsoft Learn link Set up a WSL development environment | Microsoft Learn. To connect this to VS Code download the Microsoft verified VS Code extension \u2018WSL\u2019 this will allow you to remotely connect to your WSL Ubuntu instance as if it was a virtual machine." }, { "text": "This is my first time using Github to upload a code. I was getting the below error message when I type\ngit push -u origin master:\nerror: src refspec master does not match any\nerror: failed to push some refs to 'https://github.com/XXXXXX/1st-Homework.git'\nSolution:\nThe error message got fixed by running below commands:\ngit commit -m \"initial commit\"\ngit push origin main\nIf this is your first time to use Github, you will find a great & straightforward tutorial in this link https://dennisivy.com/github-quickstart\n(Asia Saeed)\nYou can also use the \u201cupload file\u201d functionality from GitHub for that\nIf you write your code on Google colab you can also directly share it on your Github.\n(By Pranab Sarma)", "section": "1. Introduction to Machine Learning", "question": "Uploading the homework to Github" }, { "text": "I'm trying to invert the matrix but I got error that the matrix is singular matrix\nThe singular matrix error is caused by the fact that not every matrix can be inverted. In particular, in the homework it happens because you have to pay close attention when dealing with multiplication (the method .dot) since multiplication is not commutative! X.dot(Y) is not necessarily equal to Y.dot(X), so respect the order otherwise you get the wrong matrix.", "section": "1. Introduction to Machine Learning", "question": "Singular Matrix Error" }, { "text": "I have a problem with my terminal. Command\nconda create -n ml-zoomcamp python=3.9\ndoesn\u2019t work. Any of 3.8/ 3.9 / 3.10 should be all fine\nIf you\u2019re on Windows and just installed Anaconda, you can use Anaconda\u2019s own terminal called \u201cAnaconda Prompt\u201d.\nIf you don\u2019t have Anaconda or Miniconda, you should install it first\n(Tatyana Mardvilko)", "section": "1. Introduction to Machine Learning", "question": "Conda is not an internal command" }, { "text": "How do I read the dataset with Pandas in Windows?\nI used the code below but not working\ndf = pd.read_csv('C:\\Users\\username\\Downloads\\data.csv')\nUnlike Linux/Mac OS, Windows uses the backslash (\\) to navigate the files that cause the conflict with Python. The problem with using the backslash is that in Python, the '\\' has a purpose known as an escape sequence. 
Escape sequences allow us to include special characters in strings, for example, \"\\n\" to add a new line or \"\\t\" to add spaces, etc. To avoid the issue we just need to add \"r\" before the file path and Python will treat it as a literal string (not an escape sequence).\nHere\u2019s how we should be loading the file instead:\ndf = pd.read_csv(r'C:\\Users\\username\\Downloads\\data.csv')\n(Muhammad Awon)", "section": "1. Introduction to Machine Learning", "question": "Read-in the File in Windows OS" }, { "text": "Type the following command:\ngit config -l | grep url\nThe output should look like this:\nremote.origin.url=https://github.com/github-username/github-repository-name.git\nChange this to the following format and make sure the change is reflected using command in step 1:\ngit remote set-url origin \"https://github-username@github.com/github-username/github-repository-name.git\"\n(Added by Dheeraj Karra)", "section": "1. Introduction to Machine Learning", "question": "'403 Forbidden' error message when you try to push to a GitHub repository" }, { "text": "I had a problem when I tried to push my code from Git Bash:\nremote: Support for password authentication was removed on August 13, 2021.\nremote: Please see https://docs.github.com/en/get-started/getting-started-with-git/about-remote-repositories#cloning-with-https-urls for information on currently recommended modes of authentication.\nfatal: Authentication failed for 'https://github.com/username\nSolution:\nCreate a personal access token from your github account and use it when you make a push of your last changes.\nhttps://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent\nBruno Bed\u00f3n", "section": "1. Introduction to Machine Learning", "question": "Fatal: Authentication failed for 'https://github.com/username" }, { "text": "In Kaggle, when you are trying to !wget a dataset from github (or any other public repository/location), you get the following error:\nGetting this error while trying to import data- !wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\n--2022-09-17 16:55:24-- https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\nResolving raw.githubusercontent.com (raw.githubusercontent.com)... failed: Temporary failure in name resolution.\nwget: unable to resolve host address 'raw.githubusercontent.com'\nSolution:\nIn your Kaggle notebook settings, turn on the Internet for your session. It's on the settings panel, on the right hand side of the Kaggle screen. You'll be asked to verify your phone number so Kaggle knows you are not a bot.", "section": "1. Introduction to Machine Learning", "question": "wget: unable to resolve host address 'raw.githubusercontent.com'" }, { "text": "I found this video quite helpful: Creating Virtual Environment for Python from VS Code\n[Native Jupiter Notebooks support in VS Code] In VS Code you can also have a native Jupiter Notebooks support, i.e. you do not need to open a web browser to code in a Notebook. 
If you have port forwarding enabled + run a \u2018jupyter notebook \u2018 command from a remote machine + have a remote connection configured in .ssh/config (as Alexey\u2019s video suggests) - VS Code can execute remote Jupyter Notebooks files on a remote server from your local machine: https://code.visualstudio.com/docs/datascience/jupyter-notebooks.\n[Git support from VS Code] You can work with Github from VSCode - staging and commits are easy from the VS Code\u2019s UI: https://code.visualstudio.com/docs/sourcecontrol/overview\n(Added by Ivan Brigida)", "section": "1. Introduction to Machine Learning", "question": "Setting up an environment using VS Code" }, { "text": "With regards to creating an environment for the project, do we need to run the command \"conda create -n .......\" and \"conda activate ml-zoomcamp\" everytime we open vs code to work on the project?\nAnswer:\n\"conda create -n ....\" is just run the first time to create the environment. Once created, you just need to run \"conda activate ml-zoomcamp\" whenever you want to use it.\n(Added by Wesley Barreto)\nconda env export > environment.yml will also allow you to reproduce your existing environment in a YAML file. You can then recreate it with conda env create -f environment.yml", "section": "1. Introduction to Machine Learning", "question": "Conda Environment Setup" }, { "text": "I was doing Question 7 from Week1 Homework and with step6: Invert XTX, I created the inverse. Now, an inverse when multiplied by the original matrix should return in an Identity matrix. But when I multiplied the inverse with the original matrix, it gave a matrix like this:\nInverse * Original:\n[[ 1.00000000e+00 -1.38777878e-16]\n[ 3.16968674e-13 1.00000000e+00]]\nSolution:\nIt's because floating point math doesn't work well on computers as shown here: https://stackoverflow.com/questions/588004/is-floating-point-math-broken\n(Added by Wesley Barreto)", "section": "1. Introduction to Machine Learning", "question": "Floating Point Precision" }, { "text": "Answer:\nIt prints the information about the dataset like:\nIndex datatype\nNo. of entries\nColumn information with not-null count and datatype\nMemory usage by dataset\nWe use it as:\ndf.info()\n(Added by Aadarsha Shrestha & Emoghena Itakpe)", "section": "1. Introduction to Machine Learning", "question": "What does pandas.DataFrame.info() do?" }, { "text": "Pandas and numpy libraries are not being imported\nNameError: name 'np' is not defined\nNameError: name 'pd' is not defined\nIf you're using numpy or pandas, make sure you use the first few lines before anything else.\nimport pandas as pd\nimport numpy as np\nAdded by Manuel Alejandro Aponte", "section": "1. Introduction to Machine Learning", "question": "NameError: name 'np' is not defined" }, { "text": "What if there were hundreds of columns? How do you get the columns only with numeric or object data in a more concise way?\ndf.select_dtypes(include=np.number).columns.tolist()\ndf.select_dtypes(include='object').columns.tolist()\nAdded by Gregory Morris", "section": "1. Introduction to Machine Learning", "question": "How to select column by dtype" }, { "text": "There are many ways to identify the shape of dataset, one of them is using .shape attribute!\ndf.shape\ndf.shape[0] # for identify the number of rows\ndf.shape[1] # for identify the number of columns\nAdded by Radikal Lukafiardi", "section": "1. 
Introduction to Machine Learning", "question": "How to identify the shape of dataset in Pandas" }, { "text": "First of all use np.dot for matrix multiplication. When you compute matrix-matrix multiplication you should understand that order of multiplying is crucial and affects the result of the multiplication!\nDimension Mismatch\nTo perform matrix multiplication, the number of columns in the 1st matrix should match the number of rows in the 2nd matrix. You can rearrange the order to make sure that this satisfies the condition.\nAdded by Leah Gotladera", "section": "1. Introduction to Machine Learning", "question": "How to avoid Value errors with array shapes in homework?" }, { "text": "You would first get the average of the column and save it to a variable, then replace the NaN values with the average variable.\nThis method is called imputing - when you have NaN/ null values in a column, but you do not want to get rid of the row because it has valuable information contributing to other columns.\nAdded by Anneysha Sarkar", "section": "1. Introduction to Machine Learning", "question": "Question 5: How and why do we replace the NaN values with average of the column?" }, { "text": "In Question 7 we are asked to calculate\nThe initial problem can be solved by this, where a Matrix X is multiplied by some unknown weights w resulting in the target y.\nAdditional reading and videos:\nOrdinary least squares\nMultiple Linear Regression in Matrix Form\nPseudoinverse Solution to OLS\nAdded by Sylvia Schmitt\nwith commends from Dmytro Durach", "section": "1. Introduction to Machine Learning", "question": "Question 7: Mathematical formula for linear regression" }, { "text": "This is most likely that you interchanged the first step of the multiplication\nYou used instead of\nAdded by Emmanuel Ikpesu", "section": "1. Introduction to Machine Learning", "question": "Question 7: FINAL MULTIPLICATION not having 5 column" }, { "text": "Note, that matrix multiplication (matrix-matrix, matrix-vector multiplication) can be written as * operator in some sources, but performed as @ operator or np.matmul() via numpy. * operator performs element-wise multiplication (Hadamard product).\nnumpy.dot() or ndarray.dot() can be used, but for matrix-matrix multiplication @ or np.matmul() is preferred (as per numpy doc).\nIf multiplying by a scalar numpy.multiply() or * is preferred.\nAdded by Andrii Larkin", "section": "1. Introduction to Machine Learning", "question": "Question 7: Multiplication operators." }, { "text": "If you face an error kind of ImportError: cannot import name 'contextfilter' from 'jinja2' (anaconda\\lib\\site-packages\\jinja2\\__init__.py) when launching a new notebook for a brand new environment.\nSwitch to the main environment and run \"pip install nbconvert --upgrade\".\nAdded by George Chizhmak", "section": "1. Introduction to Machine Learning", "question": "Error launching Jupyter notebook" }, { "text": "If you face this situation and see IPv6 addresses in the terminal, go to your System Settings > Network > your network connection > Details > Configure IPv6 > set to Manually > OK. Then try again", "section": "1. 
Introduction to Machine Learning", "question": "wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv hangs on MacOS Ventura M1" }, { "text": "Wget doesn't ship with macOS, so there are other alternatives to use.\nNo worries, we got curl:\nexample:\ncurl -o ./housing.csv https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\nExplanations:\ncurl: a utility for retrieving information from the internet.\n-o: Tell it to store the result as a file.\nfilename: You choose the file's name.\nLinks: Put the web address (URL) here, and cURL will extract data from it and save it under the name you provide.\nMore about it at:\nCurl Documentation\nAdded by David Espejo", "section": "1. Introduction to Machine Learning", "question": "In case you are using mac os and having trouble with WGET" }, { "text": "You can use round() function or f-strings\nround(number, 4) - this will round number up to 4 decimal places\nprint(f'Average mark for the Homework is {avg:.3f}') - using F string\nAlso there is pandas.Series. round idf you need to round values in the whole Series\nPlease check the documentation\nhttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round\nAdded by Olga Rudakova", "section": "2. Machine Learning for Regression", "question": "How to output only a certain number of decimal places" }, { "text": "Here are the crucial links for this Week 2 that starts September 18, 2023\nAsk questions for Live Sessions: https://app.sli.do/event/vsUpjYsayZ8A875Hq8dpUa/live/questions\nCalendar for weekly meetings: https://calendar.google.com/calendar/u/0/r?cid=cGtjZ2tkbGc1OG9yb2lxa2Vwc2g4YXMzMmNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ&pli=1\nWeek 2 HW: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/02-regression/homework.md\nSubmit HW Week 2: https://docs.google.com/forms/d/e/1FAIpQLSf8eMtnErPFqzzFsEdLap_GZ2sMih-H-Y7F_IuPGqt4fOmOJw/viewform (also available at the bottom of the above link)\nAll HWs: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/\nGitHub for theory: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp\nYoutube Link: 2.X --- https://www.youtube.com/watch?v=vM3SqPNlStE&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=12\nFAQs: https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit#heading=h.lpz96zg7l47j\n~~Nukta Bhatia~~", "section": "2. Machine Learning for Regression", "question": "How do I get started with Week 2?" }, { "text": "We can use histogram:\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n# Load the data\nurl = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv'\ndf = pd.read_csv(url)\n# EDA\nsns.histplot(df['median_house_value'], kde=False)\nplt.show()\nOR ceck skewness and describe:\nprint(df['median_house_value'].describe())\n# Calculate the skewness of the 'median_house_value' variable\nskewness = df['median_house_value'].skew()\n# Print the skewness value\nprint(\"Skewness of 'median_house_value':\", skewness)\n(Mohammad Emad Sharifi)", "section": "2. Machine Learning for Regression", "question": "Checking long tail of data" }, { "text": "It\u2019s possible that when you follow the videos, you\u2019ll get a Singular Matrix error. We will explain why it happens in the Regularization video. 
Don\u2019t worry, it\u2019s normal that you have it.\nYou can also have an error because you did the inverse of X once in your code and you\u2019re doing it a second time.\n(Added by C\u00e9cile Guillot)", "section": "2. Machine Learning for Regression", "question": "LinAlgError: Singular matrix" }, { "text": "You can find a detailed description of the dataset ere https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html\nKS", "section": "2. Machine Learning for Regression", "question": "California housing dataset" }, { "text": "I was using for loops to apply rmse to list of y_val and y_pred. But the resulting rmse is all nan.\nI found out that the problem was when my data reached the mean step after squaring the error in the rmse function. Turned out there were nan in the array, then I traced the problem back to where I first started to split the data: I had only use fillna(0) on the train data, not on the validation and test data. So the problem was fixed after I applied fillna(0) to all the dataset (train, val, test). Voila, my for loops to get rmse from all the seed values work now.\nAdded by Sasmito Yudha Husada", "section": "2. Machine Learning for Regression", "question": "Getting NaNs after applying .mean()" }, { "text": "Why should we transform the target variable to logarithm distribution? Do we do this for all machine learning projects?\nOnly if you see that your target is highly skewed. The easiest way to evaluate this is by plotting the distribution of the target variable.\nThis can help to understand skewness and how it can be applied to the distribution of your data set.\nhttps://en.wikipedia.org/wiki/Skewness\nPastor Soto", "section": "2. Machine Learning for Regression", "question": "Target variable transformation" }, { "text": "The dataset can be read directly to pandas dataframe from the github link using the technique shown below\ndfh=pd.read_csv(\"https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\")\nKrishna Anand", "section": "2. Machine Learning for Regression", "question": "Reading the dataset directly from github" }, { "text": "For users of kaggle notebooks, the dataset can be loaded through widget using the below command. Please remember that ! before wget is essential\n!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\nOnce the dataset is loaded to the kaggle notebook server, it can be read through the below pandas command\ndf = pd.read_csv('housing.csv')\nHarish Balasundaram", "section": "2. Machine Learning for Regression", "question": "Loading the dataset directly through Kaggle Notebooks" }, { "text": "We can filter a dataset by using its values as below.\ndf = df[(df[\"ocean_proximity\"] == \"<1H OCEAN\") | (df[\"ocean_proximity\"] == \"INLAND\")]\nYou can use | for \u2018OR\u2019, and & for \u2018AND\u2019\nAlternative:\ndf = df[df['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]\nRadikal Lukafiardi", "section": "2. Machine Learning for Regression", "question": "Filter a dataset by using its values" }, { "text": "Above users showed how to load the dataset directly from github. 
Here is another useful way of doing this using the `requests` library:\n# Get data for homework\nimport requests\nurl = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv'\nresponse = requests.get(url)\nif response.status_code == 200:\nwith open('housing.csv', 'wb') as file:\nfile.write(response.content)\nelse:\nprint(\"Download failed.\")\nTyler Simpson", "section": "2. Machine Learning for Regression", "question": "Alternative way to load the data using requests" }, { "text": "When creating a duplicate of your dataframe by doing the following:\nX_train = df_train\nX_val = df_val\nYou\u2019re still referencing the original variable, this is called a shallow copy. You can make sure that no references are attaching both variables and still keep the copy of the data do the following to create a deep copy:\nX_train = df_train.copy()\nX_val = df_val.copy()\nAdded by Ixchel Garc\u00eda", "section": "2. Machine Learning for Regression", "question": "Null column is appearing even if I applied .fillna()" }, { "text": "Yes, you can. Here we implement it ourselves to better understand how it works, but later we will only rely on Scikit-Learn\u2019s functions. If you want to start using it earlier \u2014 feel free to do it", "section": "2. Machine Learning for Regression", "question": "Can I use Scikit-Learn\u2019s train_test_split for this week?" }, { "text": "Yes, you can. We will also do that next week, so don\u2019t worry, you will learn how to do it.", "section": "2. Machine Learning for Regression", "question": "Can I use LinearRegression from Scikit-Learn for this week?" }, { "text": "What are equivalents in Scikit-Learn for the linear regression with and without regularization used in week 2.\nCorresponding function for model without regularization:\nsklearn.linear_model.LinearRegression\nCorresponding function for model with regularization:\nsklearn.linear_model.Ridge\nThe linear model from Scikit-Learn are explained here:\nhttps://scikit-learn.org/stable/modules/linear_model.html\nAdded by Sylvia Schmitt", "section": "2. Machine Learning for Regression", "question": "Corresponding Scikit-Learn functions for Linear Regression (with and without Regularization)" }, { "text": "`r` is a regularization parameter.\nIt\u2019s similar to `alpha` in sklearn.Ridge(), as both control the \"strength\" of regularization (increasing both will lead to stronger regularization), but mathematically not quite, here's how both are used:\nsklearn.Ridge()\n||y - Xw||^2_2 + alpha * ||w||^2_2\nlesson\u2019s notebook (`train_linear_regression_reg` function)\nXTX = XTX + r * np.eye(XTX.shape[0])\n`r` adds \u201cnoise\u201d to the main diagonal to prevent multicollinearity, which \u201cbreaks\u201d finding inverse matrix.", "section": "2. Machine Learning for Regression", "question": "Question 4: what is `r`, is it the same as `alpha` in sklearn.Ridge()?" }, { "text": "Q: \u201cIn lesson 2.8 why is y_pred different from y? After all, we trained X_train to get the weights that when multiplied by X_train should give exactly y, or?\u201d\nA: linear regression is a pretty simple model, it neither can nor should fit 100% (nor any other model, as this would be the sign of overfitting). 
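A tiny simulated illustration of the same point (synthetic data only): even when we predict on the very rows we trained on, the error stays above zero because of the noise in y.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=200)   # a true line plus noise

# Closed-form least-squares fit of y = w0 + w1 * x.
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.inv(X.T @ X) @ (X.T @ y)

rmse = np.sqrt(np.mean((y - X @ w) ** 2))
print(w, rmse)   # rmse stays near the noise level (~1.5), not at 0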
This picture might illustrate some intuition behind this, imagine X is a single feature:\nAs our model is linear, how would you draw a line to fit all the \"dots\"?\nYou could \"fit\" all the \"dots\" on this pic using something like scipy.optimize.curve_fit (non-linear least squares) if you wanted to, but imagine how it would perform on previously unseen data.\nAdded by Andrii Larkin", "section": "2. Machine Learning for Regression", "question": "Why linear regression doesn\u2019t provide a \u201cperfect\u201d fit?" }, { "text": "One of the questions on the homework calls for using a random seed of 42. When using 42, all my missing values ended up in my training dataframe and not my validation or test dataframes, why is that?\nThe purpose of the seed value is to randomly generate the proportion split. Using a seed of 42 ensures that all learners are on the same page by getting the same behavior (in this case, all missing values ending up in the training dataframe). If using a different seed value (e.g. 9), missing values will then appear in all other dataframes.", "section": "2. Machine Learning for Regression", "question": "Random seed 42" }, { "text": "It is possible to do the shuffling of the dataset with the pandas built-in function pandas.DataFrame.sample.The complete dataset can be shuffled including resetting the index with the following commands:\nSetting frac=1 will result in returning a shuffled version of the complete Dataset.\nSetting random_state=seed will result in the same randomization as used in the course resources.\ndf_shuffled = df.sample(frac=1, random_state=seed)\ndf_shuffled.reset_index(drop=True, inplace=True)\nAdded by Sylvia Schmitt", "section": "2. Machine Learning for Regression", "question": "Shuffling the initial dataset using pandas built-in function" }, { "text": "That\u2019s normal. We all have different environments: our computers have different versions of OS and different versions of libraries \u2014 even different versions of Python.\nIf it\u2019s the case, just select the option that\u2019s closest to your answer", "section": "2. Machine Learning for Regression", "question": "The answer I get for one of the homework questions doesn't match any of the options. What should I do?" }, { "text": "In question 3 of HW02 it is mentioned: \u2018For computing the mean, use the training only\u2019. What does that mean?\nIt means that you should use only the training data set for computing the mean, not validation or test data set. This is how you can calculate the mean\ndf_train['column_name'].mean( )\nAnother option:\ndf_train[\u2018column_name\u2019].describe()\n(Bhaskar Sarma)", "section": "2. Machine Learning for Regression", "question": "Meaning of mean in homework 2, question 3" }, { "text": "When the target variable has a long tail distribution, like in prices, with a wide range, you can transform the target variable with np.log1p() method, but be aware if your target variable has negative values, this method will not work", "section": "2. Machine Learning for Regression", "question": "When should we transform the target variable to logarithm distribution?" }, { "text": "If we try to perform an arithmetic operation between 2 arrays of different shapes or different dimensions, it throws an error like operands could not be broadcast together with shapes. There are some scenarios when broadcasting can occur and when it fails.\nIf this happens sometimes we can use * operator instead of dot() method to solve the issue. 
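As a toy illustration (made-up arrays): .dot needs the inner dimensions to match, while * needs shapes that broadcast, so checking .shape and transposing one operand is usually the fix.
import numpy as np

a = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
b = np.array([[1., 0., 1.], [0., 1., 0.]])   # shape (2, 3)

# a.dot(b) would raise "shapes (2,3) and (2,3) not aligned"; transpose one operand:
print(a.dot(b.T))   # (2, 3) times (3, 2) -> (2, 2)

# Element-wise * works here because the shapes match exactly:
print(a * b)        # still (2, 3), entries multiplied pairwise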
So that the error is solved and also we get the dot product.\n(Santhosh Kumar)", "section": "2. Machine Learning for Regression", "question": "ValueError: shapes not aligned" }, { "text": "Copy of a dataframe is made with X_copy = X.copy().\nThis is called creating a deep copy. Otherwise it will keep changing the original dataframe if used like this: X_copy = X.\nAny changes to X_copy will reflect back to X. This is not a real copy, instead it is a \u201cview\u201d.\n(Memoona Tahira)", "section": "2. Machine Learning for Regression", "question": "How to copy a dataframe without changing the original dataframe?" }, { "text": "One of the most important characteristics of the normal distribution is that mean=median=mode, this means that the most popular value, the mean of the distribution and 50% of the sample are under the same value, this is equivalent to say that the area under the curve (black) is the same on the left and on the right. The long tail (red curve) is the result of having a few observations with high values, now the behaviour of the distribution changes, first of all, the area is different on each side and now the mean, median and mode are different. As a consequence, the mean is no longer representative, the range is larger than before and the probability of being on the left or on the right is not the same.\n(Tatiana D\u00e1vila)", "section": "2. Machine Learning for Regression", "question": "What does \u2018long tail\u2019 mean?" }, { "text": "In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range. [Wikipedia] The formula to calculate standard deviation is:\n(Aadarsha Shrestha)", "section": "2. Machine Learning for Regression", "question": "What is standard deviation?" }, { "text": "The application of regularization depends on the specific situation and problem. It is recommended to consider it when training machine learning models, especially with small datasets or complex models, to prevent overfitting. However, its necessity varies depending on the data quality and size. Evaluate each case individually to determine if it is needed.\n(Daniel Mu\u00f1oz Viveros)", "section": "2. Machine Learning for Regression", "question": "Do we need to apply regularization techniques always? Or only in certain scenarios?" }, { "text": "As it speeds up the development:\nprepare_df(initial_df, seed, fill_na_type) - that prepared all 3 dataframes and 3 y_vectors. Fillna() can be done before the initial_df is split.\nOf course, you can reuse other functions: rmse() and train_linear_regression(X,y,r) from the class notebook\n(Ivan Brigida)", "section": "2. Machine Learning for Regression", "question": "Shortcut: define functions for faster execution" }, { "text": "If we have a list or series of data for example x = [1,2,3,4,5]. We can use pandas to find the standard deviation. We can pass our list into panda series and call standard deviation directly on the series pandas.Series(x).std().\n(Quinn Avila)", "section": "2. Machine Learning for Regression", "question": "How to use pandas to find standard deviation" }, { "text": "Numpy and Pandas packages use different equations to compute the standard deviation. 
Numpy uses population standard deviation, whereas pandas uses sample standard deviation by default.\nNumpy\nPandas\npandas default standard deviation is computed using one degree of freedom. You can change degree in of freedom in NumPy to change this to unbiased estimator by using ddof parameter:\nimport numpy as np\nnp.std(df.weight, ddof=1)\nThe result will be similar if we change the dof = 1 in numpy\n(Harish Balasundaram)", "section": "2. Machine Learning for Regression", "question": "Standard Deviation Differences in Numpy and Pandas" }, { "text": "In pandas you can use built in Pandas function names std() to get standard deviation. For example\ndf['column_name'].std() to get standard deviation of that column.\ndf[['column_1', 'column_2']].std() to get standard deviation of multiple columns.\n(Khurram Majeed)", "section": "2. Machine Learning for Regression", "question": "Standard deviation using Pandas built in Function" }, { "text": "Use \u2018pandas.concat\u2019 function (https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to combine two dataframes. To combine two numpy arrays use numpy.concatenate (https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function. So the code would be as follows:\ndf_train_combined = pd.concat([df_train, df_val])\ny_train = np.concatenate((y_train, y_val), axis=0)\n(George Chizhmak)", "section": "2. Machine Learning for Regression", "question": "How to combine train and validation datasets" }, { "text": "The Root Mean Squared Error (RMSE) is one of the primary metrics to evaluate the performance of a regression model. It calculates the average deviation between the model's predicted values and the actual observed values, offering insight into the model's ability to accurately forecast the target variable. To calculate RMSE score:\nLibraries needed\nimport numpy as np\nfrom sklearn.metrics import mean_squared_error\nmse = mean_squared_error(actual_values, predicted_values)\nrmse = np.sqrt(mse)\nprint(\"Root Mean Squared Error (RMSE):\", rmse)\n(Aminat Abolade)", "section": "2. Machine Learning for Regression", "question": "Understanding RMSE and how to calculate RMSE score" }, { "text": "If you would like to use multiple conditions as an example below you will get the error. The correct syntax for OR is |, and for AND is &\n(Olga Rudakova)\n\u2013", "section": "2. Machine Learning for Regression", "question": "What syntax use in Pandas for multiple conditions using logical AND and OR" }, { "text": "I found this video pretty usual for understanding how we got the normal form with linear regression Normal Equation Derivation for Regression", "section": "2. Machine Learning for Regression", "question": "Deep dive into normal equation for regression" }, { "text": "(Hrithik Kumar Advani)", "section": "2. Machine Learning for Regression", "question": "Useful Resource for Missing Data Treatment\nhttps://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python/notebook" }, { "text": "The instruction for applying log transformation to the \u2018median_house_value\u2019 variable is provided before Q3 in the homework for Week-2 under the \u2018Prepare and split the dataset\u2019 heading.\nHowever, this instruction is absent in the subsequent questions of the homework, and I got stuck with Q5 for a long time, trying to figure out why my RMSE was so huge, when it clicked to me that I forgot to apply log transformation to the target variable. 
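As a reminder, a minimal sketch of that transformation (assuming the target column is median_house_value, as in that homework):
import numpy as np
y_train = np.log1p(df_train.median_house_value.values)
y_val = np.log1p(df_val.median_house_value.values)
# use np.expm1(y_pred) to map predictions back to the original price scale when interpreting them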
Please remember to apply log transformation to the target variable for each question.\n(Added by Soham Mundhada)", "section": "2. Machine Learning for Regression", "question": "Caution for applying log transformation in Week-2 2023 cohort homework" }, { "text": "Version 0.24.2 and Python 3.8.11\n(Added by Diego Giraldo)", "section": "3. Machine Learning for Classification", "question": "What sklearn version is Alexey using in the youtube videos?" }, { "text": "Week 3 HW: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/03-classification/homework.md\nSubmit HW Week 3: https://docs.google.com/forms/d/e/1FAIpQLSeXS3pqsv_smRkYmVx-7g6KIZDnG29g2s7pdHo-ASKNqtfRFQ/viewform\nAll HWs: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/\nEvaluation Matrix: https://docs.google.com/spreadsheets/d/e/2PACX-1vQCwqAtkjl07MTW-SxWUK9GUvMQ3Pv_fF8UadcuIYLgHa0PlNu9BRWtfLgivI8xSCncQs82HDwGXSm3/pubhtml\nGitHub for theory: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp\nYoutube Link: 3.X --- https://www.youtube.com/watch?v=0Zw04wdeTQo&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=29\n~~Nukta Bhatia~~", "section": "3. Machine Learning for Classification", "question": "How do I get started with Week 3?" }, { "text": "The error message \u201ccould not convert string to float: \u2018Nissan\u2019\u201d typically occurs when a machine learning model or function is expecting numerical input, but receives a string instead. In this case, it seems like the model is trying to convert the car brand \u2018Nissan\u2019 into a numerical value, which isn\u2019t possible.\nTo resolve this issue, you can encode categorical variables like car brands into numerical values. One common method is one-hot encoding, which creates new binary columns for each category/label present in the original column.\nHere\u2019s an example of how you can perform one-hot encoding using pandas:\nimport pandas as pd\n# Assuming 'data' is your DataFrame and 'brand' is the column with car brands\ndata_encoded = pd.get_dummies(data, columns=['brand'])\nIn this code, pd.get_dummies() creates a new DataFrame where the \u2018brand\u2019 column is replaced with binary columns for each brand (e.g., \u2018brand_Nissan\u2019, \u2018brand_Toyota\u2019, etc.). Each row in the DataFrame has a 1 in the column that corresponds to its brand and 0 in all other brand columns.\n-Mohammad Emad Sharifi-", "section": "3. Machine Learning for Classification", "question": "Could not convert string to float: \u2018Nissan\u2019" }, { "text": "Solution: Mutual Information score calculates the relationship between categorical variables or discrete variables. So in the homework, because the target which is median_house_value is continuous, we had to change it to binary format, which in other words makes its values discrete as either 0 or 1. If we allowed it to remain in the continuous variable format, the mutual information score could be calculated, but the algorithm would have to divide the continuous variables into bins and that would be highly subjective. That is why continuous variables are not used for mutual information score calculation.\n\u2014Odimegwu David\u2014", "section": "3. Machine Learning for Classification", "question": "Why did we change the targets to binary format when calculating mutual information score in the homework?" }, { "text": "Q2 asks about correlation matrix and converting median_house_value from numeric to binary. 
Just to make sure here we are only dealing with df_train not df_train_full, right? As the question explicitly mentions the train dataset.\nYes. I think it is only on df_train. The reason behind this is that df_train_full also contains the validation dataset, so at this stage we don't want to make conclusions based on the validation data, since we want to test how we did without using that portion of the data.\nPastor Soto", "section": "3. Machine Learning for Classification", "question": "What data should we use for correlation matrix" }, { "text": "The background of any dataframe can be colored (not only the correlation matrix) based on the numerical values the dataframe contains by using the method pandas.io.formats.style.Styler.background_graident.\nHere an example on how to color the correlation matrix. A color map of choice can get passed, here \u2018viridis\u2019 is used.\n# ensure to have only numerical values in the dataframe before calling 'corr'\ncorr_mat = df_numerical_only.corr()\ncorr_mat.style.background_gradient(cmap='viridis')\nHere is an example of how the coloring will look like using a dataframe containing random values and applying \u201cbackground_gradient\u201d to it.\nnp.random.seed = 3\ndf_random = pd.DataFrame(data=np.random.random(3*3).reshape(3,3))\ndf_random.style.background_gradient(cmap='viridis')\nAdded by Sylvia Schmitt", "section": "3. Machine Learning for Classification", "question": "Coloring the background of the pandas.DataFrame.corr correlation matrix directly" }, { "text": "data_corr = pd.DataFrame(data_num.corr().round(3).abs().unstack().sort_values(ascending=False))\ndata_corr.head(10)\nAdded by Harish Balasundaram\nYou can also use seaborn to create a heatmap with the correlation. The code for doing that:\nsns.heatmap(df[numerical_features].corr(),\nannot=True,\nsquare=True,\nfmt=\".2g\",\ncmap=\"crest\")\nAdded by Cecile Guillot\nYou can refine your heatmap and plot only a triangle, with a blue to red color gradient, that will show every correlation between your numerical variables without redundant information with this function:\nWhich outputs, in the case of churn dataset:\n(M\u00e9lanie Fouesnard)", "section": "3. Machine Learning for Classification", "question": "Identifying highly correlated feature pairs easily through unstack" }, { "text": "Should we perform EDA on the base of train or train+validation or train+validation+test dataset?\nIt's indeed good practice to only rely on the train dataset for EDA. Including validation might be okay. But we aren't supposed to touch the test dataset, even just looking at it isn't a good idea. We indeed pretend that this is the future unseen data\nAlena Kniazeva", "section": "3. Machine Learning for Classification", "question": "What data should be used for EDA?" }, { "text": "Validation dataset helps to validate models and prediction on unseen data. This helps get an estimate on its performance on fresh data. It helps optimize the model.\nEdidiong Esu\nBelow is an extract of Alexey's book explaining this point. Hope is useful\nWhen we apply the fit method, this method is looking at the content of the df_train dictionaries we are passing to the DictVectorizer instance, and fit is figuring out (training) how to map the values of these dictionaries. If categorical, applies one-hot encoding, if numerical it will leave it as it is.\nWith this context, if we apply the fit to the validation model, we are \"giving the answers\" and we are not letting the \"fit\" do its job for data that we haven't seen. 
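In code, a minimal sketch of this pattern (train_dicts and val_dicts are assumed to be lists of feature dictionaries):
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)  # learn the feature mapping on the training data only
X_val = dv.transform(val_dicts)          # reuse the same mapping; no refitting on validation data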
By not applying the fit to the validation model we can know how well it was trained.\nBelow is an extract of Alexey's book explaining this point.\nHumberto Rodriguez\nThere is no need to initialize another instance of dictvectorizer after fitting it on the train set as it will overwrite what it learnt from being fit on the train data.\nThe correct way is to fit_transform the train set, and only transform the validation and test sets.\nMemoona Tahira", "section": "3. Machine Learning for Classification", "question": "Fitting DictVectorizer on validation" }, { "text": "For Q5 in homework, should we calculate the smallest difference in accuracy in real values (i.e. -0.001 is less than -0.0002) or in absolute values (i.e. 0.0002 is less than 0.001)?\nWe should select the \u201csmallest\u201d difference, and not the \u201clowest\u201d, meaning we should reason in absolute values.\nIf the difference is negative, it means that the model actually became better when we removed the feature.", "section": "3. Machine Learning for Classification", "question": "Feature elimination" }, { "text": "Instead use the method \u201c.get_feature_names_out()\u201d from DictVectorizer function and the warning will be resolved , but we need not worry about the waning as there won't be any warning\nSanthosh Kumar", "section": "3. Machine Learning for Classification", "question": "FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2" }, { "text": "Fitting the logistic regression takes a long time / kernel crashes when calling predict() with the fitted model.\nMake sure that the target variable for the logistic regression is binary.\nKonrad Muehlberg", "section": "3. Machine Learning for Classification", "question": "Logistic regression crashing Jupyter kernel" }, { "text": "Ridge regression is a linear regression technique used to mitigate the problem of multicollinearity (when independent variables are highly correlated) and prevent overfitting in predictive modeling. It adds a regularization term to the linear regression cost function, penalizing large coefficients.\nsag Solver: The sag solver stands for \"Stochastic Average Gradient.\" It's particularly suitable for large datasets, as it optimizes the regularization term using stochastic gradient descent (SGD). sag can be faster than some other solvers for large datasets.\nAlpha: The alpha parameter controls the strength of the regularization in Ridge regression. A higher alpha value leads to stronger regularization, which means the model will have smaller coefficient values, reducing the risk of overfitting.\nfrom sklearn.linear_model import Ridge\nridge = Ridge(alpha=alpha, solver='sag', random_state=42)\nridge.fit(X_train, y_train)\nAminat Abolade", "section": "3. Machine Learning for Classification", "question": "Understanding Ridge" }, { "text": "DictVectorizer(sparse=True) produces CSR format, which is both more memory efficient and converges better during fit(). 
Basically it stores non-zero values and indices instead of adding a column for each class of each feature (models of cars produced 900+ columns alone in the current task).\nUsing \u201csparse\u201d format like on the picture above, both via pandas.get_dummies() and DictVectorizer(sparse=False) - is slower (around 6-8min for Q6 task - Linear/Ridge Regression) for high amount of classes (like models of cars for eg) and gives a bit \u201cworse\u201d results in both Logistic and Linear/Ridge Regression, while also producing convergence warnings for Linear/Ridge Regression.\nLarkin Andrii", "section": "3. Machine Learning for Classification", "question": "pandas.get_dummies() and DictVectorizer(sparse=False) produce the same type of one-hot encodings:" }, { "text": "Ridge with sag solver requires feature to be of the same scale. You may get the following warning: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\nPlay with different scalers. See notebook-scaling-ohe.ipynb\nDmytro Durach\n(Oscar Garcia) Use a StandardScaler for the numeric fields and OneHotEncoder (sparce = False) for the categorical features. This help with the warning. Separate the features (num/cat) without using the encoder first and see if that helps.", "section": "3. Machine Learning for Classification", "question": "Convergence Problems in W3Q6" }, { "text": "When encountering convergence errors during the training of a Ridge regression model, consider the following steps:\nFeature Normalization: Normalize your numerical features using techniques like MinMaxScaler or StandardScaler. This ensures that numerical features are on a \tsimilar scale, preventing convergence issues.\nCategorical Feature Encoding: If your dataset includes categorical features, apply \tcategorical encoding techniques such as OneHotEncoder (OHE) to \tconvert them into a numerical format. OHE is commonly used to represent categorical variables as binary vectors, making them compatible with regression models like Ridge.\nCombine Features: After \tnormalizing numerical features and encoding categorical features using OneHotEncoder, combine them to form a single feature matrix (X_train). This combined dataset serves as the input for training the Ridge regression model.\nBy following these steps, you can address convergence errors and enhance the stability of your Ridge model training process. It's important to note that the choice of encoding method, such as OneHotEncoder, is appropriate for handling categorical features in this context.\nYou can find an example here.\n \t\t\t\t\t\t\t\t\t\t\t\tOsman Ali", "section": "3. Machine Learning for Classification", "question": "Dealing with Convergence in Week 3 q6" }, { "text": "A sparse matrix is more memory-efficient because it only stores the non-zero values and their positions in memory. This is particularly useful when working with large datasets with many zero or missing values.\nThe default DictVectorizer configuration is a sparse matrix. For week3 Q6 using the default sparse is an interesting option because of the size of the matrix. Training the model was also more performant and didn\u2019t give an error message like dense mode.\n \t\t\t\t\t\t\t\t\t\t\t\tQuinn Avila", "section": "3. Machine Learning for Classification", "question": "Sparse matrix compared dense matrix" }, { "text": "The warnings on the jupyter notebooks can be disabled/ avoided with the following comments:\nImport warnings\nwarnings.filterwarnings(\u201cignore\u201d)\nKrishna Anand", "section": "3. 
Machine Learning for Classification", "question": "How to Disable/avoid Warnings in Jupyter Notebooks" }, { "text": "Question: Regarding RMSE, how do we decide on the correct score to choose? In the study group discussion about week two homework, all of us got it wrong and one person had the lowest score selected as well.\nAnswer: You need to find RMSE for each alpha. If RMSE scores are equal, you will select the lowest alpha.\nAsia Saeed", "section": "3. Machine Learning for Classification", "question": "How to select the alpha parameter in Q6" }, { "text": "Question: Could you please help me with HW3 Q3: \"Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.\" What is the second variable that we need to use to calculate the mutual information score?\nAnswer: You need to calculate the mutual info score between the binarized price (above_average) variable & ocean_proximity, the only original categorical variable in the dataset.\nAsia Saeed", "section": "3. Machine Learning for Classification", "question": "Second variable that we need to use to calculate the mutual information score" }, { "text": "Do we need to train the model only with the features: total_rooms, total_bedrooms, population and households? or with all the available features and then pop once at a time each of the previous features and train the model to make the accuracy comparison?\nYou need to create a list of all features in this question and evaluate the model one time to obtain the accuracy, this will be the original accuracy, and then remove one feature each time, and in each time, train the model, find the accuracy and the difference between the original accuracy and the found accuracy. Finally, find out which feature has the smallest absolute accuracy difference.\nWhile calculating differences between accuracy scores while training on the whole model, versus dropping one feature at a time and comparing its accuracy to the model to judge impact of the feature on the accuracy of the model, do we take the smallest difference or smallest absolute difference?\nSince order of subtraction between the two accuracy scores can result in a negative number, we will take its absolute value as we are interested in the smallest value difference, not the lowest difference value. Case in point, if difference is -4 and -2, the smallest difference is abs(-2), and not abs(-4)", "section": "3. Machine Learning for Classification", "question": "Features for homework Q5" }, { "text": "Both work in similar ways. That is, to convert categorical features to numerical variables for use in training the model. But the difference lies in the input. OneHotEncoder uses an array as input while DictVectorizer uses a dictionary.\nBoth will produce the same result. But when we use OneHotEncoder, features are sorted alphabetically. When you use DictVectorizer you stack features that you want.\nTanya Mard", "section": "3. Machine Learning for Classification", "question": "What is the difference between OneHotEncoder and DictVectorizer?" }, { "text": "They are basically the same. There are some key differences with regards to their input/output types, handling of missing values, etc, but they are both techniques to one-hot-encode categorical variables with identical results. 
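For illustration, a minimal sketch (the column name and values are made up):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'fuel': ['gas', 'diesel', 'gas']})
dummies = pd.get_dummies(df, columns=['fuel'])   # DataFrame with fuel_diesel and fuel_gas columns
ohe = OneHotEncoder(sparse_output=False)         # use sparse=False on older scikit-learn versions
encoded = ohe.fit_transform(df[['fuel']])        # numpy array with the same one-hot columns, sorted alphabetically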
The biggest difference is get_dummies are a convenient choice when you are working with Pandas Dataframes, while if you are building a scikit-learn-based machine learning pipeline and need to handle categorical data as part of that pipeline, OneHotEncoder is a more suitable choice. [Abhirup Ghosh]", "section": "3. Machine Learning for Classification", "question": "What is the difference between pandas get_dummies and sklearn OnehotEncoder?" }, { "text": "For the test_train_split question on week 3's homework, are we supposed to use 42 as the random_state in both splits or only the 1st one?\nAnswer: for both splits random_state = 42 should be used\n(Bhaskar Sarma)", "section": "3. Machine Learning for Classification", "question": "Use of random seed in HW3" }, { "text": "Should correlation be calculated after splitting or before splitting. And lastly I know how to find the correlation but how do i find the two most correlated features.\nAnswer: Correlation matrix of your train dataset. Thus, after splitting. Two most correlated features are the ones having the highest correlation coefficient in terms of absolute values.", "section": "3. Machine Learning for Classification", "question": "Correlation before or after splitting the data" }, { "text": "Make sure that the features used in ridge regression model are only NUMERICAL ones not categorical.\nDrop all categorical features first before proceeding.\n(Aileah Gotladera)\nWhile it is True that ridge regression accepts only numerical values, the categorical ones can be useful for your model. You have to transform them using one-hot encoding before training the model. To avoid the error of non convergence, put sparse=True when doing so.\n(Erjon)", "section": "3. Machine Learning for Classification", "question": "Features in Ridge Regression Model" }, { "text": "You need to use all features. and price for target. Don't include the average variable we created before.\nIf you use DictVectorizer then make sure to use sparce=True to avoid convergence errors\nI also used StandardScalar for numerical variable you can try running with or without this\n(Peter Pan)", "section": "3. Machine Learning for Classification", "question": "Handling Column Information for Homework 3 Question 6" }, { "text": "Use sklearn.preprocessing encoders and scalers, e.g. OneHotEncoder, OrdinalEncoder, and StandardScaler.", "section": "3. Machine Learning for Classification", "question": "Transforming Non-Numerical Columns into Numerical Columns" }, { "text": "These both methods receive the dictionary as an input. While the DictVectorizer will store the big vocabulary and takes more memory. FeatureHasher create a vectors with predefined length. They are both used for categorical features.\nWhen you have a high cardinality for categorical features better to use FeatureHasher. If you want to preserve feature names in transformed data and have a small number of unique values is DictVectorizer. But your choice will dependence on your data.\nYou can read more by follow the link https://scikit-learn.org/stable/auto_examples/text/plot_hashing_vs_dict_vectorizer.html\nOlga Rudakova", "section": "3. Machine Learning for Classification", "question": "What is the better option FeatureHasher or DictVectorizer" }, { "text": "(Question by Connie S.)\nThe reason it's good/recommended practice to do it after splitting is to avoid data leakage - you don't want any data from the test set influencing the training stage (similarly from the validation stage in the initial training). See e.g. 
scikit-learn documentation on \"Common pitfalls and recommended practices\": https://scikit-learn.org/stable/common_pitfalls.html\nAnswered/added by Rileen Sinha", "section": "3. Machine Learning for Classification", "question": "Isn't it easier to use DictVectorizer or get_dummies before splitting the data into train/val/test? Is there a reason we wouldn't do this? Or is it the same either way?" }, { "text": "If you are getting 1.0 as accuracy then there is a possibility you have overfitted the model. Dropping the column msrp/price can help you solve this issue.\nAdded by Akshar Goyal", "section": "3. Machine Learning for Classification", "question": "HW3Q4 I am getting 1.0 as accuracy. Should I use the closest option?" }, { "text": "We can use the sklearn & numpy packages to calculate Root Mean Squared Error on the validation (or test) set:\nfrom sklearn.metrics import mean_squared_error\nimport numpy as np\nrmse = np.sqrt(mean_squared_error(y_val, y_pred))\nAdded by Radikal Lukafiardi\nYou can also refer to Alexey\u2019s notebook for Week 2:\nhttps://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-02-car-price/02-carprice.ipynb\nwhich includes the following code:\ndef rmse(y, y_pred):\nerror = y_pred - y\nmse = (error ** 2).mean()\nreturn np.sqrt(mse)\n(added by Rileen Sinha)", "section": "3. Machine Learning for Classification", "question": "How to calculate Root Mean Squared Error?" }, { "text": "The solution is to use \u201cget_feature_names_out\u201d instead. See details: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html\nGeorge Chizhmak", "section": "3. Machine Learning for Classification", "question": "AttributeError: 'DictVectorizer' object has no attribute 'get_feature_names'" }, { "text": "To use RMSE without math or numpy, \u2018sklearn.metrics\u2019 has a mean_squared_error function with a squared kwarg (defaults to True). Setting squared to False will return the RMSE.\nfrom sklearn.metrics import mean_squared_error\nrms = mean_squared_error(y_actual, y_predicted, squared=False)\nSee details: https://stackoverflow.com/questions/17197492/is-there-a-library-function-for-root-mean-square-error-rmse-in-python\nAhmed Okka", "section": "3. Machine Learning for Classification", "question": "Root Mean Squared Error" }, { "text": "This article explains different encoding techniques: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02\nHrithik Kumar Advani", "section": "3. Machine Learning for Classification", "question": "Encoding Techniques" }, { "text": "I got this error multiple times; here is the code:\n\u201caccuracy_score(y_val, y_pred >= 0.5)\u201d\nTypeError: 'numpy.float64' object is not callable\nThis usually means the name accuracy_score has been overwritten by a number earlier in the notebook. I solved it using\nfrom sklearn import metrics\nmetrics.accuracy_score(y_val, y_pred >= 0.5)\nOMAR Wael", "section": "4. 
Evaluation Metrics for Classification", "question": "Error in use of accuracy_score from sklearn in jupyter (sometimes)" }, { "text": "Week 4 HW: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/04-evaluation/homework.md\nAll HWs: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/\nEvaluation Matrix: https://docs.google.com/spreadsheets/d/e/2PACX-1vQCwqAtkjl07MTW-SxWUK9GUvMQ3Pv_fF8UadcuIYLgHa0PlNu9BRWtfLgivI8xSCncQs82HDwGXSm3/pubhtml\nGitHub for theory: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp\nYouTube Link: 4.X --- https://www.youtube.com/watch?v=gmg5jw1bM8A&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=40\nSci-Kit Learn on Evaluation:\nhttps://scikit-learn.org/stable/model_selection.html\n~~Nukta Bhatia~~", "section": "4. Evaluation Metrics for Classification", "question": "How do I get started with Week 4?" }, { "text": "https://datatalks-club.slack.com/archives/C0288NJ5XSA/p1696475675887119\nMetrics can be used on a series or a dataframe\n~~Ella Sahnan~~", "section": "4. Evaluation Metrics for Classification", "question": "Using a variable to score" }, { "text": "Ie particularly in module-04 homework Qn2 vs Qn5. https://datatalks-club.slack.com/archives/C0288NJ5XSA/p1696760905214979\nRefer to the sklearn docs, random_state is to ensure the \u201crandomness\u201d that is used to shuffle dataset is reproducible, and it usually requires both random_state and shuffle params to be set accordingly.\n~~Ella Sahnan~~", "section": "4. Evaluation Metrics for Classification", "question": "Why do we sometimes use random_state and not at other times?" }, { "text": "How to get classification metrics - precision, recall, f1 score, accuracy simultaneously\nUse classification_report from sklearn. For more info check here.\nAbhishek N", "section": "4. Evaluation Metrics for Classification", "question": "How to get all classification metrics?" }, { "text": "I am getting multiple thresholds with the same F1 score, does this indicate I am doing something wrong or is there a method for choosing? I would assume just pick the lowest?\nChoose the one closest to any of the options\nAdded by Azeez Enitan Edunwale\nYou can always use scikit-learn (or other standard libraries/packages) to verify results obtained using your own code, e.g. you can use \u201cclassification_report\u201d (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to obtain precision, recall and F1-score.\nAdded by Rileen Sinha", "section": "4. Evaluation Metrics for Classification", "question": "Multiple thresholds for Q4" }, { "text": "Solution description: duplicating the\ndf.churn = (df.churn == 'yes').astype(int)\nThis is causing you to have only 0's in your churn column. In fact, match with the error you are getting: ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0.\nIt is telling us that it only contains 0's.\nDelete one of the below cells and you will get the accuracy\nHumberto Rodriguez", "section": "4. Evaluation Metrics for Classification", "question": "ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0" }, { "text": "Use Yellowbrick. Yellowbrick in a library that combines scikit-learn with matplotlib to produce visualizations for your models. It produces colorful classification reports.\nKrishna Annad", "section": "4. 
Evaluation Metrics for Classification", "question": "Method to get beautiful classification report" }, { "text": "That\u2019s fine, use the closest option", "section": "4. Evaluation Metrics for Classification", "question": "I\u2019m not getting the exact result in homework" }, { "text": "Check the solutions from the 2021 iteration of the course. You should use roc_auc_score.", "section": "4. Evaluation Metrics for Classification", "question": "Use AUC to evaluate feature importance of numerical variables" }, { "text": "When calculating the ROC AUC score using sklearn.metrics.roc_auc_score the function expects two parameters \u201cy_true\u201d and \u201cy_score\u201d. So for each numerical value in the dataframe it will be passed as the \u201cy_score\u201d to the function and the target variable will get passed a \u201cy_true\u201d each time.\nSylvia Schmitt", "section": "4. Evaluation Metrics for Classification", "question": "Help with understanding: \u201cFor each numerical value, use it as score and compute AUC\u201d" }, { "text": "You must use the `dt_val` dataset to compute the metrics asked in Question 3 and onwards, as you did in Question 2.\nDiego Giraldo", "section": "4. Evaluation Metrics for Classification", "question": "What dataset should I use to compute the metrics in Question 3" }, { "text": "What does this line do?\nKFold(n_splits=n_splits, shuffle=True, random_state=1)\nIf I do it inside the loop [0.01, 0.1, 1, 10] or outside the loop in Q6, HW04 it doesn't make any difference to my answers. I am wondering why and what is the right way, although it doesn't make a difference!\nDid you try using a different random_state? From my understanding, KFold just makes N (which is equal to n_splits) separate pairs of datasets (train+val).\nhttps://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html\nIn my case changing random state changed results\n(Arthur Minakhmetov)\nChanging the random state makes a difference in my case too, but not whether it is inside or outside the for loop. I think I have got the answer. kFold = KFold(n_splits=n_splits, shuffle = True, random_state = 1) is just a generator object and it contains only the information n_splits, shuffle and random_state. The k-fold splitting actually happens in the next for loop for train_idx, val_idx in kFold.split(df_full_train): . So it doesn't matter where we generate the object, before or after the first loop. It will generate the same information. But from the programming point of view, it is better to do it before the loop. No point doing it again and again inside the loop\n(Bhaskar Sarma)\nIn case of KFold(n_splits=n_splits, shuffle=True, random_state=1) and C= [0.01, 0.1, 1, 10], it is better to loop through the different values of Cs as the video explained. I had separate train() and predict() functions, which were reused after dividing the dataset via KFold. The model ran about 10 minutes and provided a good score.\n(Ani Mkrtumyan)", "section": "4. Evaluation Metrics for Classification", "question": "What does KFold do?" }, { "text": "I\u2019m getting \u201cValueError: multi_class must be in ('ovo', 'ovr')\u201d when using roc_auc_score to evaluate feature importance of numerical variables in question 1.\nI was getting this error because I was passing the parameters to roc_auc_score incorrectly (df_train[col] , y_train) . The correct way is to pass the parameters in this way: roc_auc_score(y_train, df_train[col])\nAsia Saeed", "section": "4. 
Evaluation Metrics for Classification", "question": "ValueError: multi_class must be in ('ovo', 'ovr')" }, { "text": "from tqdm.auto import tqdm\nTqdm - terminal progress bar\nKrishna Anand", "section": "4. Evaluation Metrics for Classification", "question": "Monitoring Wait times and progress of the code execution can be done with:" }, { "text": "Inverting or negating variables with ROC AUC scores less than the threshold is a valuable technique to improve feature importance and model performance when dealing with negatively correlated features. It helps ensure that the direction of the correlation aligns with the expectations of most machine learning algorithms.\nAileah Gotladera", "section": "4. Evaluation Metrics for Classification", "question": "What is the use of inverting or negating the variables less than the threshold?" }, { "text": "In case of using predict(X) for this task we are getting the binary classification predictions which are 0 and 1. This may lead to incorrect evaluation values.\nThe solution is to use predict_proba(X)[:,1], where we get the probability that the value belongs to one of the classes.\nVladimir Yesipov\nPredict_proba shows probailites per class.\nAni Mkrtumyan", "section": "4. Evaluation Metrics for Classification", "question": "Difference between predict(X) and predict_proba(X)[:, 1]" }, { "text": "For churn/not churn predictions, I need help to interpret the following scenario please, what is happening when:\nThe threshold is 1.0\nFPR is 0.0\nAnd TPR is 0.0\nWhen the threshold is 1.0, the condition for belonging to the positive class (churn class) is g(x)>=1.0 But g(x) is a sigmoid function for a binary classification problem. It has values between 0 and 1. This function never becomes equal to outermost values, i.e. 0 and 1.\nThat is why there is no object, for which churn-condition could be satisfied. And that is why there is no any positive (churn) predicted value (neither true positive, nor false positive), if threshold is equal to 1.0\nAlena Kniazeva", "section": "4. Evaluation Metrics for Classification", "question": "Why are FPR and TPR equal to 0.0, when threshold = 1.0?" }, { "text": "Matplotlib has a cool method to annotate where you could provide an X,Y point and annotate with an arrow and text. For example this will show an arrow pointing to the x,y point optimal threshold.\nplt.annotate(f'Optimal Threshold: {optimal_threshold:.2f}\\nOptimal F1 Score: {optimal_f1_score:.2f}',\nxy=(optimal_threshold, optimal_f1_score),\nxytext=(0.3, 0.5),\ntextcoords='axes fraction',\narrowprops=dict(facecolor='black', shrink=0.05))\nQuinn Avila", "section": "4. Evaluation Metrics for Classification", "question": "How can I annotate a graph?" }, { "text": "It's a complex and abstract topic and it requires some time to understand. You can move on without fully understanding the concept.\nNonetheless, it might be useful for you to rewatch the video, or even watch videos/lectures/notes by other people on this topic, as the ROC AUC is one of the most important metrics used in Binary Classification models.", "section": "4. Evaluation Metrics for Classification", "question": "I didn\u2019t fully understand the ROC curve. Can I move on?" }, { "text": "One main reason behind that, is the way of splitting data. 
For example, we want to split data into train/validation/test with the ratios 60%/20%/20% respectively.\nAlthough the following two options end up with the same ratio, the data itself is a bit different and not 100% matching in each case.\n1)\ndf_train, df_temp = train_test_split(df, test_size=0.4, random_state=42)\ndf_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)\n2)\ndf_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)\ndf_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)\nTherefore, I would recommend using the second method which is more consistent with the lessons and thus the homeworks.\nIbraheem Taha", "section": "4. Evaluation Metrics for Classification", "question": "Why do I have different values of accuracy than the options in the homework?" }, { "text": "You can find the intercept between these two curves using numpy diff (https://numpy.org/doc/stable/reference/generated/numpy.diff.html ) and sign (https://numpy.org/doc/stable/reference/generated/numpy.sign.html):\nI suppose here that you have your df_scores ready with your three columns \u2018threshold\u2019, \u2018precision\u2019 and \u2018recall\u2019:\nYou want to know at which index (or indices) you have your intercept between precision and recall (namely: where the sign of the difference between precision and recall changes):\nidx = np.argwhere(\nnp.diff(\nnp.sign(np.array(df_scores[\"precision\"]) - np.array(df_scores[\"recall\"]))\n)\n).flatten()\nYou can print the result to easily read it:\nprint(\nf\"The precision and recall curves intersect at a threshold equal to {df_scores.loc[idx]['threshold']}.\"\n)\n(M\u00e9lanie Fouesnard)", "section": "4. Evaluation Metrics for Classification", "question": "How to find the intercept between precision and recall curves by using numpy?" }, { "text": "In the demonstration video, we are shown how to calculate the precision and recall manually. You can use the Scikit Learn library to calculate the confusion matrix. precision, recall, f1_score without having to first define true positive, true negative, false positive, and false negative.\nfrom sklearn.metrics import precision_score, recall_score, f1_score\nprecision_score(y_true, y_pred, average='binary')\nrecall_score(y_true, y_pred, average='binary')\nf1_score(y_true, y_pred, average='binary')\nRadikal Lukafiardi", "section": "4. Evaluation Metrics for Classification", "question": "Compute Recall, Precision, and F1 Score using scikit-learn library" }, { "text": "Cross-validation evaluates the performance of a model and chooses the best hyperparameters. Cross-validation does this by splitting the dataset into multiple parts (folds), typically 5 or 10. It then trains and evaluates your model multiple times, each time using a different fold as the validation set and the remaining folds as the training set.\n\"C\" is a hyperparameter that is typically associated with regularization in models like Support Vector Machines (SVM) and logistic regression.\nSmaller \"C\" values: They introduce more regularization, which means the model will try to find a simpler decision boundary, potentially underfitting the data. This is because it penalizes the misclassification of training examples more severely.\nLarger \"C\" values: They reduce the regularization effect, allowing the model to fit the training data more closely, potentially overfitting. 
This is because it penalizes misclassification less severely, allowing the model to prioritize getting training examples correct.\nAminat Abolade", "section": "4. Evaluation Metrics for Classification", "question": "Why do we use cross validation?" }, { "text": "Model evaluation metrics can be easily computed using off the shelf calculations available in scikit learn library. This saves a lot of time and more precise compared to our own calculations from the scratch using numpy and pandas libraries.\nfrom sklearn.metrics import (accuracy_score,\nprecision_score,\nrecall_score,\nf1_score,\nroc_auc_score\n)\naccuracy = accuracy_score(y_val, y_pred)\nprecision = precision_score(y_val, y_pred)\nrecall = recall_score(y_val, y_pred)\nf1 = f1_score(y_val, y_pred)\nroc_auc = roc_auc_score(y_val, y_pred)\nprint(f'Accuracy: {accuracy}')\nprint(f'Precision: {precision}')\nprint(f'Recall: {recall}')\nprint(f'F1-Score: {f1}')\nprint(f'ROC AUC: {roc_auc}')\n(Harish Balasundaram)", "section": "4. Evaluation Metrics for Classification", "question": "Evaluate the Model using scikit learn metrics" }, { "text": "Scikit-learn offers another way: precision_recall_fscore_support\nExample:\nfrom sklearn.metrics import precision_recall_fscore_support\nprecision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_pred, zero_division=0)\n(Gopakumar Gopinathan)", "section": "4. Evaluation Metrics for Classification", "question": "Are there other ways to compute Precision, Recall and F1 score?" }, { "text": "- ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.\n- The reason for this recommendation is that ROC curves present an optimistic picture of the model on datasets with a class imbalance.\n-This is because of the use of true negatives in the False Positive Rate in the ROC Curve and the careful avoidance of this rate in the Precision-Recall curve.\n- If the proportion of positive to negative instances changes in a test set, the ROC curves will not change. Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix. As a class distribution changes these measures will change as well, even if the fundamental classifier performance does not. ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so cannot give an accurate picture of performance when there is class imbalance.\n(Anudeep Vanjavakam)", "section": "4. Evaluation Metrics for Classification", "question": "When do I use ROC vs Precision-Recall curves?" }, { "text": "You can use roc_auc_score function from sklearn.metrics module and pass the vector of the target variable (\u2018above_average\u2019) as the first argument and the vector of feature values as the second one. This function will return AUC score for the feature that was passed as a second argument.\n(Denys Soloviov)", "section": "4. Evaluation Metrics for Classification", "question": "How to evaluate feature importance for numerical variables with AUC?" }, { "text": "Precision-recall curve, and thus the score, explicitly depends on the ratio of positive to negative test cases. This means that comparison of the F-score across different problems with differing class ratios is problematic. One way to address this issue is to use a standard class ratio when making such comparisons.\n(George Chizhmak)", "section": "4. 
Evaluation Metrics for Classification", "question": "Dependence of the F-score on class imbalance" }, { "text": "We can import precision_recall_curve from scikit-learn and plot the graph as follows:\nfrom sklearn.metrics import precision_recall_curve\nprecision, recall, thresholds = precision_recall_curve(y_val, y_predict)\nplt.plot(thresholds, precision[:-1], label='Precision')\nplt.plot(thresholds, recall[:-1], label='Recall')\nplt.legend()\nHrithik Kumar Advani", "section": "4. Evaluation Metrics for Classification", "question": "Quick way to plot Precision-Recall Curve" }, { "text": "For multiclass classification it is important to keep class balance when you split the data set. In this case Stratified k-fold returns folds that contains approximately the sme percentage of samples of each classes.\nPlease check the realisation in sk-learn library:\nhttps://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold\nOlga Rudakova", "section": "5. Deploying Machine Learning Models", "question": "What is Stratified k-fold?" }, { "text": "Week 5 HW: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/05-deployment/homework.md\nAll HWs: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/\nHW 3 Solution: https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/03-classification/homework_3.ipynb\nEvaluation Matrix: https://docs.google.com/spreadsheets/d/e/2PACX-1vQCwqAtkjl07MTW-SxWUK9GUvMQ3Pv_fF8UadcuIYLgHa0PlNu9BRWtfLgivI8xSCncQs82HDwGXSm3/pubhtml\nGitHub for theory: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp\nYouTube Link: 5.X --- https://www.youtube.com/watch?v=agIFak9A3m8&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=49\n~~~ Nukta Bhatia ~~~", "section": "5. Deploying Machine Learning Models", "question": "How do I get started with Week 5?" }, { "text": "While weeks 1-4 can relatively easily be followed and the associated homework completed with just about any default environment / local setup, week 5 introduces several layers of abstraction and dependencies.\nIt is advised to prepare your \u201chomework environment\u201d with a cloud provider of your choice. A thorough step-by-step guide for doing so for an AWS EC2 instance is provided in an introductory video taken from the MLOPS course here:\nhttps://www.youtube.com/watch?v=IXSiYkP23zo\nNote that (only) small AWS instances can be run for free, and that larger ones will be billed hourly based on usage (but can and should be stopped when not in use).\nAlternative ways are sketched here:\nhttps://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md", "section": "5. Deploying Machine Learning Models", "question": "Errors related to the default environment: WSL, Ubuntu, proper Python version, installing pipenv etc." }, { "text": "You\u2019ll need a kaggle account\nGo to settings, API and click `Create New Token`. 
This will download a `kaggle.json` file which contains your `username` and `key` information\nIn the same location as your Jupyter NB, place the `kaggle.json` file\nRun `!chmod 600 <ENTER YOUR FILEPATH>/kaggle.json`\nMake sure to import os via `import os` and then run:\nos.environ['KAGGLE_CONFIG_DIR'] = <STRING OF YOUR FILE PATH>\nFinally you can run directly in your NB: `!kaggle datasets download -d kapturovalexander/bank-credit-scoring`\nAnd then you can unzip the file and access the CSV via: `!unzip -o bank-credit-scoring.zip`\n>>> Michael Fronda <<<", "section": "5. Deploying Machine Learning Models", "question": "How to download CSV data via Jupyter NB and the Kaggle API, for one seamless experience" }, { "text": "cd .. (go up one directory)\nls (list the contents of the current directory)\ncd \u2018path\u2019/ (go to this path)\npwd (print the current working directory)\ncat \u2018file name\u2019 (print the contents of a text file; use an editor such as nano or vim to edit it)\nAileah Gotladera", "section": "5. Deploying Machine Learning Models", "question": "Basic Ubuntu Commands:" }, { "text": "Open a terminal and type the command below to check the version on your laptop\npython3 --version\nFor Windows,\nVisit the official python website at https://www.python.org/downloads/ to download the python version you need for installation\nRun the installer and ensure to check the box that says \u201cAdd Python to PATH\u201d during installation and complete the installation by following the prompts\nNote that pip cannot upgrade the Python interpreter itself, so use the installer above or your system\u2019s package manager to move to Python 3.10 or higher\nAminat Abolade", "section": "5. Deploying Machine Learning Models", "question": "Installing and updating to Python version 3.10 and higher" }, { "text": "It is quite simple, and you can follow these instructions here:\nhttps://www.youtube.com/watch?v=qYlgUDKKK5A&ab_channel=NeuralNine\nMake sure that you have the \u201cVirtual Machine Platform\u201d feature activated in your Windows \u201cFeatures\u201d. To do that, search \u201cfeatures\u201d in the search bar and see if the checkbox is selected. You also need to make sure that your system (in the BIOS) is able to virtualize. This is usually the case.\nIn the Microsoft Store: look for \u2018Ubuntu\u2019 or \u2018Debian\u2019 (or any linux distribution you want) and install it\nOnce it is downloaded, open the app and choose a username and a password (a secure one). When you type your password, nothing will show in the window, which is normal: the typed characters are simply not displayed.\nYou are now inside your Linux system, not in your Windows system. You can test some commands such as \u201cpwd\u201d.\nTo go to your Windows file system: go up two levels with cd ../.. and then go to the \u201cmnt\u201d directory with cd mnt. If you list the files there, you will see your disks. You can move to the desired folder, for example here I moved to the ML_Zoomcamp folder:\nPython should already be installed, but you can check it by running python3 --version (or install it with sudo apt install python3).\nYou can make your current folder the default folder when you open your Ubuntu terminal with this command: echo \"cd ../../mnt/your/folder/path\" >> ~/.bashrc\nYou can disable bell sounds (for example, when you type something that does not exist) by modifying the inputrc file with this command: sudo vim /etc/inputrc\nYou have to uncomment the set bell-style none line -> to do that, press the \u201ci\u201d key (for insert) and move the cursor to this line. 
Delete the # and then press the Escape keyboard touch and finally press \u201c:wq\u201d to write (it saves your modifications) then quit.\nYou can check that your modifications are taken into account by opening a new terminal (you can pin it to your task bar so you do not have to go to the Microsoft app each time).\nYou will need to install pip by running this command sudo apt install python3-pip\nNB: I had this error message when trying to install pipenv (https://github.com/microsoft/WSL/issues/5663):\n/sbin/ldconfig.real: Can't link /usr/lib/wsl/lib/libnvoptix_loader.so.1 to libnvoptix.so.1\n/sbin/ldconfig.real: /usr/lib/wsl/lib/libcuda.so.1 is not a symbolic link\nSo I had to create the following symbolic link:\nsudo ln -s /usr/lib/wsl/lib/libcuda.so.1 /usr/lib64/libcuda.so\n(M\u00e9lanie Fouesnard)", "section": "5. Deploying Machine Learning Models", "question": "How to install WSL on Windows 10 and 11 ?" }, { "text": "Do you get errors building the Docker image on the Mac M1 chipset?\nThe error I was getting was:\nCould not open '/lib64/ld-linux-x86-64.so.2': No such file or directory\nThe fix (from here): vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv\nOpen mlbookcamp-code/course-zoomcamp/01-intro/environment/Dockerfile\nReplace line 1 with\nFROM --platform=linux/amd64 ubuntu:latest\nNow build the image as specified. In the end it took over 2 hours to build the image but it did complete in the end.\nDavid Colton", "section": "5. Deploying Machine Learning Models", "question": "Error building Docker images on Mac with M1 silicon" }, { "text": "Import waitress\nprint(waitress.__version__)\nKrishna Anand", "section": "5. Deploying Machine Learning Models", "question": "Method to find the version of any install python libraries in jupyter notebook" }, { "text": "Working on getting Docker installed - when I try running hello-world I am getting the error.\nDocker: Cannot connect to the docker daemon at unix:///var/run/docker.sock. Is the Docker daemon running ?\nSolution description\nIf you\u2019re getting this error on WSL, re-install your docker: remove the docker installation from WSL and install Docker Desktop on your host machine (Windows).\nOn Linux, start the docker daemon with either of these commands:\nsudo dockerd\nsudo service docker start\nAdded by Ugochukwu Onyebuchi", "section": "5. Deploying Machine Learning Models", "question": "Cannot connect to the docker daemon. Is the Docker daemon running?" }, { "text": "After using the command \u201cdocker build -t churn-prediction .\u201d to build the Docker image, the above error is raised and the image is not created.\nIn your Dockerfile, change the Python version in the first line the Python version installed in your system:\nFROM python:3.7.5-slim\nTo find your python version, use the command python --version. For example:\npython --version\n>> Python 3.9.7\nThen, change it on your Dockerfile:\nFROM python:3.9.7-slim\nAdded by Filipe Melo", "section": "5. Deploying Machine Learning Models", "question": "The command '/bin/sh -c pipenv install --deploy --system && rm -rf /root/.cache' returned a non-zero code: 1" }, { "text": "When the facilitator was adding sklearn to the virtual environment in the lectures, he used sklearn==0.24.1 and it ran smoothly. But while doing the homework and you are asked to use the 1.0.2 version of sklearn, it gives errors.\nThe solution is to use the full name of sklearn. 
That is, run it as \u201cpipenv install scikit-learn==1.0.2\u201d and the error will go away, allowing you to install sklearn for the version in your virtual environment.\nOdimegwu David\nHomework asks you to install 1.3.1\nPipenv install scikit-learn==1.3.1\nUse Pipenv to install Scikit-Learn version 1.3.1\nGopakumar Gopinathan", "section": "5. Deploying Machine Learning Models", "question": "Running \u201cpipenv install sklearn==1.0.2\u201d gives errors. What should I do?" }, { "text": "What is the reason we don\u2019t want to keep the docker image in our system and why do we need to run docker containers with `--rm` flag?\nFor best practice, you don\u2019t want to have a lot of abandoned docker images in your system. You just update it in your folder and trigger the build one more time.\nThey consume extra space on your disk. Unless you don\u2019t want to re-run the previously existing containers, it is better to use the `--rm` option.\nThe right way to say: \u201cWhy do we remove the docker container in our system?\u201d. Well the docker image is still kept; it is the container that is not kept. Upon execution, images are not modified; only containers are.\nThe option `--rm` is for removing containers. The images remain until you remove them manually. If you don\u2019t specify a version when building an image, it will always rebuild and replace the latest tag. `docker images` shows you all the image you have pulled or build so far.\nDuring development and testing you usually specify `--rm` to get the containers auto removed upon exit. Otherwise they get accumulated in a stopped state, taking up space. `docker ps -a` shows you all the containers you have in your host. Each time you change Pipfile (or any file you baked into the container), you rebuild the image under the same tag or a new tag. It\u2019s important to understand the difference between the term \u201cdocker image\u201d and \u201cdocker container\u201d. Image is what we build with all the resources baked in. You can move it around, maintain it in a repository, share it. Then we use the image to spin up instances of it and they are called containers.\nAdded by Muhammad Awon", "section": "5. Deploying Machine Learning Models", "question": "Why do we need the --rm flag" }, { "text": "When you create the dockerfile the name should be dockerfile and needs to be without extension. One of the problems we can get at this point is to create the dockerfile as a dockerfile extension Dockerfile.dockerfile which creates an error when we build the docker image. Instead we just need to create the file without extension: Dockerfile and will run perfectly.\nAdded by Pastor Soto", "section": "5. Deploying Machine Learning Models", "question": "Failed to read Dockerfile" }, { "text": "Refer to the page https://docs.docker.com/desktop/install/mac-install/ remember to check if you have apple chip or intel chip.", "section": "5. Deploying Machine Learning Models", "question": "Install docker on MacOS" }, { "text": "Problem: When I am trying to pull the image with the docker pull svizor/zoomcamp-model command I am getting an error:\nUsing default tag: latest\nError response from daemon: manifest for svizor/zoomcamp-model:latest not found: manifest unknown: manifest unknown\nSolution: The docker by default uses the latest tag to avoid this use the correct tag from image description. In our case use command:\ndocker pull svizor/zoomcamp-model:3.10.12-slim\nAdded by Vladimir Yesipov", "section": "5. 
Deploying Machine Learning Models", "question": "I cannot pull the image with docker pull command" }, { "text": "Using the command docker images or docker image ls will dump all information for all local Docker images. It is possible to dump the information only for a specified image by using:\ndocker image ls <image name>\nOr alternatively:\ndocker images <image name>\nIn action to that it is possible to only dump specific information provided using the option --format which will dump only the size for the specified image name when using the command below:\ndocker image ls --format \"{{.Size}}\" <image name>\nOr alternatively:\ndocker images --format \"{{.Size}}\" <image name>\nSylvia Schmitt", "section": "5. Deploying Machine Learning Models", "question": "Dumping/Retrieving only the size of for a specific Docker image" }, { "text": "It creates them in\nOSX/Linux: ~/.local/share/virtualenvs/folder-name_cyrptic-hash\nWindows: C:\\Users\\<USERNAME>\\.virtualenvs\\folder-name_cyrptic-hash\nEg: C:\\Users\\Ella\\.virtualenvs\\code-qsdUdabf (for module-05 lesson)\nThe environment name is the name of the last folder in the folder directory where we used the pipenv install command (or any other pipenv command). E.g. If you run any pipenv command in folder path ~/home/user/Churn-Flask-app, it will create an environment named Churn-Flask-app-some_random_characters, and it's path will be like this: /home/user/.local/share/virtualenvs/churn-flask-app-i_mzGMjX.\nAll libraries of this environment will be installed inside this folder. To activate this environment, I will need to cd into the project folder again, and type pipenv shell. In short, the location of the project folder acts as an identifier for an environment, in place of any name.\n(Memoona Tahira)", "section": "5. Deploying Machine Learning Models", "question": "Where does pipenv create environments and how does it name them?" }, { "text": "Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)", "section": "5. Deploying Machine Learning Models", "question": "How do I debug a docker container?" }, { "text": "$ docker exec -it 1e5a1b663052 bash\nthe input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'\nFix:\nwinpty docker exec -it 1e5a1b663052 bash\nA TTY is a terminal interface that supports escape sequences, moving the cursor around, etc.\nWinpty is a Windows software package providing an interface similar to a Unix pty-master for communicating with Windows console programs.\nMore info on terminal, shell, console applications hi and so on:\nhttps://conemu.github.io/en/TerminalVsShell.html\n(Marcos MJD)", "section": "5. Deploying Machine Learning Models", "question": "The input device is not a TTY when running docker in interactive mode (Running Docker on Windows in GitBash)" }, { "text": "Initially, I did not assume there was a model2. I copied the original model1.bin and dv.bin. Then when I tried to load using\nCOPY [\"model2.bin\", \"dv.bin\", \"./\"]\nthen I got the error above in MINGW64 (git bash) on Windows.\nThe temporary solution I found was to use\nCOPY [\"*\", \"./\"]\nwhich I assume combines all the files from the original docker image and the files in your working directory.\nAdded by Muhammed Tan", "section": "5. 
Deploying Machine Learning Models", "question": "Error: failed to compute cache key: \"/model2.bin\" not found: not found" }, { "text": "Create a virtual environment using the Cmd command (command) and use pip freeze command to write the requirements in the text file\nKrishna Anand", "section": "5. Deploying Machine Learning Models", "question": "Failed to write the dependencies to pipfile and piplock file" }, { "text": "f-String not properly keyed in: does anyone knows why i am getting error after import pickle?\nThe first error showed up because your f-string is using () instead of {} around C. So, should be: f\u2019model_C={C}.bin\u2019\nThe second error as noticed by Sriniketh, your are missing one parenthesis it should be pickle.dump((dv, model), f_out)\n(Humberto R.)", "section": "5. Deploying Machine Learning Models", "question": "f-strings" }, { "text": "This error happens because pipenv is already installed but you can't access it from the path.\nThis error comes out if you run.\npipenv --version\npipenv shell\nSolution for Windows\nOpen this option\nClick here\nClick in Edit Button\nMake sure the next two locations are on the PATH, otherwise, add it.\nC:\\Users\\AppData\\....\\Python\\PythonXX\\\nC:\\Users\\AppData\\....\\Python\\PythonXX\\Scripts\\\nAdded by Alejandro Aponte\nNote: this answer assumes you don\u2019t use Anaconda. For Windows, using Anaconda would be a better choice and less prone to errors.", "section": "5. Deploying Machine Learning Models", "question": "'pipenv' is not recognized as an internal or external command, operable program or batch file." }, { "text": "Following the instruction from video week-5.6, using pipenv to install python libraries throws below error\nSolution to this error is to make sure that you are working with python==3.9 (as informed in the very first lesson of the zoomcamp) and not python==3.10.\nAdded by Hareesh Tummala", "section": "5. Deploying Machine Learning Models", "question": "AttributeError: module \u2018collections\u2019 has no attribute \u2018MutableMapping\u2019" }, { "text": "After entering `pipenv shell` don\u2019t forget to use `exit` before `pipenv --rm`, as it may cause errors when trying to install packages, it is unclear whether you are \u201cin the shell\u201d(using Windows) at the moment as there are no clear markers for it.\nIt can also mess up PATH, if that\u2019s the case, here\u2019s terminal commands for fixing that:\n# for Windows\nset VIRTUAL_ENV \"\"\n# for Unix\nexport VIRTUAL_ENV=\"\"\nAlso manually re-creating removed folder at `C:\\Users\\username\\.virtualenvs\\removed-envname` can help, removed-envname can be seen at the error message.\nAdded by Andrii Larkin", "section": "5. Deploying Machine Learning Models", "question": "Q: ValueError: Path not found or generated: WindowsPath('C:/Users/username/.virtualenvs/envname/Scripts')" }, { "text": "Set the host to \u20180.0.0.0\u2019 on the flask app and dockerfile then RUN the url using localhost.\n(Theresa S.)", "section": "5. Deploying Machine Learning Models", "question": "ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))" }, { "text": "Solution:\nThis error occurred because I used single quotes around the filenames. Stick to double quotes", "section": "5. Deploying Machine Learning Models", "question": "docker build ERROR [x/y] COPY \u2026" }, { "text": "I tried the first solution on Stackoverflow which recommended running `pipenv lock` to update the Pipfile.lock. 
However, this didn\u2019t resolve it. But the following switch to the pipenv installation worked\nRUN pipenv install --system --deploy --ignore-pipfile", "section": "5. Deploying Machine Learning Models", "question": "Fix error during installation of Pipfile inside Docker container" }, { "text": "Solution\nThis error was because there was another instance of gunicorn running. So I thought of removing this along with the zoomcamp_test image. However, it didn\u2019t let me remove the orphan container. So I did the following\nRunning the following commands\ndocker ps -a <to list all docker containers>\ndocker images <to list images>\ndocker stop <container ID>\ndocker rm <container ID>\ndocker rmi image\nI rebuilt the Docker image, and ran it once again; this time it worked correctly and I was able to serve the test script to the endpoint.", "section": "5. Deploying Machine Learning Models", "question": "How to fix error after running the Docker run command" }, { "text": "I was getting the below error when I rebuilt the docker image although the port was not allocated, and it was working fine.\nError message:\nError response from daemon: driver failed programming external connectivity on endpoint beautiful_tharp (875be95c7027cebb853a62fc4463d46e23df99e0175be73641269c3d180f7796): Bind for 0.0.0.0:9696 failed: port is already allocated.\nSolution description\nIssue has been resolved running the following command:\ndocker kill $(docker ps -q)\nhttps://github.com/docker/for-win/issues/2722\nAsia Saeed", "section": "5. Deploying Machine Learning Models", "question": "Bind for 0.0.0.0:9696 failed: port is already allocated" }, { "text": "I was getting the error on client side with this\nClient Side:\nFile \"C:\\python\\lib\\site-packages\\urllib3\\connectionpool.py\", line 703, in urlopen \u2026\u2026\u2026\u2026\u2026\u2026\u2026..\nraise ConnectionError(err, request=request)\nrequests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))\nSevrer Side:\nIt showed error for gunicorn\nThe waitress cmd was running smoothly from server side\nSolution:\nUse the ip-address as 0.0.0.0:8000 or 0.0.0.0:9696.They are the ones which do work max times\nAamir Wani", "section": "5. Deploying Machine Learning Models", "question": "Bind for 127.0.0.1:5000 showing error" }, { "text": "Install it by using command\n% brew install md5sha1sum\nThen run command to check hash for file to check if they the same with the provided\n% md5sum model1.bin dv.bin\nOlga Rudakova", "section": "5. Deploying Machine Learning Models", "question": "Installing md5sum on Macos" }, { "text": "Problem description:\nI started a web-server in terminal (command window, powershell, etc.). How can I run another python script, which makes a request to this server?\nSolution description:\nJust open another terminal (command window, powershell, etc.) and run a python script.\nAlena Kniazeva", "section": "5. Deploying Machine Learning Models", "question": "How to run a script while a web-server is working?" }, { "text": "Problem description:\nIn video 5.5 when I do pipenv shell and then pipenv run gunicorn --bind 0.0.0.0:9696 predict:app, I get the following warning:\nUserWarning: Trying to unpickle estimator DictVectorizer from version 1.1.1 when using version 0.24.2. This might lead to breaking code or invalid results. 
Use at your own risk.\nSolution description:\nWhen you create a virtual env, you should use the same version of Scikit-Learn that you used for training the model; in this case it's 1.1.1. There is a version conflict, so we need to make sure our model and dv files are created with the same version we are using for the project. (A sketch for checking this at load time follows below.)\nBhaskar Sarma", "section": "5. Deploying Machine Learning Models", "question": "Version-conflict in pipenv" }, { "text": "If you install packages via pipenv install, and get an error that ends like this:\npipenv.vendor.plette.models.base.ValidationError: {'python_version': '3.9', 'python_full_version': '3.9.13'}\npython_full_version: 'python_version' must not be present with 'python_full_version'\npython_version: 'python_full_version' must not be present with 'python_version'\nDo this:\nopen Pipfile in nano editor, and remove either the python_version or python_full_version line, press CTRL+X, type Y and click Enter to save the changes\nType pipenv lock to create the Pipfile.lock.\nDone. Continue what you were doing", "section": "5. Deploying Machine Learning Models", "question": "Python_version and Python_full_version error after running pipenv install:" }, { "text": "If during running the docker build command, you get an error like this:\nYour Pipfile.lock (221d14) is out of date. Expected: (939fe0).\nUsage: pipenv install [OPTIONS] [PACKAGES]...\nERROR:: Aborting deploy\nOption 1: Delete the Pipfile.lock via rm Pipfile.lock, and then rebuild the lock via pipenv lock from the terminal before retrying the docker build command.\nOption 2: If it still doesn\u2019t work, remove the pipenv environment, Pipfile and Pipfile.lock, and create a new one before building docker again. Commands to remove the pipenv environment and the pipfiles:\npipenv --rm\nrm Pipfile*", "section": "5. Deploying Machine Learning Models", "question": "Your Pipfile.lock (221d14) is out of date (during Docker build)" }, { "text": "Ans: pip uninstall waitress mlflow. Then reinstall just mlflow. By this time you should have successfully built your docker image, so you don't need to reinstall waitress. All good. Happy learning.\nAdded by \ud83c\udd71\ud83c\udd7b\ud83c\udd70\ud83c\udd80", "section": "5. Deploying Machine Learning Models", "question": "You are using windows. Conda environment. You then use waitress instead of gunicorn. After a few runs, suddenly mlflow server fails to run." }, { "text": "Ans: so you have created the env. You need to make sure you're in eu-west-1 (Ireland) when you check the EB environments. Maybe you're in a different region in your console.\nAdded by Edidiong Esu", "section": "5. Deploying Machine Learning Models", "question": "Completed creating the environment locally but could not find the environment on AWS." }, { "text": "Running 'pip install waitress' as a command on GitBash was not downloading the executable file 'waitress-serve.exe'. You need this file to be able to run commands with waitress in Git Bash. To solve this:\nopen a Jupyter notebook and run the same command 'pip install waitress'. This way the executable file will be downloaded. The notebook may give you this warning: 'WARNING: The script waitress-serve.exe is installed in 'c:\\Users\\....\\anaconda3\\Scripts' which is not on PATH. 
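A hedged sketch for the version-conflict answer above; the idea of storing the scikit-learn version next to the artifacts, and the file name model.bin, are illustrative assumptions rather than course code:
```python
import pickle
import sklearn

MODEL_FILE = "model.bin"  # hypothetical file name

# at training time, store the library version next to the artifacts
def save_artifacts(dv, model):
    with open(MODEL_FILE, "wb") as f_out:
        pickle.dump({"sklearn_version": sklearn.__version__,
                     "dv": dv,
                     "model": model}, f_out)

# at serving time, warn if the environment differs from the training one
def load_artifacts():
    with open(MODEL_FILE, "rb") as f_in:
        artifacts = pickle.load(f_in)
    if artifacts["sklearn_version"] != sklearn.__version__:
        print(f"Warning: model was trained with scikit-learn "
              f"{artifacts['sklearn_version']}, "
              f"but {sklearn.__version__} is installed")
    return artifacts["dv"], artifacts["model"]
```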
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.'\nAdd the path where 'waitress-serve.exe' is installed into gitbash's PATH as such:\nenter the following command in gitbash: nano ~/.bashrc\nadd the path to 'waitress-serve.exe' to PATH using this command: export PATH=\"/path/to/waitress:$PATH\"\nclose gitbash and open it again and you should be good to go\nAdded by Bachar Kabalan", "section": "5. Deploying Machine Learning Models", "question": "Installing waitress on Windows via GitBash: \u201cwaitress-serve\u201d command not found" }, { "text": "Q2.1: Use Pipenv to install Scikit-Learn version 1.3.1\nThis is an error I got while executing the above step in the ml-zoomcamp conda environment. The error is not fatal and just warns you that explicit language specifications are not set out in our bash profile. A quick-fix is here:\nhttps://stackoverflow.com/questions/49436922/getting-error-while-trying-to-run-this-command-pipenv-install-requests-in-ma\nBut one can proceed without addressing it.\nAdded by Abhirup Ghosh", "section": "5. Deploying Machine Learning Models", "question": "Warning: the environment variable LANG is not set!" }, { "text": "The provided image FROM svizor/zoomcamp-model:3.10.12-slim has a model and dictvectorizer that should be used for question 6. \"model2.bin\", \"dv.bin\"\nAdded by Quinn Avila", "section": "5. Deploying Machine Learning Models", "question": "Module5 HW Question 6" }, { "text": "https://apps.microsoft.com/detail/windows-terminal/9N0DX20HK701?hl=es-419&gl=CO\nAdded by Dawuta Smit", "section": "5. Deploying Machine Learning Models", "question": "Terminal Used in Week 5 videos:" }, { "text": "Question:\nWhen running\npipenv run waitress-serve --listen=localhost:9696 q4-predict:app\nI get the following:\nThere was an exception (ValueError) importing your module.\nIt had these arguments:\n1. Malformed application 'q4-predict:app'\nAnswer:\nWaitress doesn\u2019t accept a dash in the python file name.\nThe solution is to rename the file replacing a dash with something else for instance with an underscore eg q4_predict.py\nAdded by Alex Litvinov", "section": "5. Deploying Machine Learning Models", "question": "waitress-serve shows Malformed application" }, { "text": "I wanted to have a fast and simple way to check if the HTTP POST requests are working just running a request from command line. This can be done running \u2018curl\u2019. \n(Used with WSL2 on Windows, should also work on Linux and MacOS)\ncurl --json '<json data>' <url>\n# piping the structure to the command\ncat <json file path> | curl --json @- <url>\necho '<json data>' | curl --json @- <url>\n# example using piping\necho '{\"job\": \"retired\", \"duration\": 445, \"poutcome\": \"success\"}'\\\n| curl --json @- http://localhost:9696/predict\nAdded by Sylvia Schmitt", "section": "5. 
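As a Python alternative to the curl-based test above, the same POST request can be sent with the requests library (the URL, port and example payload are placeholders in the style of the homework data):
```python
import requests

url = "http://localhost:9696/predict"  # note the explicit http:// scheme
client = {"job": "retired", "duration": 445, "poutcome": "success"}

# json= serializes the dict and sets the Content-Type header
response = requests.post(url, json=client)
print(response.status_code)
print(response.json())
```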
Deploying Machine Learning Models", "question": "Testing HTTP POST requests from command line using curl" }, { "text": "Question:\nWhen executing\neb local run --port 9696\nI get the following error:\nERROR: NotSupportedError - You can use \"eb local\" only with preconfigured, generic and multicontainer Docker platforms.\nAnswer:\nThere are two options to fix this:\nRe-initialize by running eb init -i and choosing the options from a list (the first default option for docker platform should be fine).\nEdit the \u2018.elasticbeanstalk/config.yml\u2019 directly changing the default_platform from Docker to default_platform: Docker running on 64bit Amazon Linux 2023\nThe disadvantage of the second approach is that the option might not be available the following years\nAdded by Alex Litvinov", "section": "5. Deploying Machine Learning Models", "question": "NotSupportedError - You can use \"eb local\" only with preconfigured, generic and multicontainer Docker platforms." }, { "text": "You need to include the protocol scheme: 'http://localhost:9696/predict'.\nWithout the http:// part, requests has no idea how to connect to the remote server.\nNote that the protocol scheme must be all lowercase; if your URL starts with HTTP:// for example, it won\u2019t find the http:// connection adapter either.\nAdded by George Chizhmak", "section": "5. Deploying Machine Learning Models", "question": "Requests Error: No connection adapters were found for 'localhost:9696/predict'." }, { "text": "While running the docker image if you get the same result check which model you are using.\nRemember you are using a model downloading model + python version so remember to change the model in your file when running your prediction test.\nAdded by Ahmed Okka", "section": "5. Deploying Machine Learning Models", "question": "Getting the same result" }, { "text": "Ensure that you used pipenv to install the necessary modules including gunicorn. As pipfiles for virtual environments, you can use pipenv shell and then build+run your docker image. - Akshar Goyal", "section": "5. Deploying Machine Learning Models", "question": "Trying to run a docker image I built but it says it\u2019s unable to start the container process" }, { "text": "You can copy files from your local machine into a Docker container using the docker cp command. Here's how to do it:\nTo copy a file or directory from your local machine into a running Docker container, you can use the `docker cp command`. The basic syntax is as follows:\ndocker cp /path/to/local/file_or_directory container_id:/path/in/container\nHrithik Kumar Advani", "section": "5. Deploying Machine Learning Models", "question": "How do I copy files from my local machine to docker container?" }, { "text": "You can copy files from your local machine into a Docker container using the docker cp command. Here's how to do it:\nIn the Dockerfile, you can provide the folder containing the files that you want to copy over. The basic syntax is as follows:\nCOPY [\"src/predict.py\", \"models/xgb_model.bin\", \"./\"]\t\t\t\t\t\t\t\t\t\t\tGopakumar Gopinathan", "section": "5. Deploying Machine Learning Models", "question": "How do I copy files from a different folder into docker container\u2019s working directory?" 
}, { "text": "I struggled with the command :\neb init -p docker tumor-diagnosis-serving -r eu-west-1\nWhich resulted in an error when running : eb local run --port 9696\nERROR: NotSupportedError - You can use \"eb local\" only with preconfigured, generic and multicontainer Docker platforms.\nI replaced it with :\neb init -p \"Docker running on 64bit Amazon Linux 2\" tumor-diagnosis-serving -r eu-west-1\nThis allowed the recognition of the Dockerfile and the build/run of the docker container.\nAdded by M\u00e9lanie Fouesnard", "section": "5. Deploying Machine Learning Models", "question": "I can\u2019t create the environment on AWS Elastic Beanstalk with the command proposed during the video" }, { "text": "I had this error when creating a AWS ElasticBean environment: eb create tumor-diagnosis-env\nERROR Instance deployment: Both 'Dockerfile' and 'Dockerrun.aws.json' are missing in your source bundle. Include at least one of them. The deployment failed.\nI did not committed the files used to build the container, particularly the Dockerfile. After a git add and git commit of the modified files, the command works.\nAdded by M\u00e9lanie Fouesnard", "section": "6. Decision Trees and Ensemble Learning", "question": "Dockerfile missing when creating the AWS ElasticBean environment" }, { "text": "Week 6 HW: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/06-trees/homework.md\nAll HWs: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/\nHW 4 Solution: https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/04-evaluation/homework_4.ipynb\nEvaluation Matrix: https://docs.google.com/spreadsheets/d/e/2PACX-1vQCwqAtkjl07MTW-SxWUK9GUvMQ3Pv_fF8UadcuIYLgHa0PlNu9BRWtfLgivI8xSCncQs82HDwGXSm3/pubhtml\nGitHub for theory: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp\nYouTube Link: 6.X --- https://www.youtube.com/watch?v=GJGmlfZoCoU&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=57\nFAQs: https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit#heading=h.lpz96zg7l47j\n~~~Nukta Bhatia~~~", "section": "6. Decision Trees and Ensemble Learning", "question": "How to get started with Week 6?" }, { "text": "During the XGBoost lesson, we created a parser to extract the training and validation auc from the standard output. However, we can accomplish that in a more straightforward way.\nWe can use the evals_result parameters, which takes an empty dictionary and updates it for each tree. Additionally, you can store the data in a dataframe and plot it in an easier manner.\nAdded by Daniel Coronel", "section": "6. Decision Trees and Ensemble Learning", "question": "How to get the training and validation metrics from XGBoost?" }, { "text": "You should create sklearn.ensemble.RandomForestRegressor object. It\u2019s rather similar to sklearn.ensemble.RandomForestClassificator for classification problems. Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html for more information.\nAlena Kniazeva", "section": "6. Decision Trees and Ensemble Learning", "question": "How to solve regression problems with random forest in scikit-learn?" 
}, { "text": "In question 6, I was getting ValueError: feature_names must be string, and may not contain [, ] or < when I was creating DMatrix for train and validation\nSolution description\nThe cause of this error is that some of the features names contain special characters like = and <, and I fixed the error by removing them as follows:\nfeatures= [i.replace(\"=<\", \"_\").replace(\"=\",\"_\") for i in features]\nAsia Saeed\nAlternative Solution:\nIn my case the equal sign \u201c=\u201d was not a problem, so in my opinion the first part of Asias solution features= [i.replace(\"=<\", \"_\") should work as well.\nFor me this works:\nfeatures = []\nfor f in dv.feature_names_:\nstring = f.replace(\u201c=<\u201d, \u201c-le\u201d)\nfeatures.append(string)\nPeter Ernicke", "section": "6. Decision Trees and Ensemble Learning", "question": "ValueError: feature_names must be string, and may not contain [, ] or <" }, { "text": "If you\u2019re getting this error, It is likely because the feature names in dv.get_feature_names_out() are a np.ndarray instead of a list so you have to convert them into a list by using the to_list() method.\nAli Osman", "section": "6. Decision Trees and Ensemble Learning", "question": "`TypeError: Expecting a sequence of strings for feature names, got: <class 'numpy.ndarray'> ` when training xgboost model." }, { "text": "If you\u2019re getting TypeError:\n\u201cTypeError: Expecting a sequence of strings for feature names, got: <class 'numpy.ndarray'>\u201d,\nprobably you\u2019ve done this:\nfeatures = dv.get_feature_names_out()\nIt gets you np.ndarray instead of list. Converting to list list(features) will not fix this, read below.\nIf you\u2019re getting ValueError:\n\u201cValueError: feature_names must be string, and may not contain [, ] or <\u201d,\nprobably you\u2019ve either done:\nfeatures = list(dv.get_feature_names_out())\nor:\nfeatures = dv.feature_names_\nreason is what you get from DictVectorizer here looks like this:\n['households',\n'housing_median_age',\n'latitude',\n'longitude',\n'median_income',\n'ocean_proximity=<1H OCEAN',\n'ocean_proximity=INLAND',\n'population',\n'total_bedrooms',\n'total_rooms']\nit has symbols XGBoost doesn\u2019t like ([, ] or <).\nWhat you can do, is either do not specify \u201cfeature_names=\u201d while creating xgb.DMatrix or:\nimport re\nfeatures = dv.feature_names_\npattern = r'[\\[\\]<>]'\nfeatures = [re.sub(pattern, ' ', f) for f in features]\nAdded by Andrii Larkin", "section": "6. Decision Trees and Ensemble Learning", "question": "Q6: ValueError or TypeError while setting xgb.DMatrix(feature_names=)" }, { "text": "To install Xgboost, use the code below directly in your jupyter notebook:\n(Pip 21.3+ is required)\npip install xgboost\nYou can update your pip by using the code below:\npip install --upgrade pip\nFor more about xgbboost and installation, check here:\nhttps://xgboost.readthedocs.io/en/stable/install.html\nAminat Abolade", "section": "6. Decision Trees and Ensemble Learning", "question": "How to Install Xgboost" }, { "text": "Sometimes someone might wonder what eta means in the tunable hyperparameters of XGBoost and how it helps the model.\nETA is the learning rate of the model. XGBoost uses gradient descent to calculate and update the model. In gradient descent, we are looking for the minimum weights that help the model to learn the data very well. This minimum weights for the features is updated each time the model passes through the features and learns the features during training. 
Tuning the learning rate helps you tell the model what speed it would use in deriving the minimum for the weights.", "section": "6. Decision Trees and Ensemble Learning", "question": "What is eta in XGBoost" }, { "text": "For ensemble algorithms, during the week 6, one bagging algorithm and one boosting algorithm were presented: Random Forest and XGBoost, respectively.\nRandom Forest trains several models in parallel. The output can be, for example, the average value of all the outputs of each model. This is called bagging.\nXGBoost trains several models sequentially: the previous model error is used to train the following model. Weights are used to ponderate the models such as the best models have higher weights and are therefore favored for the final output. This method is called boosting.\nNote that boosting is not necessarily better than bagging.\nM\u00e9lanie Fouesnard\nBagging stands for \u201cBootstrap Aggregation\u201d - it involves taking multiple samples with replacement to derive multiple training datasets from the original training dataset (bootstrapping), training a classifier (e.g. decision trees or stumps for Random Forests) on each such training dataset, and then combining the the predictions (aggregation) to obtain the final prediction. For classification, predictions are combined via voting; for regression, via averaging. Bagging can be done in parallel, since the various classifiers are independent. Bagging decreases variance (but not bias) and is robust against overfitting.\nBoosting, on the other hand, is sequential - each model learns from the mistakes of its predecessor. Observations are given different weights - observations/samples misclassified by the previous classifier are given a higher weight, and this process is continued until a stopping condition is reached (e.g. max. No. of models is reached, or error is acceptably small, etc.). Boosting reduces bias & is generally more accurate than bagging, but can be prone to overfitting.\nRileen", "section": "6. Decision Trees and Ensemble Learning", "question": "What is the difference between bagging and boosting?" }, { "text": "I wanted to directly capture the output from the xgboost training for multiple eta values to a dictionary without the need to run the same cell multiple times and manually editing the eta value in between or copy the code for a second eta value.\nUsing the magic cell command \u201c%%capture output\u201d I was only able to capture the complete output for all iterations for the loop, but. I was able to solve this using the following approach. This is just a code sample to grasp the idea.\n# This would be the content of the Jupyter Notebook cell\nfrom IPython.utils.capture import capture_output\nimport sys\ndifferent_outputs = {}\nfor i in range(3):\nwith capture_output(sys.stdout) as output:\nprint(i)\nprint(\"testing capture\")\ndifferent_outputs[i] = output.stdout\n# different_outputs\n# {0: '0\\ntesting capture\\n',\n# 1: '1\\ntesting capture\\n',\n# 2: '2\\ntesting capture\\n'}\nAdded by Sylvia Schmitt", "section": "6. Decision Trees and Ensemble Learning", "question": "Capture stdout for each iterations of a loop separately" }, { "text": "Calling roc_auc_score() to get auc is throwing the above error.\nSolution to this issue is to make sure that you pass y_actuals as 1st argument and y_pred as 2nd argument.\nroc_auc_score(y_train, y_pred)\nHareesh Tummala", "section": "6. 
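For the evals_result approach mentioned in the XGBoost answers above, a minimal sketch that collects train/validation AUC per boosting round without parsing stdout (the toy data and parameter values are placeholders):
```python
import numpy as np
import pandas as pd
import xgboost as xgb

# toy binary-classification data so the snippet runs on its own
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

dtrain = xgb.DMatrix(X[:400], label=y[:400])
dval = xgb.DMatrix(X[400:], label=y[400:])

params = {"eta": 0.3, "max_depth": 6,
          "objective": "binary:logistic", "eval_metric": "auc"}

evals_result = {}  # xgboost fills this dict round by round
model = xgb.train(
    params,
    dtrain,
    num_boost_round=20,
    evals=[(dtrain, "train"), (dval, "val")],
    evals_result=evals_result,
    verbose_eval=False,
)

scores = pd.DataFrame({
    "train_auc": evals_result["train"]["auc"],
    "val_auc": evals_result["val"]["auc"],
})
print(scores.tail())
```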
Decision Trees and Ensemble Learning", "question": "ValueError: continuous format is not supported" }, { "text": "When rmse stops improving means, when it stops to decrease or remains almost similar.\nPastor Soto", "section": "6. Decision Trees and Ensemble Learning", "question": "Question 3 of homework 6 if i see that rmse goes up at a certain number of n_estimators but then goes back down lower than it was before, should the answer be the number of n_estimators after which rmse initially went up, or the number after which it was its overall lowest value?" }, { "text": "dot_data = tree.export_graphviz(regr, out_file=None,\nfeature_names=boston.feature_names,\nfilled=True)\ngraphviz.Source(dot_data, format=\"png\")\nKrishna Anand\nfrom sklearn import tree\ntree.plot_tree(dt,feature_names=dv.feature_names_)\nAdded By Ryan Pramana", "section": "6. Decision Trees and Ensemble Learning", "question": "One of the method to visualize the decision trees" }, { "text": "Solution: This problem happens because you use DecisionTreeClassifier instead of DecisionTreeRegressor. You should check if you want to use a Decision tree for classification or regression.\nAlejandro Aponte", "section": "6. Decision Trees and Ensemble Learning", "question": "ValueError: Unknown label type: 'continuous'" }, { "text": "When I run dt = DecisionTreeClassifier() in jupyter in same laptop, each time I re-run it or do (restart kernel + run) I get different values of auc. Some of them are 0.674, 0.652, 0.642, 0.669 and so on. Anyone knows why it could be? I am referring to 7:40-7:45 of video 6.3.\nSolution: try setting the random seed e.g\ndt = DecisionTreeClassifier(random_state=22)\nBhaskar Sarma", "section": "6. Decision Trees and Ensemble Learning", "question": "Different values of auc, each time code is re-run" }, { "text": "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu", "section": "6. Decision Trees and Ensemble Learning", "question": "Does it matter if we let the Python file create the server or if we run gunicorn directly?" }, { "text": "When I tried to run example from the video using function ping and can not import it. I use import ping and it was unsuccessful. To fix it I use the statement:\n\nfrom [file name] import ping\nOlga Rudakova", "section": "6. Decision Trees and Ensemble Learning", "question": "No module named \u2018ping\u2019?" }, { "text": "The DictVectorizer has a function to get the feature names get_feature_names_out(). This is helpful for example if you need to analyze feature importance but use the dict vectorizer for one hot encoding. Just keep in mind it does return a numpy array so you may need to convert this to a list depending on your usage for example dv.get_feature_names_out() will return a ndarray array of string objects. list(dv.get_feature_names_out()) will convert to a standard list of strings. Also keep in mind that you first need to fit the predictor and response arrays before you have access to the feature names.\nQuinn Avila", "section": "6. Decision Trees and Ensemble Learning", "question": "DictVectorizer feature names" }, { "text": "They both do the same, it's just less typing from the script.", "section": "6. Decision Trees and Ensemble Learning", "question": "Does it matter if we let the Python file create the server or if we run gunicorn directly?" }, { "text": "This error occurs because the list of feature names contains some characters like \"<\" that are not supported. 
To fix this issue, you can replace those problematic characters with supported ones. If you want a consistent list of features with no special characters, you can replace the problematic characters with underscores, like so:\nfeatures = [f.replace('=<', '_').replace('=', '_') for f in features]\nThis code will go through the list of features and replace any instance of \"=<\" or \"=\" with \"_\", ensuring that the feature names only consist of supported characters.", "section": "6. Decision Trees and Ensemble Learning", "question": "ValueError: feature_names must be string, and may not contain [, ] or <" }, { "text": "To make it easier for us to determine which features are important, we can use a horizontal bar chart to illustrate feature importance sorted by value.\n1. # extract the feature importances from the model\nfeature_importances = list(zip(features_names, rdr_model.feature_importances_))\nimportance_df = pd.DataFrame(feature_importances, columns=['feature_names', 'feature_importances'])\n2. # sort the dataframe descending by feature_importances value\nimportance_df = importance_df.sort_values(by='feature_importances', ascending=False)\n3. # create a horizontal bar chart\nplt.figure(figsize=(8, 6))\nsns.barplot(x='feature_importances', y='feature_names', data=importance_df, palette='Blues_r')\nplt.xlabel('Feature Importance')\nplt.ylabel('Feature Names')\nplt.title('Feature Importance Chart')\nRadikal Lukafiardi", "section": "6. Decision Trees and Ensemble Learning", "question": "Visualize Feature Importance by using horizontal bar chart" }, { "text": "Instead of using np.sqrt() as a second step, you can get the RMSE directly like this:\nmean_squared_error(y_val, y_predict_val, squared=False)\nAhmed Okka", "section": "6. Decision Trees and Ensemble Learning", "question": "RMSE using metrics.mean_squared_error(squared=False)" }, { "text": "I like this visual implementation of feature importances in the scikit-learn library:\nhttps://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html\nIt actually adds std. errors to the feature importances, so that you can trace the stability of features (important for a model\u2019s explainability) over the different params of the model.\nIvan Brigida", "section": "6. Decision Trees and Ensemble Learning", "question": "Features Importance graph" }, { "text": "Expanded error says: xgboost.core.XGBoostError: sklearn needs to be installed in order to use this module. So, adding sklearn to the requirements solved the problem.\nGeorge Chizhmak", "section": "6. Decision Trees and Ensemble Learning", "question": "xgboost.core.XGBoostError: This app has encountered an error. The original error message is redacted to prevent data leaks." }, { "text": "Information gain in Y due to X, or the mutual information of Y and X:\nIG(Y, X) = I(Y; X) = H(Y) - H(Y | X)\nwhere H(Y) is the entropy of Y.\nIf X is completely uninformative about Y: I(Y; X) = 0\nIf X is completely informative about Y: I(Y; X) = H(Y)\nHrithik Kumar Advani", "section": "6. Decision Trees and Ensemble Learning", "question": "Information Gain" }, { "text": "Filling in missing values using the entire dataset before splitting it into training/validation/test sets causes data leakage: statistics computed from the validation and test rows leak into the training data.", "section": "6. Decision Trees and Ensemble Learning", "question": "Data Leakage" }, { "text": "Save the model by calling \u2018booster.save_model\u2019; load it back by calling \u2018load_model\u2019 on a fresh xgb.Booster().\nDawuta Smit\nThis section is moved to Projects", "section": "8. 
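A tiny worked example of the information-gain formula above (the toy labels and split are made up):
```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum(p * log2(p)) over the classes present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# toy split: 10 samples, feature X sends 6 to the left branch and 4 to the right
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
left = np.array([1, 1, 1, 1, 1, 0])   # labels in the left branch
right = np.array([0, 0, 0, 0])        # labels in the right branch

h_y = entropy(y)
h_y_given_x = (len(left) / len(y)) * entropy(left) + (len(right) / len(y)) * entropy(right)
info_gain = h_y - h_y_given_x  # H(Y) - H(Y | X)

print(round(h_y, 3), round(h_y_given_x, 3), round(info_gain, 3))
# If X were completely uninformative, the gain would be 0;
# if completely informative, the gain would equal H(Y).
```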
Neural Networks and Deep Learning", "question": "Serialized Model Xgboost error" }, { "text": "TODO", "section": "8. Neural Networks and Deep Learning", "question": "How to get started with Week 8?" }, { "text": "Create or import your notebook into Kaggle.\nClick on the Three dots at the top right hand side\nClick on Accelerator\nChoose T4 GPU\nKhurram Majeed", "section": "8. Neural Networks and Deep Learning", "question": "How to use Kaggle for Deep Learning?" }, { "text": "Create or import your notebook into Google Colab.\nClick on the Drop Down at the top right hand side\nClick on \u201cChange runtime type\u201d\nChoose T4 GPU\nKhurram Majeed", "section": "8. Neural Networks and Deep Learning", "question": "How to use Google Colab for Deep Learning?" }, { "text": "Connecting your GPU on Saturn Cloud to Github repository is not compulsory, since you can just download the notebook and copy it to the Github folder. But if you like technology to do things for you, then follow the solution description below:\nSolution description: Follow the instructions in these github docs to create an SSH private and public key:\nhttps://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-ke\ny-and-adding-it-to-the-ssh-agenthttps://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account?tool=webui\nThen the second video on this module about saturn cloud would show you how to add the ssh keys to secrets and authenticate through a terminal.\nOr alternatively, you could just use the public keys provided by Saturn Cloud by default. To do so, follow these steps:\nClick on your username and on manage\nDown below you will see the Git SSH keys section.\nCopy the default public key provided by Saturn Cloud\nPaste these key into the SSH keys section of your github repo\nOpen a terminal on Saturn Cloud and run this command \u201cssh -T git@github.com\u201d\nYou will receive a successful authentication notice.\nOdimegwu David", "section": "8. Neural Networks and Deep Learning", "question": "How do I push from Saturn Cloud to Github?" }, { "text": "This template is referred to in the video 8.1b Setting up the Environment on Saturn Cloud\nbut the location shown in the video is no longer correct.\nThis template has been moved to \u201cpython deep learning tutorials\u2019 which is shown on the Saturn Cloud home page.\nSteven Christolis", "section": "8. Neural Networks and Deep Learning", "question": "Where is the Python TensorFlow template on Saturn Cloud?" }, { "text": "The above error happens since module scipy is not installed in the saturn cloud tensorflow image. While creating the Jupyter server resource, in the \u201cExtra Packages\u201d section under pip in the textbox write scipy. Below the textbox, the pip install scipy command will be displayed. This will ensure when the resource spins up, the scipy package will be automatically installed. This approach can also be followed for additional python packages.\nSumeet Lalla", "section": "8. 
Neural Networks and Deep Learning", "question": "Getting error module scipy not found during model training in Saturn Cloud tensorflow image" }, { "text": "Problem description: Uploading the data to saturn cloud from kaggle can be time saving, specially if the dataset is large.\nYou can just download to your local machine and then upload to a folder on saturn cloud, but there is a better solution that needs to be set once and you have access to all kaggle datasets in saturn cloud.\nOn your notebook run:\n!pip install -q kaggle\nGo to Kaggle website (you need to have an account for this):\nClick on your profile image -> Account\nScroll down to the API box\nClick on Create New API token\nIt will download a json file with the name kaggle.json store on your local computer. We need to upload this file in the .kaggle folder\nOn the notebook click on folder icon on the left upper corner\nThis will take you to the root folder\nClick on the .kaggle folder\nOnce inside of the .kaggle folder upload the kaggle.json file that you downloaded\nRun this command on your notebook:\n!chmod 600 /home/jovyan/.kaggle/kaggle.json\nDownload the data using this command:\n!kaggle datasets download -d agrigorev/dino-or-dragon\nCreate a folder to unzip your files:\n!mkdir data\nUnzip your files inside that folder\n!unzip dino-or-dragon.zip -d data\nPastor Soto", "section": "8. Neural Networks and Deep Learning", "question": "How to upload kaggle data to Saturn Cloud?" }, { "text": "In order to run tensorflow with gpu on your local machine you\u2019ll need to setup cuda and cudnn.\nThe process can be overwhelming. Here\u2019s a simplified guide\nOsman Ali", "section": "8. Neural Networks and Deep Learning", "question": "How to install CUDA & cuDNN on Ubuntu 22.04" }, { "text": "Problem description:\nWhen loading saved model getting error: ValueError: Unable to load weights saved in HDF5 format into a subclassed Model which has not created its variables yet. Call the Model first, then load the weights.\nSolution description:\nBefore loading model need to evaluate the model on input data: model.evaluate(train_ds)\nAdded by Vladimir Yesipov", "section": "8. Neural Networks and Deep Learning", "question": "Error: (ValueError: Unable to load weights saved in HDF5 format into a subclassed Model which has not created its variables yet. Call the Model first, then load the weights.) when loading model." }, { "text": "Problem description:\nWhen follow module 8.1b video to setup git in Saturn Cloud, run `ssh -T git@github.com` lead error `git@github.com: Permission denied (publickey).`\nSolution description:\nAlternative way, we can setup git in our Saturn Cloud env with generate SSH key in our Saturn Cloud and add it to our git account host. After it, we can access/manage our git through Saturn\u2019s jupyter server. All steps detailed on this following tutorial: https://saturncloud.io/docs/using-saturn-cloud/gitrepo/\nAdded by Ryan Pramana", "section": "8. Neural Networks and Deep Learning", "question": "Getting error when connect git on Saturn Cloud: permission denied" }, { "text": "Problem description:\nGetting an error using <git clone git@github.com:alexeygrigorev/clothing-dataset-small.git>\nThe error:\nCloning into 'clothing-dataset'...\nHost key verification failed.\nfatal: Could not read from remote repository.\nPlease make sure you have the correct access rights\nand the repository exists.\nSolution description:\nwhen cloning the repo, you can also chose https - then it should work. 
This happens when you don't have your ssh key configured.\n<git clone https://github.com/alexeygrigorev/clothing-dataset-small.git>\nAdded by Gregory Morris", "section": "8. Neural Networks and Deep Learning", "question": "Host key verification failed." }, { "text": "Problem description\nThe accuracy and the loss are both still the same or nearly the same while training.\nSolution description\nIn the homework, you should set class_mode='binary' while reading the data.\nAlso, problem occurs when you choose the wrong optimizer, batch size, or learning rate\nAdded by Ekaterina Kutovaia", "section": "8. Neural Networks and Deep Learning", "question": "The same accuracy on epochs" }, { "text": "Problem:\nWhen resuming training after augmentation, the loss skyrockets (1000+ during first epoch) and accuracy settles around 0.5 \u2013 i.e. the model becomes as good as a random coin flip.\nSolution:\nCheck that the augmented ImageDataGenerator still includes the option \u201crescale\u201d as specified in the preceding step.\nAdded by Konrad M\u00fchlberg", "section": "8. Neural Networks and Deep Learning", "question": "Model breaking after augmentation \u2013 high loss + bad accuracy" }, { "text": "While doing:\nimport tensorflow as tf\nfrom tensorflow import keras\nmodel = tf.keras.models.load_model('model_saved.h5')\nIf you get an error message like this:\nValueError: The channel dimension of the inputs should be defined. The input_shape received is (None, None, None, None), where axis -1 (0-based) is the channel dimension, which found to be `None`.\nSolution:\nSaving a model (either yourself via model.save() or via checkpoint when save_weights_only = False) saves two things: The trained model weights (for example the best weights found during training) and the model architecture. If the number of channels is not explicitly specified in the Input layer of the model, and is instead defined as a variable, the model architecture will not have the value in the variable stored. Therefore when the model is reloaded, it will complain about not knowing the number of channels. See the code below, in the first line, you need to specify number of channels explicitly:\n# model architecture:\ninputs = keras.Input(shape=(input_size, input_size, 3))\nbase = base_model(inputs, training=False)\nvectors = keras.layers.GlobalAveragePooling2D()(base)\ninner = keras.layers.Dense(size_inner, activation='relu')(vectors)\ndrop = keras.layers.Dropout(droprate)(inner)\noutputs = keras.layers.Dense(10)(drop)\nmodel = keras.Model(inputs, outputs)\n(Memoona Tahira)", "section": "8. Neural Networks and Deep Learning", "question": "Missing channel value error while reloading model:" }, { "text": "Problem:\nA dataset for homework is in a zipped folder. If you unzip it within a jupyter notebook by means of ! unzip command, you\u2019ll see a huge amount of output messages about unzipping of each image. So you need to suppress this output\nSolution:\nExecute the next cell:\n%%capture\n! unzip zipped_folder_name.zip -d destination_folder_name\nAdded by Alena Kniazeva\nInside a Jupyter Notebook:\nimport zipfile\nlocal_zip = 'data.zip'\nzip_ref = zipfile.ZipFile(local_zip, 'r')\nzip_ref.extractall('data')\nzip_ref.close()", "section": "8. Neural Networks and Deep Learning", "question": "How to unzip a folder with an image dataset and suppress output?" }, { "text": "Problem:\nWhen we run train_gen.flow_from_directory() as in video 8.5, it finds images belonging to 10 classes. 
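For the "model breaking after augmentation" answer above, a sketch of an augmented generator that keeps rescale (the directory paths and augmentation values are placeholders):
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# the augmented generator must keep rescale=1./255; otherwise the model
# suddenly sees pixel values in [0, 255] and the loss explodes
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=30,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

# validation data is only rescaled, never augmented
val_gen = ImageDataGenerator(rescale=1.0 / 255)

train_ds = train_gen.flow_from_directory(
    "data/train", target_size=(150, 150), batch_size=20, class_mode="binary"
)
val_ds = val_gen.flow_from_directory(
    "data/validation", target_size=(150, 150), batch_size=20, class_mode="binary"
)
```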
Does it understand the names of classes from the names of folders? Or, there is already something going on deep behind?\nSolution:\nThe name of class is the folder name\nIf you just create some random folder with the name \"xyz\", it will also be considered as a class!! The name itself is saying flow_from_directory\na clear explanation below:\nhttps://vijayabhaskar96.medium.com/tutorial-image-classification-with-keras-flow-from-directory-and-generators-95f75ebe5720\nAdded by Bhaskar Sarma", "section": "8. Neural Networks and Deep Learning", "question": "How keras flow_from_directory know the names of classes in images?" }, { "text": "Problem:\nI created a new environment in SaturnCloud and chose the image corresponding to Saturn with Tensorflow, but when I tried to fit the model it showed an error about the missing module: scipy\nSolution:\nInstall the module in a new cell: !pip install scipy\nRestart the kernel and fit the model again\nAdded by Erick Calderin", "section": "8. Neural Networks and Deep Learning", "question": "Error with scipy missing module in SaturnCloud" }, { "text": "The command to read folders in the dataset in the tensorflow source code is:\nfor subdir in sorted(os.listdir(directory)):\n\u2026\nReference: https://github.com/keras-team/keras/blob/master/keras/preprocessing/image.py, line 563\nThis means folders will be read in alphabetical order. For example, in the case of a folder named dino, and another named dragon, dino will read first and will have class label 0, whereas dragon will be read in next and will have class label 1.\nWhen a Keras model predicts binary labels, it will only return one value, and this is the probability of class 1 in case of sigmoid activation function in the last dense layer with 2 neurons. The probability of class 0 can be found out by:\nprob(class(0)) = 1- prob(class(1))\nIn case of using from_logits to get results, you will get two values for each of the labels.\nA prediction of 0.8 is saying the probability that the image has class label 1 (in this case dragon), is 0.8, and conversely we can infer the probability that the image has class label 0 is 0.2.\n(Added by Memoona Tahira)", "section": "8. Neural Networks and Deep Learning", "question": "How are numeric class labels determined in flow_from_directroy using binary class mode and what is meant by the single probability predicted by a binary Keras model:" }, { "text": "It's fine, some small changes are expected\nAlexey Grigorev", "section": "8. Neural Networks and Deep Learning", "question": "Does the actual values matter after predicting with a neural network or it should be treated as like hood of falling in a class?" }, { "text": "Problem:\nI found running the wasp/bee model on my mac laptop had higher reported accuracy and lower std deviation than the HW answers. This may be because of the SGD optimizer. Running this on my mac printed a message about a new and legacy version that could be used.\nSolution:\nTry running the same code on google collab or another way. The answers were closer for me on collab. Another tip is to change the runtime to use T4 and the model run\u2019s faster than just CPU\nAdded by Quinn Avila", "section": "8. Neural Networks and Deep Learning", "question": "What if your accuracy and std training loss don\u2019t match HW?" }, { "text": "When running \u201cmodel.fit(...)\u201d an additional parameter \u201cworkers\u201d can be specified for speeding up the data loading/generation. The default value is \u201c1\u201d. 
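A short sketch of the binary class-label logic described above, using a stand-in model with a single sigmoid output (not the homework architecture):
```python
import numpy as np
from tensorflow import keras

# stand-in binary model ending in one sigmoid unit
model = keras.Sequential([
    keras.Input(shape=(150, 150, 3)),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),
])

x = np.random.rand(1, 150, 150, 3).astype("float32")  # placeholder image batch
pred = model.predict(x)            # shape (1, 1): probability of class 1

prob_class_1 = float(pred[0, 0])   # e.g. "dragon" if folders sort as dino, dragon
prob_class_0 = 1.0 - prob_class_1  # e.g. "dino"
print(prob_class_1, prob_class_0)
```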
Try out which value between 1 and the CPU count on your system performs best.\nhttps://www.tensorflow.org/api_docs/python/tf/keras/Model#fit\nAdded by Sylvia Schmitt", "section": "8. Neural Networks and Deep Learning", "question": "Using multi-threading for data generation in \u201cmodel.fit()\u201d" }, { "text": "Reproducibility for training runs can be achieved following these instructions:\nhttps://www.tensorflow.org/versions/r2.8/api_docs/python/tf/config/experimental/enable_op_determinism\nseed = 1234\ntf.keras.utils.set_random_seed(seed)\ntf.config.experimental.enable_op_determinism()\nThis will work for a script that gets executed multiple times.\nAdded by Sylvia Schmitt", "section": "8. Neural Networks and Deep Learning", "question": "Reproducibility with TensorFlow using a seed point" }, { "text": "PyTorch is also a deep learning framework that lets you do tasks equivalent to those in Keras. Here is a tutorial to create a CNN from scratch using PyTorch:\nhttps://blog.paperspace.com/writing-cnns-from-scratch-in-pytorch/\nThe functions have similar goals; the syntax can be slightly different. For the lessons and the homework we use Keras, but feel free to make a pull request with the PyTorch equivalent for the lessons and homework!\nM\u00e9lanie Fouesnard", "section": "8. Neural Networks and Deep Learning", "question": "Can we use pytorch for this lesson/homework ?" }, { "text": "If, while training a Keras model, you get the error \u201cFailed to find data adapter that can handle input: <class 'keras.src.preprocessing.image.ImageDataGenerator'>, <class 'NoneType'>\u201d, you may have unintentionally passed the image generator instead of the dataset to the model:\ntrain_gen = ImageDataGenerator(rescale=1./255)\ntrain_ds = train_gen.flow_from_directory(\u2026)\nhistory_after_augmentation = model.fit(\ntrain_gen, # this should be train_ds!!!\nepochs=10,\nvalidation_data=test_gen # this should be test_ds!!!\n)\nThe fix is simple and probably obvious once pointed out: use the training and validation datasets (train_ds and val_ds) returned from flow_from_directory.\nAdded by Tzvi Friedman", "section": "8. Neural Networks and Deep Learning", "question": "Keras model training fails with \u201cFailed to find data adapter\u201d" }, { "text": "The command \u2018nvidia-smi\u2019 has a built-in option which re-runs it every N seconds, without the need to use the command \u2018watch\u2019.\nnvidia-smi -l <N seconds>\nThe following command will run \u2018nvidia-smi\u2019 every 2 seconds until interrupted using CTRL+C.\nnvidia-smi -l 2\nAdded by Sylvia Schmitt", "section": "8. Neural Networks and Deep Learning", "question": "Running \u2018nvidia-smi\u2019 in a loop without using \u2018watch\u2019" }, { "text": "The Python package \u2018nvitop\u2019 is an interactive GPU process viewer similar to \u2018htop\u2019 for CPU.\nhttps://pypi.org/project/nvitop/\nAdded by Sylvia Schmitt", "section": "8. 
Neural Networks and Deep Learning", "question": "Checking GPU and CPU utilization using \u2018nvitop\u2019" }, { "text": "Let\u2019s say we define our Conv2d layer like this:\n>> tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(150, 150, 3))\nIt means our input image is RGB (3 channels, 150 by 150 pixels), kernel is 3x3 and number of filters (layer\u2019s width) is 32.\nIf we check model.summary() we will get this:\n_________________________________________________________________\nLayer (type) Output Shape Param #\n=================================================================\nconv2d (Conv2D) (None, 148, 148, 32) 896\nSo where does 896 params come from? It\u2019s computed like this:\n>>> (3*3*3 +1) * 32\n896\n# 3x3 kernel, 3 channels RGB, +1 for bias, 32 filters\nWhat about the number of \u201cfeatures\u201d we get after the Flatten layer?\nFor our homework model.summary() for last MaxPooling2d and Flatten layers looked like this:\n_________________________________________________________________\nLayer (type) Output Shape Param #\n=================================================================\nmax_pooling2d_3 (None, 7, 7, 128) 0\nflatten (Flatten) (None, 6272) 0\nSo where do 6272 vectors come from? It\u2019s computed like this:\n>>> 7*7*128\n6272\n# 7x7 \u201cimage shape\u201d after several convolutions and poolings, 128 filters\nAdded by Andrii Larkin", "section": "8. Neural Networks and Deep Learning", "question": "Q: Where does the number of Conv2d layer\u2019s params come from? Where does the number of \u201cfeatures\u201d we get after the Flatten layer come from?" }, { "text": "It\u2019s quite useful to understand that all types of models in the course are a plain stack of layers where each layer has exactly one input tensor and one output tensor (Sequential model TF page, Sequential class).\nYou can simply start from an \u201cempty\u201d model and add more and more layers in a sequential order.\nThis mode is called \u201cSequential Model API\u201d (easier)\nIn Alexey\u2019s videos it is implemented as chained calls of different entities (\u201cinputs\u201d,\u201cbase\u201d, \u201cvectors\u201d, \u201coutputs\u201d) in a more advanced mode \u201cFunctional Model API\u201d.\nMaybe a more complicated way makes sense when you do Transfer Learning and want to separate \u201cBase\u201d model vs. rest, but in the HW you need to recreate the full model from scratch \u21d2 I believe it is easier to work with a sequence of \u201csimilar\u201d layers.\nYou can read more about it in this TF2 tutorial.\nA really useful Sequential model example is shared in the Kaggle\u2019s \u201cBee or Wasp\u201d dataset folder with code: notebook\nAdded by Ivan Brigida\nFresh Run on Neural Nets\nWhile correcting an error on neural net architecture, it is advised to do fresh run by restarting kernel, else the model learns on top of previous runs.\nAdded by Abhijit Chakraborty", "section": "8. Neural Networks and Deep Learning", "question": "Sequential vs. Functional Model Modes in Keras (TF2)" }, { "text": "I found this code snippet fixed my OOM errors, as I have an Nvidia GPU. Can't speak to OOM errors on CPU, though.\nhttps://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth\n```\nphysical_devices = tf.configlist_physical_devices('GPU')\ntry:\ntf.config.experimental.set_memory_growth(physical_devices[0],True)\nexcept:\n# Invalid device or cannot modify virtual devices once initialized.\npass\n```", "section": "8. 
Neural Networks and Deep Learning", "question": "Out of memory errors when running tensorflow" }, { "text": "When training the models, in the fit function, you can specify the number of workers/threads.\nThe number of threads apparently also works for GPUs, and came very handy in google colab for the T4 GPU, since it was very very slow, and workers default value is 1.\nI changed the workers variable to 2560, following this thread in stackoverflow. I am using the free T4 GPU. (https://stackoverflow.com/questions/68208398/how-to-find-the-number-of-cores-in-google-colabs-gpu)\nAdded by Ibai Irastorza", "section": "8. Neural Networks and Deep Learning", "question": "Model training very slow in google colab with T4 GPU" }, { "text": "From the keras documentation:\nDeprecated: tf.keras.preprocessing.image.ImageDataGenerator is not recommended for new code. Prefer loading images with tf.keras.utils.image_dataset_from_directory and transforming the output tf.data.Dataset with preprocessing layers. For more information, see the tutorials for loading images and augmenting images, as well as the preprocessing layer guide.\nHrithik Kumar Advani", "section": "9. Serverless Deep Learning", "question": "Using image_dataset_from_directory instead of ImageDataGeneratorn for loading images" }, { "text": "TODO", "section": "9. Serverless Deep Learning", "question": "How to get started with Week 9?" }, { "text": "The week 9 uses a link to github to fetch the models.\nThe original link was moved to here:\nhttps://github.com/DataTalksClub/machine-learning-zoomcamp/releases", "section": "9. Serverless Deep Learning", "question": "Where is the model for week 9?" }, { "text": "Solution description\nIn the unit 9.6, Alexey ran the command echo ${REMOTE_URI} which turned the URI address in the terminal. There workaround is to set a local variable (REMOTE_URI) and assign your URI address in the terminal and use it to login the registry, for instance, REMOTE_URI=2278222782.dkr.ecr.ap-south-1.amazonaws.com/clothing-tflite-images (fake address). One caveat is that you will lose this variable once the session is terminated.\nI also had the same problem on Ubuntu terminal. I executed the following two commands:\n$ export REMOTE_URI=1111111111.dkr.ecr.us-west-1.amazonaws.com/clothing-tflite-images:clothing-model-xception-v4-001\n$ echo $REMOTE_URI\n111111111.dkr.ecr.us-west-1.amazonaws.com/clothing-tflite-images:clothing-model-xception-v4-001\nNote: 1. no curly brackets (e.g. echo ${REMOTE_URI}) needed unlike in video 9.6,\n2. Replace REMOTE_URI with your URI\n(Bhaskar Sarma)", "section": "9. Serverless Deep Learning", "question": "Executing the command echo ${REMOTE_URI} returns nothing." }, { "text": "The command aws ecr get-login --no-include-email returns an invalid choice error:\nThe solution is to use the following command instead: aws ecr get-login-password\nCould simplify the login process with, just replace the <ACCOUNT_NUMBER> and <REGION> with your values:\nexport PASSWORD=`aws ecr get-login-password`\ndocker login -u AWS -p $PASSWORD <ACCOUNT_NUMBER>.dkr.ecr.<REGION>.amazonaws.com/clothing-tflite-images\nAdded by Martin Uribe", "section": "9. Serverless Deep Learning", "question": "Getting a syntax error while trying to get the password from aws-cli" }, { "text": "We can use the keras.models.Sequential() function to pass many parameters of the cnn at once.\nKrishna Anand", "section": "9. 
Serverless Deep Learning", "question": "Pass many parameters in the model at once" }, { "text": "This error is produced sometimes when building your docker image from the Amazon python base image.\nSolution description: The following could solve the problem.\nUpdate your docker desktop if you haven\u2019t done so.\nOr restart docker desktop and terminal and then build the image all over again.\nOr if all else fails, first run the following command: DOCKER_BUILDKIT=0 docker build . then build your image.\n(optional) Added by Odimegwu David", "section": "9. Serverless Deep Learning", "question": "Getting ERROR [internal] load metadata for public.ecr.aws/lambda/python:3.8" }, { "text": "When trying to run the command !ls -lh in windows jupyter notebook , I was getting an error message that says \u201c'ls' is not recognized as an internal or external command,operable program or batch file.\nSolution description :\nInstead of !ls -lh , you can use this command !dir , and you will get similar output\nAsia Saeed", "section": "9. Serverless Deep Learning", "question": "Problem: 'ls' is not recognized as an internal or external command, operable program or batch file." }, { "text": "When I run import tflite_runtime.interpreter as tflite , I get an error message says \u201cImportError: generic_type: type \"InterpreterWrapper\" is already registered!\u201d\nSolution description\nThis error occurs when you import both tensorflow and tflite_runtime.interpreter \u201cimport tensorflow as tf\u201d and \u201cimport tflite_runtime.interpreter as tflite\u201d in the same notebook. To fix the issue, restart the kernel and import only tflite_runtime.interpreter \" import tflite_runtime.interpreter as tflite\".\nAsia Saeed", "section": "9. Serverless Deep Learning", "question": "ImportError: generic_type: type \"InterpreterWrapper\" is already registered!" }, { "text": "Problem description:\nIn command line try to do $ docker build -t dino_dragon\ngot this Using default tag: latest\n[2022-11-24T06:48:47.360149000Z][docker-credential-desktop][W] Windows version might not be up-to-date: The system cannot find the file specified.\nerror during connect: This error may indicate that the docker daemon is not running.: Post\n.\nSolution description:\nYou need to make sure that Docker is not stopped by a third-party program.\nAndrei Ilin", "section": "9. Serverless Deep Learning", "question": "Windows version might not be up-to-date" }, { "text": "When running docker build -t dino-dragon-model it returns the above error\nThe most common source of this error in this week is because Alex video shows a version of the wheel with python 8, we need to find a wheel with the version that we are working on. In this case python 9. Another common error is to copy the link, this will also produce the same error, we need to download the raw format:\nhttps://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.7.0-cp39-cp39-linux_x86_64.whl\nPastor Soto", "section": "9. Serverless Deep Learning", "question": "WARNING: You are using pip version 22.0.4; however, version 22.3.1 is available" }, { "text": "Problem description:\nIn video 9.6, after installing aswcli, we should configure it with aws configure . There it asks for Access Key ID, Secret Access Key, Default Region Name and also Default output format. What we should put for Default output format? 
Leaving it as None is okay?\nSolution description:\nYes, in my case I left everything as the provided defaults (except, obviously, the Access key and the secret access key)\nAdded by Bhaskar Sarma", "section": "9. Serverless Deep Learning", "question": "How to do AWS configure after installing awscli" }, { "text": "Problem:\nWhile passing local testing of the lambda function without issues, trying to test the same input with a running docker instance results in an error message like\n{‘errorMessage’: ‘Unable to marshal response: Object of type float32 is not JSON serializable’, ‘errorType’: ‘Runtime.MarshalError’, ‘requestId’: ‘f155492c-9af2-4d04-b5a4-639548b7c7ac’, ‘stackTrace’: []}\nThis happens when a model (in this case the dino vs dragon model) returns individual estimation values as numpy float32 values (arrays). They need to be converted individually to base-Python floats in order to become “serializable”.\nSolution:\nIn my particular case, I set up the dino vs dragon model in such a way as to return a label + predicted probability for each class as follows (below is a two-line extract of the function predict() in lambda_function.py):\npreds = [interpreter.get_tensor(output_index)[0][0], \\\n1-interpreter.get_tensor(output_index)[0][0]]\nIn which case the above described solution will look like this:\npreds = [float(interpreter.get_tensor(output_index)[0][0]), \\\nfloat(1-interpreter.get_tensor(output_index)[0][0])]\nThe rest can be made to work by following the chapter 9 (and/or chapter 5!) lecture videos step by step.\nAdded by Konrad Muehlberg", "section": "9. Serverless Deep Learning", "question": "Object of type float32 is not JSON serializable" }, { "text": "I had this error when running the line interpreter.set_tensor(input_index, X) that can be seen in video 9.3 around the 12-minute mark.\nValueError: Cannot set tensor: Got value of type UINT8 but expected type FLOAT32 for input 0, name: serving_default_conv2d_input:0\nThis is because X is an int but a float is expected.\nSolution:\nI found this solution in this question: https://stackoverflow.com/questions/76102508/valueerror-cannot-set-tensor-got-value-of-type-float64-but-expected-type-float\n# Need to convert to float32 before set_tensor\nX = np.float32(X)\nThen it works. I work with tensorflow 2.15.0; maybe this change is needed because this version is more recent?\nAdded by Mélanie Fouesnard", "section": "9. Serverless Deep Learning", "question": "Error with the line “interpreter.set_tensor(input_index, X)”" }, { "text": "To check your file size using the powershell terminal, you can run the following commands:\n$FilePath = \"path_to_file\"\n$FileSize = (Get-Item -Path $FilePath).Length\nNow you can check the size of your file, for example in MB:\nWrite-Host \"MB:\" ($FileSize/1MB)\nSource: https://www.sharepointdiary.com/2020/10/powershell-get-file-size.html#:~:text=To%20get%20the%20size%20of,the%20file%2C%20including%20its%20size.\nAdded by Mélanie Fouesnard", "section": "9. Serverless Deep Learning", "question": "How to easily get file size in powershell terminal ?" 
}, { "text": "I wanted to understand how lambda container images work in depth and how lambda functions are initialized, for this reason, I found the following documentation\nhttps://docs.aws.amazon.com/lambda/latest/dg/images-create.html\nhttps://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html\nAdded by Alejandro aponte", "section": "9. Serverless Deep Learning", "question": "How do Lambda container images work?" }, { "text": "The docker image for aws lambda can be created and pushed to aws ecr and the same can be exposed as a REST API through APIGatewayService in a single go using AWS Serverless Framework. Refer the below article for a detailed walkthrough.\nhttps://medium.com/hoonio/deploy-containerized-serverless-flask-to-aws-lambda-c0eb87c1404d\nAdded by Sumeet Lalla", "section": "9. Serverless Deep Learning", "question": "How to use AWS Serverless Framework to deploy on AWS Lambda and expose it as REST API through APIGatewayService?" }, { "text": "Problem:\nWhile trying to build docker image in Section 9.5 with the command:\ndocker build -t clothing-model .\nIt throws a pip install error for the tflite runtime whl\nERROR: failed to solve: process \"/bin/sh -c pip install https://github.com/alexeygrigorev/tflite-aws-lambda/blob/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl\" did not complete successfully: exit code: 1\nTry to use this link: https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl\nIf the link above does not work:\nThe problem is because of the arm architecture of the M1. You will need to run the code on a PC or Ubuntu OS.\nOr try the code bellow.\nAdded by Dashel Ruiz Perez\nSolution:\nTo build the Docker image, use the command:\ndocker build --platform linux/amd64 -t clothing-model .\nTo run the built image, use the command:\ndocker run -it --rm -p 8080:8080 --platform linux/amd64 clothing-model:latest\nAdded by Daniel Egbo", "section": "9. Serverless Deep Learning", "question": "Error building docker image on M1 Mac" }, { "text": "Problem: Trying to test API gateway in 9.7 - API Gateway: Exposing the Lambda Function, running: $ python test.py\nWith error message:\n{'message': 'Missing Authentication Token'}\nSolution:\nNeed to get the deployed API URL for the specific path you are invoking. Example:\nhttps://<random string>.execute-api.us-east-2.amazonaws.com/test/predict\nAdded by Andrew Katoch", "section": "9. Serverless Deep Learning", "question": "Error invoking API Gateway deploy API locally" }, { "text": "Problem: When trying to install tflite_runtime with\n!pip install --extra-index-url https://google-coral.github.io/py-repo/ tflite_runtime\none gets an error message above.\nSolution:\nfflite_runtime is only available for the os-python version combinations that can be found here: https://google-coral.github.io/py-repo/tflite-runtime/\nyour combination must be missing here\nyou can see if any of these work for you https://github.com/alexeygrigorev/tflite-aws-lambda/tree/main/tflite\nand install the needed one using pip\neg\npip install https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.7.0-cp38-cp38-linux_x86_64.whl\nas it is done in the lectures code:\nhttps://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/09-serverless/code/Dockerfile#L4\nAlternatively, use a virtual machine (with VM VirtualBox, for example) with a Linux system. 
The other way is to run the code on a virtual machine within a cloud service; for example, you can use Vertex AI Workbench on GCP (notebooks and terminals are provided there, so all tasks may be performed).\nAdded by Alena Kniazeva, modified by Alex Litvinov", "section": "9. Serverless Deep Learning", "question": "Error: Could not find a version that satisfies the requirement tflite_runtime (from versions:none)" }, { "text": "docker: Error response from daemon: mkdir /var/lib/docker/overlay2/37be849565da96ac3fce34ee9eb2215bd6cd7899a63ebc0ace481fd735c4cb0e-init: read-only file system.\nYou need to restart the docker services to get rid of the above error\nKrishna Anand", "section": "9. Serverless Deep Learning", "question": "Docker run error" }, { "text": "The docker image can be saved/exported to tar format on the local machine using the command below:\ndocker image save <image-name> -o <name-of-tar-file.tar>\nThe individual layers of the docker image filesystem content can be viewed by extracting the layer.tar present in the <name-of-tar-file.tar> created above.\nSumeet Lalla", "section": "9. Serverless Deep Learning", "question": "Save Docker Image to local machine and view contents" }, { "text": "On VS Code running a Jupyter notebook: after I ran ‘pip install pillow’, my notebook did not recognize the import, for example from PIL import Image. After restarting the Jupyter notebook the imports worked.\nQuinn Avila", "section": "9. Serverless Deep Learning", "question": "Jupyter notebook not seeing package" }, { "text": "Due to experimenting back and forth so much without care for storage, I just ran out of it on my 30-GB AWS instance. It turns out that deleting docker images does not actually free up any space as you might expect. After removing images, you also need to run docker system prune", "section": "9. Serverless Deep Learning", "question": "Running out of space for AWS instance." }, { "text": "Using the 2.14 version with Python 3.11 works fine.\nIn case it doesn’t work, I tried the tensorflow 2.4.4 whl; however, make sure to run it on top of a supported Python version like 3.8, otherwise there will be issues installing tf==2.4.4\nAdded by Abhijit Chakraborty", "section": "9. Serverless Deep Learning", "question": "Using Tensorflow 2.15 for AWS deployment" }, { "text": "see here", "section": "9. 
Serverless Deep Learning", "question": "Command aws ecr get-login --no-include-email returns \u201caws: error: argument operation: Invalid choice\u2026\u201d" }, { "text": "Sign in to the AWS Console: Log in to the AWS Console.\nNavigate to IAM: Go to the IAM service by clicking on \"Services\" in the top left corner and selecting \"IAM\" under the \"Security, Identity, & Compliance\" section.\nCreate a new policy: In the left navigation pane, select \"Policies\" and click on \"Create policy.\"\nSelect the service and actions:\nClick on \"JSON\" and copy and paste the JSON policy you provided earlier for the specific ECR actions.\nReview and create the policy:\nClick on \"Review policy.\"\nProvide a name and description for the policy.\nClick on \"Create policy.\"\nJSON policy:\n{\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n{\n\"Sid\": \"VisualEditor0\",\n\"Effect\": \"Allow\",\n\"Action\": [\n\"ecr:CreateRepository\",\n\"ecr:GetAuthorizationToken\",\n\"ecr:BatchCheckLayerAvailability\",\n\"ecr:BatchGetImage\",\n\"ecr:InitiateLayerUpload\",\n\"ecr:UploadLayerPart\",\n\"ecr:CompleteLayerUpload\",\n\"ecr:PutImage\"\n],\n\"Resource\": \"*\"\n}\n]\n}\nAdded by: Daniel Mu\u00f1oz-Viveros\nERROR: failed to solve: public.ecr.aws/lambda/python:3.10: error getting credentials - err: exec: \"docker-credential-desktop.exe\": executable file not found in $PATH, out: ``\n(WSL2 system)\nSolved: Delete the file ~/.docker/config.json\nYishan Zhan", "section": "9. Serverless Deep Learning", "question": "What IAM permission policy is needed to complete Week 9: Serverless?" }, { "text": "Add the next lines to vim /etc/docker/daemon.json\n{\n\"dns\": [\"8.8.8.8\", \"8.8.4.4\"]\n}\nThen, restart docker: sudo service docker restart\nIbai Irastorza", "section": "9. Serverless Deep Learning", "question": "Docker Temporary failure in name resolution" }, { "text": "Solution: add compile = False to the load_model function\nkeras.models.load_model('model_name.h5', compile=False)\nNadia Paz", "section": "9. Serverless Deep Learning", "question": "Keras model *.h5 doesn\u2019t load. 
Error: weight_decay is not a valid argument, kwargs should be empty for `optimizer_experimental.Optimizer`" }, { "text": "This deployment setup can be tested locally using AWS RIE (runtime interface emulator).\nBasically, if your Docker image was built upon the base AWS Lambda image (FROM public.ecr.aws/lambda/python:3.10) - just use certain ports for “docker run” and a certain “localhost link” for testing:\ndocker run -it --rm -p 9000:8080 name\nThis command runs the image as a container and starts up an endpoint locally at:\nlocalhost:9000/2015-03-31/functions/function/invocations\nPost an event to the following endpoint using a curl command:\ncurl -XPOST \"http://localhost:9000/2015-03-31/functions/function/invocations\" -d '{}'\nExamples of curl testing:\n* windows testing:\ncurl -XPOST \"http://localhost:9000/2015-03-31/functions/function/invocations\" -d \"{\\\"url\\\": \\\"https://habrastorage.org/webt/rt/d9/dh/rtd9dhsmhwrdezeldzoqgijdg8a.jpeg\\\"}\"\n* unix testing:\ncurl -XPOST \"http://localhost:9000/2015-03-31/functions/function/invocations\" -d '{\"url\": \"https://habrastorage.org/webt/rt/d9/dh/rtd9dhsmhwrdezeldzoqgijdg8a.jpeg\"}'\nIf during testing you encounter an error like this:\n# {\"errorMessage\": \"Unable to marshal response: Object of type float32 is not JSON serializable\", \"errorType\": \"Runtime.MarshalError\", \"requestId\": \"7ea5d17a-e0a2-48d5-b747-a16fc530ed10\", \"stackTrace\": []}\njust convert the response in lambda_handler() to a string - str(result).\nAdded by Andrii Larkin", "section": "9. Serverless Deep Learning", "question": "How to test AWS Lambda + Docker locally?" }, { "text": "Make sure the code in test.py doesn't have any dependencies on the tensorflow library. One of the most common reasons for this error is that tflite is still imported from tensorflow. Change import tensorflow.lite as tflite to import tflite_runtime.interpreter as tflite\nAdded by Ryan Pramana", "section": "9. Serverless Deep Learning", "question": "\"Unable to import module 'lambda_function': No module named 'tensorflow'\" when running python test.py" }, { "text": "I’ve tried to do everything in Google Colab. Here is a way to work with Docker in Google Colab:\nhttps://gist.github.com/mwufi/6718b30761cd109f9aff04c5144eb885\n%%shell\npip install udocker\nudocker --allow-root install\n!udocker --allow-root run hello-world\nAdded by Ivan Brigida\nLambda API Gateway errors:\n`Authorization header requires 'Credential' parameter. Authorization header requires 'Signature' parameter. Authorization header requires 'SignedHeaders' parameter. Authorization header requires existence of either a 'X-Amz-Date' or a 'Date' header.`\n`Missing Authentication Token`\nimport boto3\nclient = boto3.client('apigateway')\nresponse = client.test_invoke_method(\nrestApiId='your_rest_api_id',\nresourceId='your_resource_id',\nhttpMethod='POST',\npathWithQueryString='/test/predict', # depends on how you set up the api\nbody='{\"url\": \"https://habrastorage.org/webt/rt/d9/dh/rtd9dhsmhwrdezeldzoqgijdg8a.jpeg\"}'\n)\nprint(response['body'])\nYishan Zhan\nUnable to run pip install tflite_runtime from github wheel links?\nTo overcome this issue, you can download the whl file to your local project folder and in the Dockerfile add the following lines:\nCOPY <file-name> .\nRUN pip install <file-name>\nAbhijit Chakraborty", "section": "10. Kubernetes and TensorFlow Serving", "question": "Install Docker (udocker) in Google Colab" }, { "text": "TODO", "section": "10. 
Kubernetes and TensorFlow Serving", "question": "How to get started with Week 10?" }, { "text": "Running a CNN on your CPU can take a long time and once you\u2019ve run out of free time on some cloud providers, it\u2019s time to pay up. Both can be tackled by installing tensorflow with CUDA support on your local machine if you have the right hardware.\nI was able to get it working by using the following resources:\nCUDA on WSL :: CUDA Toolkit Documentation (nvidia.com)\nInstall TensorFlow with pip\nStart Locally | PyTorch\nI included the link to PyTorch so that you can get that one installed and working too while everything is fresh on your mind. Just select your options, and for Computer Platform, I chose CUDA 11.7 and it worked for me.\nAdded by Martin Uribe", "section": "10. Kubernetes and TensorFlow Serving", "question": "How to install Tensorflow in Ubuntu WSL2" }, { "text": "If you are running tensorflow on your own machine and you start getting the following errors:\nAllocator (GPU_0_bfc) ran out of memory trying to allocate 6.88GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.\nTry adding this code in a cell at the beginning of your notebook:\nconfig = tf.compat.v1.ConfigProto()\nconfig.gpu_options.allow_growth = True\nsession = tf.compat.v1.Session(config=config)\nAfter doing this most of my issues went away. I say most because there was one instance when I still got the error once more, but only during one epoch. I ran the code again, right after it finished, and I never saw the error again.\nAdded by Martin Uribe", "section": "10. Kubernetes and TensorFlow Serving", "question": "Getting: Allocator ran out of memory errors?" }, { "text": "In session 10.3, when creating the virtual environment with pipenv and trying to run the script gateway.py, you might get this error:\nTypeError: Descriptors cannot not be created directly.\nIf this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.\nIf you cannot immediately regenerate your protos, some other possible workarounds are:\n1. Downgrade the protobuf package to 3.20.x or lower.\n2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).\nMore information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates\nThis will happen if your version of protobuf is one of the newer ones. As a workaround, you can fix the protobuf version to an older one. In my case I got around the issue by creating the environment with:\npipenv install --python 3.9.13 requests grpcio==1.42.0 flask gunicorn \\\nkeras-image-helper tensorflow-protobuf==2.7.0 protobuf==3.19.6\nAdded by \u00c1ngel de Vicente", "section": "10. Kubernetes and TensorFlow Serving", "question": "Problem with recent version of protobuf" }, { "text": "Due to the uncertainties associated with machines, sometimes you can get the error message like this when you try to run a docker command:\n\u201dCannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\u201d\nSolution: The solution is simple. The Docker Desktop might no longer be connecting to the WSL Linux distro. What you need to do is go to your Docker Desktop setting and then click on resources. Under resources, click on WSL Integration. You will get a tab like the image below:\nJust enable additional distros. That\u2019s all. 
Even if the additional distro is the same as the default WSL distro.\nOdimegwu David", "section": "10. Kubernetes and TensorFlow Serving", "question": "WSL Cannot Connect To Docker Daemon" }, { "text": "In case the HPA instance does not run correctly even after installing the latest version of Metrics Server from the components.yaml manifest with:\n>>kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml\nAnd the targets still appear as <unknown>\nRun >>kubectl edit deploy -n kube-system metrics-server\nAnd search for this line:\nargs:\n- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname\nAdd this line in the middle: - --kubelet-insecure-tls\nSo that it stays like this:\nargs:\n- --kubelet-insecure-tls\n- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname\nSave and run again >>kubectl get hpa\nAdded by Marilina Orihuela", "section": "10. Kubernetes and TensorFlow Serving", "question": "HPA instance doesn’t run properly" }, { "text": "In case the HPA instance does not run correctly even after installing the latest version of Metrics Server from the components.yaml manifest with:\n>>kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml\nAnd the targets still appear as <unknown>\nRun the following command:\nkubectl apply -f https://raw.githubusercontent.com/Peco602/ml-zoomcamp/main/10-kubernetes/kube-config/metrics-server-deployment.yaml\nWhich uses a metrics server deployment file that already embeds the - --kubelet-insecure-tls option.\nAdded by Giovanni Pecoraro", "section": "10. Kubernetes and TensorFlow Serving", "question": "HPA instance doesn’t run properly (easier solution)" }, { "text": "When I ran pip install grpcio==1.42.0 tensorflow-serving-api==2.7.0 to install the libraries on a Windows machine, I was getting the error below:\nERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\\\Users\\\\Asia\\\\anaconda3\\\\Lib\\\\site-packages\\\\google\\\\protobuf\\\\internal\\\\_api_implementation.cp39-win_amd64.pyd'\nConsider using the `--user` option or check the permissions.\nSolution description :\nI was able to install the libraries using the command below:\npip install --user grpcio==1.42.0 tensorflow-serving-api==2.7.0\nAsia Saeed", "section": "10. 
Kubernetes and TensorFlow Serving", "question": "Could not install packages due to an OSError: [WinError 5] Access is denied" }, { "text": "Problem description\nI was getting the below error message when I run gateway.py after modifying the code & creating virtual environment in video 10.3 :\nFile \"C:\\Users\\Asia\\Data_Science_Code\\Zoompcamp\\Kubernetes\\gat.py\", line 9, in <module>\nfrom tensorflow_serving.apis import predict_pb2\nFile \"C:\\Users\\Asia\\.virtualenvs\\Kubernetes-Ge6Ts1D5\\lib\\site-packages\\tensorflow_serving\\apis\\predict_pb2.py\", line 14, in <module>\nfrom tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2\nFile \"C:\\Users\\Asia\\.virtualenvs\\Kubernetes-Ge6Ts1D5\\lib\\site-packages\\tensorflow\\core\\framework\\tensor_pb2.py\", line 14, in <module>\nfrom tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2\nFile \"C:\\Users\\Asia\\.virtualenvs\\Kubernetes-Ge6Ts1D5\\lib\\site-packages\\tensorflow\\core\\framework\\resource_handle_pb2.py\", line 14, in <module>\nfrom tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2\nFile \"C:\\Users\\Asia\\.virtualenvs\\Kubernetes-Ge6Ts1D5\\lib\\site-packages\\tensorflow\\core\\framework\\tensor_shape_pb2.py\", line 36, in <module>\n_descriptor.FieldDescriptor(\nFile \"C:\\Users\\Asia\\.virtualenvs\\Kubernetes-Ge6Ts1D5\\lib\\site-packages\\google\\protobuf\\descriptor.py\", line 560, in __new__\n_message.Message._CheckCalledFromGeneratedFile()\nTypeError: Descriptors cannot not be created directly.\nIf this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.\nIf you cannot immediately regenerate your protos, some other possible workarounds are:\n1. Downgrade the protobuf package to 3.20.x or lower.\n2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).\nSolution description:\nIssue has been resolved by downgrading protobuf to version 3.20.1.\npipenv install protobuf==3.20.1\nAsia Saeed", "section": "10. Kubernetes and TensorFlow Serving", "question": "TypeError: Descriptors cannot not be created directly." }, { "text": "To install kubectl on windows using the terminal in vscode (powershell), I followed this tutorial: https://medium.com/@ggauravsigra/install-kubectl-on-windows-af77da2e6fff\nI first downloaded kubectl with curl, with these command lines: https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/#install-kubectl-binary-with-curl-on-windows\nAt step 3, I followed the tutorial with the copy of the exe file in a specific folder on C drive.\nThen I added this folder path to PATH in my environment variables.\nKind can be installed the same way with the curl command on windows, by specifying a folder that will be added to the path environment variable.\nAdded by M\u00e9lanie Fouesnard", "section": "10. Kubernetes and TensorFlow Serving", "question": "How to install easily kubectl on windows ?" 
}, { "text": "First you need to launch a powershell terminal with administrator privilege.\nFor this we need to install choco library first through the following syntax in powershell:\nSet-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))\nKrishna Anand", "section": "10. Kubernetes and TensorFlow Serving", "question": "Install kind through choco library" }, { "text": "If you are having challenges installing Kind through the Windows Powershell as provided on the website and Choco Library as I did, you can simply install Kind through Go.\n> Download and Install Go (https://go.dev/doc/install)\n> Confirm installation by typing the following in Command Prompt - go version\n> Proceed by installing Kind by following this command - go install sigs.k8s.io/kind@v0.20.0\n>Confirm Installation kind --version\nIt works perfectly.", "section": "10. Kubernetes and TensorFlow Serving", "question": "Install Kind via Go package" }, { "text": "I ran into an issue where kubectl wasn't working.\nI kept getting the following error:\nkubectl get service\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?\nI searched online for a resolution, but everyone kept talking about creating an environment variable and creating some admin.config file in my home directory.\nAll hogwash.\nThe solution to my problem was to just start over.\nkind delete cluster\nrm -rf ~/.kube\nkind create cluster\nNow when I try the same command again:\nkubectl get service\nNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\nkubernetes ClusterIP 10.96.0.1 <none> 443/TCP 53s\nAdded by Martin Uribe", "section": "10. Kubernetes and TensorFlow Serving", "question": "The connection to the server localhost:8080 was refused - did you specify the right host or port?" }, { "text": "Problem description\nDue to experimenting back and forth so much without care for storage, I just ran out of it on my 30-GB AWS instance.\nMy first reflex was to remove some zoomcamp directories, but of course those are mostly code so it didn\u2019t help much.\nSolution description\n> docker images\nrevealed that I had over 20 GBs worth of superseded / duplicate models lying around, so I proceeded to > docker rmi\na bunch of those \u2014 but to no avail!\nIt turns out that deleting docker images does not actually free up any space as you might expect. After removing images, you also need to run\n> docker system prune\nSee also: https://stackoverflow.com/questions/36799718/why-removing-docker-containers-and-images-does-not-free-up-storage-space-on-wind\nAdded by Konrad M\u00fchlberg", "section": "10. Kubernetes and TensorFlow Serving", "question": "Running out of storage after building many docker images" }, { "text": "Yes, the question does require for you to specify values for CPU and memory in the yaml file, however the question that it is use in the form only refers to the port which do have a define correct value for this specific homework.\nPastor Soto", "section": "10. Kubernetes and TensorFlow Serving", "question": "In HW10 Q6 what does it mean \u201ccorrect value for CPU and memory\u201d? Aren\u2019t they arbitrary?" }, { "text": "In Kubernetes resource specifications, such as CPU requests and limits, the \"m\" stands for milliCPU, which is a unit of computing power. 
It represents one thousandth of a CPU core.\ncpu: \"100m\" means the container is requesting 100 milliCPUs, which is equivalent to 0.1 CPU core.\ncpu: \"500m\" means the container has a CPU limit of 500 milliCPUs, which is equivalent to 0.5 CPU core.\nThese values are specified in milliCPUs to allow fine-grained control over CPU resources. It allows you to express CPU requirements and limits in a more granular way, especially in scenarios where your application might not need a full CPU core.\nAdded by Andrii Larkin", "section": "10. Kubernetes and TensorFlow Serving", "question": "Why cpu vals for Kubernetes deployment.yaml look like \u201c100m\u201d and \u201c500m\u201d? What does \"m\" mean?" }, { "text": "Problem: Failing to load docker-image to cluster (when you\u2019ved named a cluster)\nkind load docker-image zoomcamp-10-model:xception-v4-001\nERROR: no nodes found for cluster \"kind\"\nSolution: Specify cluster name with -n\nkind -n clothing-model load docker-image zoomcamp-10-model:xception-v4-001\nAndrew Katoch", "section": "10. Kubernetes and TensorFlow Serving", "question": "Kind cannot load docker image" }, { "text": "Problem: I download kind from the next command:\ncurl.exe -Lo kind-windows-amd64.exe https://kind.sigs.k8s.io/dl/v0.17.0/kind-windows-amd64\nWhen I try\nkind --version\nI get: 'kind' is not recognized as an internal or external command, operable program or batch file\nSolution: The default name of executable is kind-windows-amd64.exe, so that you have to rename this file to kind.exe. Put this file in specific folder, and add it to PATH\nAlejandro Aponte", "section": "10. Kubernetes and TensorFlow Serving", "question": "'kind' is not recognized as an internal or external command, operable program or batch file. (In Windows)" }, { "text": "Using kind with Rootless Docker or Rootless Podman requires some changes on the system (Linux), see kind \u2013 Rootless (k8s.io).\nSylvia Schmitt", "section": "10. Kubernetes and TensorFlow Serving", "question": "Running kind on Linux with Rootless Docker or Rootless Podman" }, { "text": "Deploy and Access the Kubernetes Dashboard\nLuke", "section": "10. Kubernetes and TensorFlow Serving", "question": "Kubernetes-dashboard" }, { "text": "Make sure you are on AWS CLI v2 (check with aws --version)\nhttps://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration-instructions.html", "section": "10. Kubernetes and TensorFlow Serving", "question": "Correct AWS CLI version for eksctl" }, { "text": "Problem Description:\nIn video 10.3, when I was testing a flask service, I got the above error. I ran docker run .. in one terminal. When in second terminal I run python gateway.py, I get the above error.\nSolution: This error has something to do with versions of Flask and Werkzeug. I got the same error, if I just import flask with from flask import Flask.\nBy running pip freeze > requirements.txt,I found that their versions are Flask==2.2.2 and Werkzeug==2.2.2. This error appears while using an old version of werkzeug (2.2.2) with new version of flask (2.2.2). I solved it by pinning version of Flask into an older version with pipenv install Flask==2.1.3.\nAdded by Bhaskar Sarma", "section": "10. 
Kubernetes and TensorFlow Serving", "question": "TypeError: __init__() got an unexpected keyword argument 'unbound_message' while importing Flask" }, { "text": "As per AWS documentation:\nhttps://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html\nYou need to do: (change the fields in red)\naws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com\nAlternatively you can run the following command without changing anything given you have a default region configured\naws ecr get-login-password --region $(aws configure get region) | docker login --username AWS --password-stdin \"$(aws sts get-caller-identity --query \"Account\" --output text).dkr.ecr.$(aws configure get region).amazonaws.com\"\nAdded by Humberto Rodriguez", "section": "10. Kubernetes and TensorFlow Serving", "question": "Command aws ecr get-login --no-include-email returns \u201caws: error: argument operation: Invalid choice\u2026\u201d" }, { "text": "While trying to run the docker code on M1:\ndocker run --platform linux/amd64 -it --rm \\\n-p 8500:8500 \\\n-v $(pwd)/clothing-model:/models/clothing-model/1 \\\n-e MODEL_NAME=\"clothing-model\" \\\ntensorflow/serving:2.7.0\nIt outputs the error:\nError:\nStatus: Downloaded newer image for tensorflow/serving:2.7.0\n[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/generated_message_reflection.cc:2345] CHECK failed: file != nullptr:\nterminate called after throwing an instance of 'google::protobuf::FatalException'\nwhat(): CHECK failed: file != nullptr:\nqemu: uncaught target signal 6 (Aborted) - core dumped\n/usr/bin/tf_serving_entrypoint.sh: line 3: 8 Aborted tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \"$@\"\nSolution\ndocker pull emacski/tensorflow-serving:latest\ndocker run -it --rm \\\n-p 8500:8500 \\\n-v $(pwd)/clothing-model:/models/clothing-model/1 \\\n-e MODEL_NAME=\"clothing-model\" \\\nemacski/tensorflow-serving:latest-linux_arm64\nSee more here: https://github.com/emacski/tensorflow-serving-arm\nAdded by Daniel Egbo", "section": "10. 
Kubernetes and TensorFlow Serving", "question": "Error downloading tensorflow/serving:2.7.0 on Apple M1 Mac" }, { "text": "Similar to the one above but with a different solution the main reason is that emacski doesn\u2019t seem to maintain the repo any more, the latest image is from 2 years ago at the time of writing (December 2023)\nProblem:\nWhile trying to run the docker code on Mac M2 apple silicon:\ndocker run --platform linux/amd64 -it --rm \\\n-p 8500:8500 \\\n-v $(pwd)/clothing-model:/models/clothing-model/1 \\\n-e MODEL_NAME=\"clothing-model\" \\\ntensorflow/serving\nYou get an error:\n/usr/bin/tf_serving_entrypoint.sh: line 3: 7 Illegal instruction tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \"$@\"\nSolution:\nUse bitnami/tensorflow-serving base image\nLaunch it either using docker run\ndocker run -d \\\n--name tf_serving \\\n-p 8500:8500 \\\n-p 8501:8501 \\\n-v $(pwd)/clothing-model:/bitnami/model-data/1 \\\n-e TENSORFLOW_SERVING_MODEL_NAME=clothing-model \\\nbitnami/tensorflow-serving:2\nOr the following docker-compose.yaml\nversion: '3'\nservices:\ntf_serving:\nimage: bitnami/tensorflow-serving:2\nvolumes:\n- ${PWD}/clothing-model:/bitnami/model-data/1\nports:\n- 8500:8500\n- 8501:8501\nenvironment:\n- TENSORFLOW_SERVING_MODEL_NAME=clothing-model\nAnd run it with\ndocker compose up\nAdded by Alex Litvinov", "section": "10. Kubernetes and TensorFlow Serving", "question": "Illegal instruction error when running tensorflow/serving image on Mac M2 Apple Silicon (potentially on M1 as well)" }, { "text": "Problem: CPU metrics Shows Unknown\nNAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE\ncredit-hpa Deployment/credit <unknown>/20% 1 3 1 18s\nFailedGetResourceMetric 2m15s (x169 over 44m) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API:\nSolution:\n-> Delete HPA (kubectl delete hpa credit-hpa)\n-> kubectl apply -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml\n-> Create HPA\nThis should solve the cpu metrics report issue.\nAdded by Priya V", "section": "11. KServe", "question": "HPA doesn\u2019t show CPU metrics" }, { "text": "Problem description:\nRunning this:\ncurl -s \"https://raw.githubusercontent.com/kserve/kserve/release-0.9/hack/quick_install.sh\" | bash\nFails with errors because of istio failing to update resources, and you are on kubectl > 1.25.0.\nCheck kubectl version with kubectl version\nSolution description\nEdit the file \u201cquick_install.bash\u201d by downloading it with curl without running bash. Edit the versions of Istio and Knative as per the matrix on the KServe website.\nRun the bash script now.\nAdded by Andrew Katoch", "section": "11. KServe", "question": "Errors with istio during installation" }, { "text": "Problem description\nSolution description\n(optional) Added by Name", "section": "Projects (Midterm and Capstone)", "question": "Problem title" }, { "text": "Answer: You can see them here (it\u2019s taken from the 2022 cohort page). Go to the cohort folder for your own cohort\u2019s deadline.", "section": "Projects (Midterm and Capstone)", "question": "What are the project deadlines?" }, { "text": "Answer: All midterms and capstones are meant to be solo projects. [source @Alexey]", "section": "Projects (Midterm and Capstone)", "question": "Are projects solo or collaborative/group work?" 
}, { "text": "Answer: Ideally midterms up to module-06, capstones include all modules in that cohort\u2019s syllabus. But you can include anything extra that you want to feature. Just be sure to document anything not covered in class.\nAlso watch office hours from previous cohorts. Go to DTC youtube channel and click on Playlists and search for {course yyyy}. ML Zoomcamp was first launched in 2021.\nMore discussions:\n[source1] [source2] [source3]", "section": "Projects (Midterm and Capstone)", "question": "What modules, topics, problem-sets should a midterm/capstone project cover? Can I do xyz?" }, { "text": "These links apply to all projects, actually. Again, for some cohorts, the modules/syllabus might be different, so always check in your cohort\u2019s folder as well for additional or different instructions, if any.\nMidterm Project Sample: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp/cohorts/2021/07-midterm-project\nMidTerm Project Deliverables: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp/projects\nSubmit MidTerm Project: https://docs.google.com/forms/d/e/1FAIpQLSfgmOk0QrmHu5t0H6Ri1Wy_FDVS8I_nr5lY3sufkgk18I6S5A/viewform\nDatasets:\nhttps://www.kaggle.com/datasets and https://www.kaggle.com/competitions\nhttps://archive.ics.uci.edu/ml/index.php\nhttps://data.europa.eu/en\nhttps://www.openml.org/search?type=data\nhttps://newzealand.ai/public-data-sets\nhttps://datasetsearch.research.google.com\nWhat to do and Deliverables\nThink of a problem that's interesting for you and find a dataset for that\nDescribe this problem and explain how a model could be used\nPrepare the data and doing EDA, analyze important features\nTrain multiple models, tune their performance and select the best model\nExport the notebook into a script\nPut your model into a web service and deploy it locally with Docker\nBonus points for deploying the service to the cloud", "section": "Projects (Midterm and Capstone)", "question": "Crucial Links" }, { "text": "Answer: Previous cohorts projects page has instructions (youtube).\nhttps://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2022/projects.md#midterm-project\nAlexey and his team will compile a g-sheet with links to submitted projects with our hashed emails (just like when we check leaderboard for homework) that are ours to review within the evaluation deadline.\n~~~ Added by Nukta Bhatia ~~~", "section": "Projects (Midterm and Capstone)", "question": "How to conduct peer reviews for projects?" }, { "text": "See the answer here.", "section": "Projects (Midterm and Capstone)", "question": "Computing the hash for project review" }, { "text": "For the learning in public for this midterm project it seems that has a total value of 14!, Does this mean that we need make 14 posts?, Or the regular seven posts for each module and each one with a value of 2?, Or just one with a total value of 14?\n14 posts, one for each day", "section": "Projects (Midterm and Capstone)", "question": "Learning in public links for the projects" }, { "text": "You can use git-lfs (https://git-lfs.com/) for upload large file to github repository.\nRyan Pramana", "section": "Projects (Midterm and Capstone)", "question": "My dataset is too large and I can't loaded in GitHub , do anyone knows about a solution?" }, { "text": "If you have submitted two projects (and peer-reviewed at least 3 course-mates\u2019 projects for each submission), you will get the certificate for the course. 
According to the course coordinator, Alexey Grigorev, only two projects are needed to get the course certificate.\n(optional) David Odimegwu", "section": "Projects (Midterm and Capstone)", "question": "What If I submitted only two projects and failed to submit the third?" }, { "text": "Yes. You only need to review peers when you submit your project.\nConfirmed on Slack by Alexey Grigorev (added by Rileen Sinha)", "section": "Projects (Midterm and Capstone)", "question": "I did the first two projects and skipped the last one so I wouldn't have two peer review in second capstone right?" }, { "text": "Regarding Point 4 in the midterm deliverables, which states, \"Train multiple models, tune their performance, and select the best model,\" you might wonder, how many models should you train? The answer is simple: train as many as you can. The term \"multiple\" implies having more than one model, so as long as you have more than one, you're on the right track.", "section": "Projects (Midterm and Capstone)", "question": "How many models should I train?" }, { "text": "I am not sure how the project evaluate assignment works? Where do I find this? I have access to all the capstone 2 project, perhaps, I can randomly pick any to review.\nAnswer:\nThe link provided for example (2023/Capstone link ): https://docs.google.com/forms/d/e/1FAIpQLSdgoepohpgbM4MWTAHWuXa6r3NXKnxKcg4NDOm0bElAdXdnnA/viewform contains a list of all submitted projects to be evaluated. More specific, you are to review 3 assigned peer projects. In the spreadsheet are 3 hash values of your assigned peer projects. However, you need to derive the your hash value of your email address and find the value on the spreadsheet under the (reviewer_hash) heading.\nTo calculate your hash value run the python code below:\nfrom hashlib import sha1\ndef compute_hash(email):\nreturn sha1(email.lower().encode('utf-8')).hexdigest()\n# Example usage **** enter your email below (Example1@gmail.com)****\nemail = \"Example1@gmail.com\"\nhashed_email = compute_hash(email)\nprint(\"Original Email:\", email)\nprint(\"Hashed Email (SHA-1):\", hashed_email)\nEdit the above code to replace Example1@gmail.com as your email address\nStore and run the above python code from your terminal. See below as the Hashed Email (SHA-1) value\nYou then go to the link: https://docs.google.com/spreadsheets/d/e/2PACX-1vR-7RRtq7AMx5OzI-tDbkzsbxNLm-NvFOP5OfJmhCek9oYcDx5jzxtZW2ZqWvBqc395UZpHBv1of9R1/pubhtml?gid=876309294&single=true\nLastly, copy the \u201cHashed Email (SHA-1): bd9770be022dede87419068aa1acd7a2ab441675\u201d value and search for 3 identical entries. There you should see your peer project to be reviewed.\nBy Emmanuel Ayeni", "section": "Projects (Midterm and Capstone)", "question": "How does the project evaluation work for you as a peer reviewer?" }, { "text": "Alexey Grigorev: \u201cIt\u2019s based on all the scores to make sure most of you pass.\u201d By Annaliese Bronz\nOther course-related questions that don\u2019t fall into any of the categories above or can apply to more than one category/module", "section": "Miscellaneous", "question": "Do you pass a project based on the average of everyone else\u2019s scores or based on the total score you earn?" }, { "text": "Answer: The train.py file will be used by your peers to review your midterm project. It is for them to cross-check that your training process works on someone else\u2019s system. 
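As a hedged illustration only (not the course's required script), a minimal train.py usually just loads the data, fits the final model, and saves it; all file, column, and model names below are hypothetical placeholders:\nimport pickle\nimport pandas as pd\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.linear_model import LogisticRegression\ndf = pd.read_csv('data.csv')  # hypothetical dataset path\ny = df['target'].values  # hypothetical target column\ndicts = df.drop(columns=['target']).to_dict(orient='records')\ndv = DictVectorizer(sparse=False)\nX = dv.fit_transform(dicts)\nmodel = LogisticRegression()\nmodel.fit(X, y)\nwith open('model.bin', 'wb') as f_out: pickle.dump((dv, model), f_out)\nA peer reviewer can then run python train.py and check that model.bin is produced.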
It should also be included in the environment, in conda or with pipenv.\nOdimegwu David", "section": "Miscellaneous", "question": "Why do I need to provide a train.py file when I already have the notebook.ipynb file?" }, { "text": "pip install pillow - install the pillow library\nfrom PIL import Image\nimg = Image.open('aeroplane.png')\nfrom numpy import asarray\nnumdata = asarray(img)\nKrishna Anand", "section": "Miscellaneous", "question": "Loading the Image with PILLOW library and converting to numpy array" }, { "text": "Ans: train.py has to be a python file. This is because running a python script for training a model is much simpler than running a notebook, and that's how training jobs usually look in real life.", "section": "Miscellaneous", "question": "Is a train.py file necessary when you have a train.ipynb file in your midterm project directory?" }, { "text": "Yes, you can create a mobile app or interface that manages these forms and validations. But you should also perform validations on the backend.\nYou can also check Streamlit: https://github.com/DataTalksClub/project-of-the-week/blob/main/2022-08-14-frontend.md\nAlejandro Aponte", "section": "Miscellaneous", "question": "Is there a way to serve up a form for users to enter data for the model to crunch on?" }, { "text": "Using model.feature_importances_ can give you an error:\nAttributeError: 'Booster' object has no attribute 'feature_importances_'\nAnswer: if you train the model like this: model = xgb.train(...), you should use model.get_score() instead\nEkaterina Kutovaia", "section": "Miscellaneous", "question": "How to get feature importance for XGboost model" }, { "text": "In the Elastic Container Service task log, the error “[Errno 12] Cannot allocate memory” showed up.\nJust increase the RAM and CPU in your task definition.\nHumberto Rodriguez", "section": "Miscellaneous", "question": "[Errno 12] Cannot allocate memory in AWS Elastic Container Service" }, { "text": "When running a docker container with waitress serving the app.py for making predictions, pickle will throw an error that it can't get attribute <name_of_class> on module __main__.\nThis does not happen when Flask is used directly, i.e. not through waitress.\nThe problem is that the model uses a custom column transformer class, and when the model was saved, it was saved from the __main__ module (e.g. python train.py). Pickle will reference the class in the global namespace (top-level code): __main__.<custom_class>.\nWhen using waitress, waitress will load the predict_app module and this will call pickle.load, which will try to find __main__.<custom_class>, which does not exist.\nSolution:\nPut the class into a separate module and import it in both the script that saves the model (e.g. train.py) and the script that loads the model (e.g. 
predict.py)\nNote: If Flask is used (no waitress) in predict.py, and predict.py has the definition of the class, then when it is run with python predict.py it will work, because the class is in the same namespace as the one used when the model was saved (__main__).\nDetailed info: https://stackoverflow.com/questions/27732354/unable-to-load-files-using-pickle-and-multiple-modules\nMarcos MJD", "section": "Miscellaneous", "question": "Pickle error: can’t get attribute XXX on module __main__" }, { "text": "There are different techniques, but the most commonly used are the following:\nDataset transformation (for example, log transformation)\nClipping high values\nDropping these observations\nAlena Kniazeva", "section": "Miscellaneous", "question": "How to handle outliers in a dataset?" }, { "text": "I was getting the error message below when I was trying to create a docker image using bentoml\n[bentoml-cli] `serve` failed: Failed loading Bento from directory /home/bentoml/bento: Failed to import module \"service\": No module named 'sklearn'\nSolution description\nThe cause was that, in bentofile.yaml, I wrote sklearn instead of scikit-learn. The issue was fixed after I modified the packages list as below.\npackages: # Additional pip packages required by the service\n- xgboost\n- scikit-learn\n- pydantic\nAsia Saeed", "section": "Miscellaneous", "question": "Failed loading Bento from directory /home/bentoml/bento: Failed to import module \"service\": No module named 'sklearn'" }, { "text": "You might see a long error message with something about sparse matrices, and in the swagger UI, you get a code 500 error with “” (empty string) as output.\nPotential reason: Setting DictVectorizer or OHE to sparse while training, and then storing this in a pipeline or custom object in the bentoml model saving stage in train.py. This means that when the custom object is called in service.py, it will convert each input to a different sized sparse matrix, and this can't be batched due to inconsistent length. In this case, bentoml model signatures should have batchable set to False for production when saving the bentoml model in train.py.\n(Memoona Tahira)", "section": "Miscellaneous", "question": "BentoML not working with –production flag at any stage: e.g. with bentoml serve and while running the bentoml container" }, { "text": "Problem description:\nDo we have to run everything?\nYou are encouraged, if you can, to run them. As this provides another opportunity to learn from others.\nNot everyone will be able to run all the files, in particular the neural networks.\nSolution description:\nAlternatively, check that everything you need to reproduce the project is there: the dataset is there, the instructions are there, there are no obvious errors, and so on.\nRelated slack conversation here.\n(Gregory Morris)", "section": "Miscellaneous", "question": "Reproducibility" }, { "text": "If your model is too big for GitHub, one option is to try to compress the model using joblib. For example, joblib.dump(model, model_filename, compress=('zlib', 6)) will use zlib to compress the model. 
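A minimal sketch, assuming a trained scikit-learn model object named model and a hypothetical filename:\nimport joblib\njoblib.dump(model, 'model.joblib', compress=('zlib', 6))  # writes a zlib-compressed file (level 6)\nmodel = joblib.load('model.joblib')  # decompression is handled automatically on load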
Just note this could take a few moments as the model is being compressed.\nQuinn Avila", "section": "Miscellaneous", "question": "Model too big" }, { "text": "When you try to push the docker image to Google Container Registry and get the message “unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials.”, type the command below in the console; but first install the Google Cloud SDK (https://cloud.google.com/sdk/docs/install) to be able to use gcloud in the console:\ngcloud auth configure-docker\n(Jesus Acuña)", "section": "Miscellaneous", "question": "Permissions to push docker to Google Container Registry" }, { "text": "I got this error message when I tried to install tflite in a pipenv environment\nError: An error occurred while installing tflite_runtime!\nError text:\nERROR: Could not find a version that satisfies the requirement tflite_runtime (from versions: none)\nERROR: No matching distribution found for tflite_runtime\nThis version of tflite does not run on Python 3.10; the way we can make it work is by installing Python 3.9, after which tflite_runtime installs without problems.\nPastor Soto\nCheck all available versions here:\nhttps://google-coral.github.io/py-repo/tflite-runtime/\nIf you don’t find a combination matching your setup, try out the options at\nhttps://github.com/alexeygrigorev/tflite-aws-lambda/tree/main/tflite\nwhich you can install as shown in the lecture, e.g.\npip install https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.7.0-cp38-cp38-linux_x86_64.whl\nFinally, if nothing works, use the TFLite included in TensorFlow for local development, and use Docker for testing Lambda.\nRileen Sinha (based on discussions on Slack)", "section": "Miscellaneous", "question": "Tflite_runtime unable to install" }, { "text": "Error: ImageDataGenerator name 'scipy' is not defined.\nCheck that scipy is installed in your environment.\nRestart the jupyter kernel and try again.\nMarcos MJD", "section": "Miscellaneous", "question": "Error when running ImageDataGenerator.flow_from_dataframe" }, { "text": "Tim from BentoML has prepared a dedicated video tutorial for this use case here:\nhttps://www.youtube.com/watch?v=7gI1UH31xb4&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=97\nKonrad Muehlberg", "section": "Miscellaneous", "question": "How to pass BentoML content / docker container to Amazon Lambda" }, { "text": "In the model deployment part, I wanted to test my model locally on test-image data and I had this silly error after the following command:\nurl = 'https://github.com/bhasarma/kitchenware-classification-project/blob/main/test-image.jpg'\nX = preprocessor.from_url(url)\nI got the error:\nUnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f797010a590>\nSolution:\nAdd ?raw=true after .jpg in the url, e.g. as below\nurl = ‘https://github.com/bhasarma/kitchenware-classification-project/blob/main/test-image.jpg?raw=true’\nBhaskar Sarma", "section": "Miscellaneous", "question": "Error UnidentifiedImageError: cannot identify image file" }, { "text": "Problem: You run pipenv install and get this message, possibly after manually changing Pipfile and Pipfile.lock.\nSolution: Run `pipenv lock` to fix this problem and regenerate the dependency files\nAlejandro Aponte", "section": "Miscellaneous", "question": "[pipenv.exceptions.ResolutionFailure]: Warning: Your dependencies could not be resolved. 
You likely have a mismatch in your sub-dependencies" }, { "text": "Problem: In the course this function worked to get the features from the DictVectorizer instance: dv.get_feature_names(). But on my computer it did not work. I think it has to do with library versions; apparently that function has been deprecated:\nOld: https://scikit-learn.org/0.22/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.get_feature_names\nNew: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.get_feature_names\nSolution: change the line dv.get_feature_names() to list(dv.get_feature_names_out())\nIbai Irastorza", "section": "Miscellaneous", "question": "Get_feature_names() not found" }, { "text": "This problem happens when contacting the server to send your predict-test request and the data is not in the correct shape.\nThe input to the model wasn\u2019t in the right shape: the server receives the data in JSON format (a dict), which is not suitable for the model. You should convert it to something like a numpy array.\nAhmed Okka", "section": "Miscellaneous", "question": "Error decoding JSON response: Expecting value: line 1 column 1 (char 0)" }, { "text": "Q: Hi folks, I tried deploying my docker image on Render, but it won't work; I get SIGTERM every time.\nI think 0.5 GB RAM is not enough, is there any other free alternative available?\nA: aws (amazon), gcp (google), saturn.\nBoth AWS and GCP give a micro instance for free for a VERY long time, and a bunch more free stuff.\nSaturn even provides free GPU instances. Recent promo link from mlzoomcamp for Saturn:\n\u201cYou can sign up here: https://bit.ly/saturn-mlzoomcamp\nWhen you sign up, write in the chat box that you're an ML Zoomcamp student and you should get extra GPU hours (something like 150)\u201d\nAdded by Andrii Larkin", "section": "Miscellaneous", "question": "Free cloud alternatives" }, { "text": "Problem description: I have one column day_of_the_month. It has values 1, 2, 20, 25 etc. and is int. I have a second column month_of_the_year. It has values jan, feb, ..., dec and is string. I want to convert these two columns into one column day_of_the_year and I want them to be int. 2 and jan should give me 2, i.e. the 2nd day of the year; 1 and feb should give me 32, i.e. the 32nd day of the year. 
What is the simplest pandas-way to do it?\nSolution description:\nconvert dtype in day_of_the_month column from int to str with df['day_of_the_month'] = df['day_of_the_month'].map(str)\nconvert month_of_the_year column in jan, feb ...,dec into 1,2, ..,12 string using map()\nconvert day and month into a datetime object with:\ndf['date_formatted'] = pd.to_datetime(\ndict(\nyear='2055',\nmonth=df['month'],\nday=df['day']\n)\n)\nget day of year with: df['day_of_year']=df['date_formatted'].dt.dayofyear\n(Bhaskar Sarma)", "section": "Miscellaneous", "question": "Getting day of the year from day and month column" }, { "text": "How to visualize the predictions per classes after training a neural net\nSolution description\nclasses, predictions = zip(*dict(zip(classes, predictions)).items())\nplt.figure(figsize=(12, 3))\nplt.bar(classes, predictions)\nLuke", "section": "Miscellaneous", "question": "Chart for classes and predictions" }, { "text": "You can convert the prediction output values to a datafarme using \ndf = pd.DataFrame.from_dict(dict, orient='index' , columns=[\"Prediction\"])\nEdidiong Esu", "section": "Miscellaneous", "question": "Convert dictionary values to Dataframe table" }, { "text": "The image dataset for the competition was in a different layout from what we used in the dino vs dragon lesson. Since that\u2019s what was covered, some folks were more comfortable with that setup, so I wrote a script that would generate it for them\nIt can be found here: kitchenware-dataset-generator | Kaggle\nMartin Uribe", "section": "Miscellaneous", "question": "Kitchenware Classification Competition Dataset Generator" }, { "text": "Install Nvidia drivers: https://www.nvidia.com/download/index.aspx.\nWindows:\nInstall Anaconda prompt https://www.anaconda.com/\nTwo options:\nInstall package \u2018tensorflow-gpu\u2019 in Anaconda\nInstall the Tensorflow way https://www.tensorflow.org/install/pip#windows-native\nWSL/Linux:\nWSL: Use the Windows Nvida drivers, do not touch that.\nTwo options:\nInstall the Tensorflow way https://www.tensorflow.org/install/pip#linux_1\nMake sure to follow step 4 to install CUDA by environment\nAlso run:\necho \u2018export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh\nInstall CUDA toolkit 11.x.x https://developer.nvidia.com/cuda-toolkit-archive\nInstall https://developer.nvidia.com/rdp/cudnn-download\nNow you should be able to do training/inference with GPU in Tensorflow\n(Learning in public links Links to social media posts where you share your progress with others (LinkedIn, Twitter, etc). Use #mlzoomcamp tag. The scores for this part will be capped at 7 points. Please make sure the posts are valid URLs starting with \"https://\" Does it mean that I should provide my linkedin link? or it means that I should write a post that I have completed my first assignement? 
(\nANS (by ezehcp7482@gmail.com): Yes, provide the linkedIN link to where you posted.\nezehcp7482@gmail.com:\nPROBLEM: Since I had to put up a link to a public repository, I had to use Kaggle and uploading the dataset therein was a bit difficult; but I had to \u2018google\u2019 my way out.\nANS: See this link for a guide (https://www.kaggle.com/code/dansbecker/finding-your-files-in-kaggle-kernels/notebook)", "section": "Miscellaneous", "question": "CUDA toolkit and cuDNN Install for Tensorflow" }, { "text": "When multiplying matrices, the order of multiplication is important.\nFor example:\nA (m x n) * B (n x p) = C (m x p)\nB (n x p) * A (m x n) = D (n x n)\nC and D are matrices of different sizes and usually have different values. Therefore the order is important in matrix multiplication and changing the order changes the result.\nBaran Ak\u0131n", "section": "Miscellaneous", "question": "About getting the wrong result when multiplying matrices" }, { "text": "Refer to https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/06-environment.md\n(added by Rileen Sinha)", "section": "Miscellaneous", "question": "None of the videos have how to install the environment in Mac, does someone have instructions for Mac with M1 chip?" }, { "text": "Depends on whether the form will still be open. If you're lucky and it's open, you can submit your homework and it will be evaluated. if closed - it's too late.\n(Added by Rileen Sinha, based on answer by Alexey on Slack)", "section": "Miscellaneous", "question": "I may end up submitting the assignment late. Would it be evaluated?" }, { "text": "Yes. Whoever corrects the homework will only be able to access the link if the repository is public.\n(added by Tano Bugelli)\nHow to install Conda environment in my local machine?\nWhich ide is recommended for machine learning?", "section": "Miscellaneous", "question": "Does the github repository need to be public?" }, { "text": "Install w get:\n!which wget\nDownload data:\n!wget -P /content/drive/My\\ Drive/Downloads/ URL\n(added by Paulina Hernandez)", "section": "Miscellaneous", "question": "How to use wget with Google Colab?" }, { "text": "Features (X) must always be formatted as a 2-D array to be accepted by scikit-learn.\nUse reshape to reshape a 1D array to a 2D.\n\t\t\t\t\t\t\t(-Aileah) :>\n(added by Tano\nfiltered_df = df[df['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]\n# Select only the desired columns\nselected_columns = [\n'latitude',\n'longitude',\n'housing_median_age',\n'total_rooms',\n'total_bedrooms',\n'population',\n'households',\n'median_income',\n'median_house_value'\n]\nfiltered_df = filtered_df[selected_columns]\n# Display the first few rows of the filtered DataFrame\nprint(filtered_df.head())", "section": "Miscellaneous", "question": "Features in scikit-learn?" }, { "text": "FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead", "section": "Miscellaneous", "question": "When I plotted using Matplot lib to check if median has a tail, I got the error below how can one bypass?" 
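One way to act on that warning (a sketch with made-up data; either switch to the isinstance check the warning suggests, or silence FutureWarnings while plotting):
```
import warnings
import pandas as pd

df = pd.DataFrame({"median_house_value": [1.0, 2.5, 3.0, 120.0]})  # stand-in data

# follow the warning's advice: check the dtype with isinstance instead of is_categorical_dtype
is_cat = isinstance(df["median_house_value"].dtype, pd.CategoricalDtype)

# or simply suppress FutureWarnings before plotting the histogram
warnings.filterwarnings("ignore", category=FutureWarning)
df["median_house_value"].hist(bins=50)
```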
}, { "text": "When trying to rerun the docker file in Windows, as opposed to developing in WSL/Linux, I got the error of:\n```\nWarning: Python 3.11 was not found on your system\u2026\nNeither \u2018pipenv\u2019 nor \u2018asdf\u2019 could be found to install Python.\nYou can specify specific versions of Python with:\n$ pipenv \u2013python path\\to\\python\n```\nThe solution was to add Python311 installation folder to the PATH and restart the system and run the docker file again. That solved the error.\n(Added by Abhijit Chakraborty)", "section": "Miscellaneous", "question": "Reproducibility in different OS" }, { "text": "You may quickly deploy your project to DigitalOcean App Cloud. The process is relatively straightforward. The deployment costs about 5 USD/month. The container needs to be up until the end of the project evaluation.\nSteps:\nRegister in DigitalOcean\nGo to Apps -> Create App.\nYou will need to choose GitHub as a service provider.\nEdit Source Directory (if your project is not in the repo root)\nIMPORTANT: Go to settings -> App Spec and edit the Dockerfile path so it looks like ./project/Dockerfile path relative to your repo root\nRemember to add model files if they are not built automatically during the container build process.\nBy Dmytro Durach", "section": "Miscellaneous", "question": "Deploying to Digital Ocean" }, { "text": "I\u2019m just looking back at the lessons in week 3 (churn prediction project), and lesson 3.6 talks about Feature Importance for categorical values. At 8.12, the mutual info scores show that the some features are more important than others, but then in lesson 3.10 the Logistic Regression model is trained on all of the categorical variables (see 1:35). Once we have done feature importance, is it best to train your model only on the most important features?\nNot necessarily - rather, any feature that can offer additional predictive value should be included (so, e.g. predict with & without including that feature; if excluding it drops performance, keep it, else drop it). A few individually important features might in fact be highly correlated with others, & dropping some might be fine. There are many feature selection algorithms, it might be interesting to read up on them (among the methods we've learned so far in this course, L1 regularization (Lasso) implicitly does feature selection by shrinking some weights all the way to zero).\nBy Rileen Sinha", "section": "Miscellaneous", "question": "Is it best to train your model only on the most important features?" }, { "text": "You can consider several different approaches:\nSampling: In the exploratory phase, you can use random samples of the data.\nChunking: When you do need all the data, you can read and process it in chunks that do fit in the memory.\nOptimizing data types: Pandas\u2019 automatic data type inference (when reading data in) might result in e.g. float64 precision being used to represent integers, which wastes space. You might achieve substantial memory reduction by optimizing the data types.\nUsing Dask, an open-source python project which parallelizes Numpy and Pandas.\n(see, e.g. https://www.vantage-ai.com/en/blog/4-strategies-how-to-deal-with-large-datasets-in-pandas)\nBy Rileen Sinha", "section": "Miscellaneous", "question": "How can I work with very large datasets, e.g. the New York Yellow Taxi dataset, with over a million rows?" }, { "text": "Technically, yes. Advisable? Not really. 
Reasons:\nSome homework(s) asks for specific python library versions.\nAnswers may not match in MCQ options if using different languages other than Python 3.10 (the recommended version for 2023 cohort)\nAnd as for midterms/capstones, your peer-reviewers may not know these other languages. Do you want to be penalized for others not knowing these other languages?\nYou can create a separate repo using course\u2019s lessons but written in other languages for your own learnings, but not advisable for submissions.\ntx[source]", "section": "Miscellaneous", "question": "Can I do the course in other languages, like R or Scala?" }, { "text": "Yes, it\u2019s allowed (as per Alexey).\nAdded By Rileen Sinha", "section": "Miscellaneous", "question": "Is use of libraries like fast.ai or huggingface allowed in the capstone and competition, or are they considered to be \"too much help\"?" }, { "text": "The TF and TF Serving versions have to match (as per solution from the slack channel)\nAdded by Chiedu Elue", "section": "Miscellaneous", "question": "Flask image was built and tested successfully, but tensorflow serving image was built and unable to test successfully. What could be the problem?" }, { "text": "I\u2019ve seen LinkedIn users list DataTalksClub as Experience with titles as:\nMachine Learning Fellow\nMachine Learning Student\nMachine Learning Participant\nMachine Learning Trainee\nPlease note it is best advised that you do not list the experience as an official \u201cjob\u201d or \u201cinternship\u201d experience since DataTalksClub did not hire you, nor financially compensate you.\nOther ways you can incorporate the experience in the following sections:\nOrganizations\nProjects\nSkills\nFeatured\nOriginal posts\nCertifications\nCourses\nBy Annaliese Bronz\nInteresting question, I put the link of my project into my CV as showcase and make posts to show my progress.\nBy Ani Mkrtumyan", "section": "Miscellaneous", "question": "Any advice for adding the Machine Learning Zoomcamp experience to your LinkedIn profile?" } ] }, { "course": "mlops-zoomcamp", "documents": [ { "text": "MLOps Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course, and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\n[Problem description]\n[Solution description]\n(optional) Added by Name", "section": "+-General course questions", "question": "Format for questions: [Problem title]" }, { "text": "Approximately 3 months. For each module, about 1 week with possible deadline extensions (in total 6~9 weeks), 2 weeks for working on the capstone project and 1 week for peer review.", "section": "+-General course questions", "question": "What is the expected duration of this course or that for each module?" }, { "text": "The difference is the Orchestration and Monitoring modules. Those videos will be re-recorded. The rest should mostly be the same.\nAlso all of the homeworks will be changed for the 2023 cohort.", "section": "+-General course questions", "question": "What\u2019s the difference between the 2023 and 2022 course?" }, { "text": "Yes, it will start in May 2024", "section": "+-General course questions", "question": "Will there be a 2024 Cohort? When will the 2024 cohort start?" }, { "text": "Please choose the closest one to your answer. 
Also do not post your answer in the course slack channel.", "section": "+-General course questions", "question": "What if my answer is not exactly the same as the choices presented?" }, { "text": "Please pick up a problem you want to solve yourself. Potential datasets can be found on either Kaggle, Hugging Face, Google, AWS, or the UCI Machine Learning Datasets Repository.", "section": "+-General course questions", "question": "Are we free to choose our own topics for the final project?" }, { "text": "In order to obtain the certificate, completion of the final capstone project is mandatory. The completion of weekly homework assignments is optional, but they can contribute to your overall progress and ranking on the top 100 leaderboard.", "section": "+-General course questions", "question": "Can I still graduate when I didn\u2019t complete homework for week x?" }, { "text": "You can get a few cloud points by using kubernetes even if you deploy it only locally. Or you can use local stack too to mimic AWS\nAdded by Ming Jun, Asked by Ben Pacheco, Answered by Alexey Grigorev", "section": "Module 1: Introduction", "question": "For the final project, is it required to be put on the cloud?" }, { "text": "For those who are not using VSCode (or other similar IDE), you can automate port-forwarding for Jupyter Notebook by adding the following line of code to your\n~/.ssh/config file (under the mlops-zoomcamp host):\nLocalForward 127.0.0.1:8899 127.0.0.1:8899\nThen you can launch Jupyter Notebook using the following command: jupyter notebook --port=8899 --no-browser and copy paste the notebook URL into your browser.\nAdded by Vishal", "section": "Module 1: Introduction", "question": "Port-forwarding without Visual Studio" }, { "text": "You can install the Jupyter extension to open notebooks in VSCode.\nAdded by Khubaib", "section": "Module 1: Introduction", "question": "Opening Jupyter in VSCode" }, { "text": "In case one would like to set a github repository (e.g. for Homeworks), one can follow 2 great tutorials that helped a lot\nSetting up github on AWS instance - this\nSetting up keys on AWS instance - this\nThen, one should be able to push to its repo\nAdded by Daniel Hen (daniel8hen@gmail.com)", "section": "Module 1: Introduction", "question": "Configuring Github to work from the remote VM" }, { "text": "Faced issue while setting up JUPYTER NOTEBOOK on AWS. I was unable to access it from my desktop. (I am not using visual studio and hence faced problem)\nRun\njupyter notebook --generate-config\nEdit file /home/ubuntu/.jupyter/jupyter_notebook_config.py to add following line:\nNotebookApp.ip = '*'\nAdded by Atul Gupta (samatul@gmail.com)", "section": "Module 1: Introduction", "question": "Opening Jupyter in AWS" }, { "text": "If you wish to use WSL on your windows machine, here are the setup instructions:\nCommand: Sudo apt install wget\nGet Anaconda download address here. 
wget <download address>\nTurn on Docker Desktop WSL2\nCommand: git clone <github repository address>\nVSCODE on WSL\nJupyter: pip3 install jupyter\nAdded by Gregory Morris (gwm1980@gmail.com)\nAll-in-one software shop:\nYou can use Anaconda, which has built-in services like PyCharm and Jupyter\nAdded by Khaja Zaffer (khajazaffer@aln.iseg.ulisboa.pt)\nFor windows \u201cwsl --install\u201d in Powershell\nAdded by Vadim Surin (vdmsurin@gmai.com)", "section": "Module 1: Introduction", "question": "WSL instructions" }, { "text": "If you create a folder data and download datasets or raw files into your local repository, then to push all your code to the remote repository without these files or folders, use a .gitignore file. The simple way to create it is to follow these steps:\n1. Create an empty .txt file (using a text editor or the command line)\n2. Save it as .gitignore (you must use the dot symbol)\n3. Add rules\n *.parquet - to ignore all parquet files\ndata/ - to ignore all files in folder data\n\nFor more patterns read the Git documentation\nhttps://git-scm.com/docs/gitignore\nAdded by Olga Rudakova (olgakurgan@gmail.com)", "section": "Module 1: Introduction", "question": ".gitignore how-to" }, { "text": "Make sure when you stop an EC2 instance that it actually stops (there's a meme about it somewhere). There are green circles (running), orange (stopping), and red (stopped). Always refresh the page to make sure you see the red circle and status of stopped.\nEven when an EC2 instance is stopped, there WILL be other charges that are incurred (e.g. if you uploaded data to the EC2 instance, this data has to be stored somewhere, usually an EBS volume, and this storage incurs a cost).\nYou can set up billing alerts. (I've never done this, so no advice on how to do this).\n(Question by: Akshit Miglani (akshit.miglani09@gmail.com) and Answer by Anna Vasylytsya)", "section": "Module 1: Introduction", "question": "AWS suggestions" }, { "text": "You can get an invitation code via Coursera and use it in your account to verify it; it has different characteristics.\nI really love it\nhttps://www.youtube.com/watch?v=h_GdX6KtXjo", "section": "Module 1: Introduction", "question": "IBM Cloud an alternative for AWS" }, { "text": "I am worried about the cost of keeping an AWS instance running during the course.\nWith the instance specified during the working environment setup, remember to Stop Instance once you have finished your work for the day. Using that strategy, on a day with about 5 hours of work you will pay around $0.40 USD, which adds up to about $12 USD per month, which seems to be an affordable amount.\nYou must remember that you will have a different public IP address every time you restart your instance, and you will need to edit your ssh config file. It's worth the time though.\nAdditionally, AWS enables you to set up an automatic email alert if a predefined budget is exceeded.\nHere is a tutorial to set this up.\nAlso, you can estimate the cost yourself, using the AWS pricing calculator (to use it you don\u2019t even need to be logged in).\nAt the time of writing (20.05.2023) a t3a.xlarge instance with 2 hr/day usage (which translates to 10 hr/week, which should be enough to complete the course) and 30GB EBS costs 10.14 USD monthly.\nHere\u2019s a link to the estimate\nAdded by Alex Litvinov (aaalex.lit@gmail.com)", "section": "Module 1: Introduction", "question": "AWS costs" }, { "text": "For many parts - yes. 
Some things like kinesis are not in AWS free tier, but you can do it locally with localstack.", "section": "Module 1: Introduction", "question": "Is the AWS free tier enough for doing this course?" }, { "text": "When I click an open IP-address in an AWS EC2 instance I get an error: \u201cThis site can\u2019t be reached\u201d. What should I do?\nThis ip-address is not required to be open in a browser. It is needed to connect to the running EC2 instance via terminal from your local machine or via terminal from a remote server with such command, for example if:\nip-address is 11.111.11.111\ndownloaded key name is razer.pem (the key should be moved to a hidden folder .ssh)\nyour user name is user_name\nssh -i /Users/user_name/.ssh/razer.pem ubuntu@11.111.11.111", "section": "Module 1: Introduction", "question": "AWS EC2: this site can\u2019t be reached" }, { "text": "After this command `ssh -i ~/.ssh/razer.pem ubuntu@XX.XX.XX.XX` I got this error: \"unprotected private key file\". This page (https://99robots.com/how-to-fix-permission-error-ssh-amazon-ec2-instance/) explains how to fix this error. Basically you need to change the file permissions of the key file with this command: chmod 400 ~/.ssh/razer.pem", "section": "Module 1: Introduction", "question": "Unprotected private key file!" }, { "text": "My SSH connection to AWS cannot last more than a few minutes, whether via terminal or VS code.\nMy config:\n# Copy Configuration in local nano editor, then Save it!\nHost mlops-zoomcamp # ssh connection calling name\nUser ubuntu # username AWS EC2\nHostName <instance-public-IPv4-addr> # Public IP, it changes when Source EC2 is turned off.\nIdentityFile ~/.ssh/name-of-your-private-key-file.pem # Private SSH key file path\nLocalForward 8888 localhost:8888 # Connecting to a service on an internal network from the outside, static forward or set port user forward via on vscode\nStrictHostKeyChecking no\nAdded by Muhammed \u00c7elik\nThe disconnection will occur whether I SSH via WSL2 or via VS Code, and usually occurs after I run some code, i.e. \u201cimport mlflow\u201d, so not particularly intense computation.\nI cannot reconnect to the instance without stopping and restarting with a new IPv4 address.\nI\u2019ve gone through steps listed on this page: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-resolve-ssh-connection-errors/\nInbound rule should allow all incoming IPs for SSH.\nWhat I expect to happen:\nSSH connection should remain while I\u2019m actively using the instance, and if it does disconnect, I should be able to reconnect back.\nSolution: sometimes the hang ups are caused by the instance running out of memory. In one instance, using EC2 feature to view screenshot of the instance as a means to troubleshoot, it was the OS out-of-memory feature which killed off some critical processes. In this case, if we can\u2019t use a higher compute VM with more RAM, try adding a swap file, which uses the disk as RAM substitute and prevents the OOM error. 
Follow Ubuntu\u2019s documentation here: https://help.ubuntu.com/community/SwapFaq.\nAlternatively follow AWS\u2019s own doc, which mirrors Ubuntu\u2019s: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-memory-swap-file/", "section": "Module 1: Introduction", "question": "AWS EC2 instance constantly drops SSH connection" }, { "text": "Everytime I restart my EC2 instance I keep getting different IP and need to update the config file manually.\n\nSolution: You can create a script like this to automatically update the IP address of your EC2 instance.https://github.com/dimzachar/mlops-zoomcamp/blob/master/notes/Week_1/update_ssh_config.md", "section": "Module 1: Introduction", "question": "AWS EC2 IP Update" }, { "text": "Make sure to use an instance with enough compute capabilities such as a t2.xlarge. You can check the monitoring tab in the EC2 dashboard to monitor your instance.", "section": "Module 1: Introduction", "question": "VS Code crashes when connecting to Jupyter" }, { "text": "Error \u201cValueError: X has 526 features, but LinearRegression is expecting 525 features as input.\u201d when running your Linear Regression Model on the validation data set:\nSolution: The DictVectorizer creates an initial mapping for the features (columns). When calling the DictVecorizer again for the validation dataset transform should be used as it will ignore features that it did not see when fit_transform was last called. E.g.\nX_train = dv.fit_transform(train_dict)\nX_test = dv.transform(test_dict)", "section": "Module 1: Introduction", "question": "X has 526 features, but expecting 525 features" }, { "text": "If some dependencies are missing\nInstall following packages\npandas\nmatplotlib\nscikit-learn\nfastparquet\npyarrow\nseaborn\npip install -r requirements.txt\nI have seen this error when using pandas.read_parquet(), the solution is to install pyarrow or fastparquet by doing !pip install pyarrow in the notebook\nNOTE: if you\u2019re using Conda instead of pip, install fastparquet rather than pyarrow, as it is much easier to install and it\u2019s functionally identical to pyarrow for our needs.", "section": "Module 1: Introduction", "question": "Missing dependencies" }, { "text": "The evaluation RMSE I get doesn\u2019t figure within the options!\nIf you\u2019re evaluating the model on the entire February data, try to filter outliers using the same technique you used on the train data (0\u2264duration\u226460) and you\u2019ll get a RMSE which is (approximately) in the options. Also don\u2019t forget to convert the columns data types to str before using the DictVectorizer.\nAnother option: Along with filtering outliers, additionally filter on null values by replacing them with -1. You will get a RMSE which is (almost same as) in the options. Use \u2018.round(2)\u2019 method to round it to 2 decimal points.\nWarning deprecation\nThe python interpreter warning of modules that have been deprecated and will be removed in future releases as well as making suggestion how to go about your code.\nFor example\nC:\\ProgramData\\Anaconda3\\lib\\site-packages\\seaborn\\distributions.py:2619:\nFutureWarning: `distplot` is a deprecated function and will be removed in a future version. 
Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).\nwarnings.warn(msg, FutureWarning)\nTo suppress the warnings, you can include this code at the beginning of your notebook\nimport warnings\nwarnings.filterwarnings(\"ignore\")", "section": "Module 1: Introduction", "question": "No RMSE value in the options" }, { "text": "sns.distplot(df_train[\"duration\"])\nCan be replaced with\nsns.histplot(\ndf_train[\"duration\"] , kde=True,\nstat=\"density\", kde_kws=dict(cut=3), bins=50,\nalpha=.4, edgecolor=(1, 1, 1, 0.4),\n)\nTo get almost identical result", "section": "Module 1: Introduction", "question": "How to replace distplot with histplot" }, { "text": "You need to replace the capital letter \u201cL\u201d with a small one \u201cl\u201d", "section": "Module 1: Introduction", "question": "KeyError: 'PULocationID' or 'DOLocationID'" }, { "text": "I have faced a problem while reading the large parquet file. I tried some workarounds but they were NOT successful with Jupyter.\nThe error message is:\nIndexError: index 311297 is out of bounds for axis 0 with size 131743\nI solved it by performing the homework directly as a python script.\nAdded by Ibraheem Taha (ibraheemtaha91@gmail.com)\nYou can try using the Pyspark library\nAnswered by kamaldeen (kamaldeen32@gmail.com)", "section": "Module 1: Introduction", "question": "Reading large parquet files" }, { "text": "First remove the outliers (trips with unusual duration) before plotting\nAdded by Ibraheem Taha (ibraheemtaha91@gmail.com)", "section": "Module 1: Introduction", "question": "Distplot takes too long" }, { "text": "Problem: RMSE on test set was too high when hot encoding the validation set with a previously fitted OneHotEncoder(handle_unknown=\u2019ignore\u2019) on the training set, while DictVectorizer would yield the correct RMSE.\nIn principle both transformers should behave identically when treating categorical features (at least in this week\u2019s homework where we don\u2019t have sequences of strings in each row):\nFeatures are put into binary columns encoding their presence (1) or absence (0)\nUnknown categories are imputed as zeroes in the hot-encoded matrix", "section": "Module 1: Introduction", "question": "RMSE on test set too high" }, { "text": "A: Alexey\u2019s answer https://www.youtube.com/watch?v=8uJ36ZZr_Is&t=13s\nIn summary,\npd.get_dummies or OHE can come up with result in different orders and handle missing data differently, so train and val set would have different columns during train and validation\nDictVectorizer would ignore missing (in train) and new (in val) datasets\nOther sources:\nhttps://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor\nhttps://scikit-learn.org/stable/modules/feature_extraction.html\nhttps://innovation.alteryx.com/encode-smarter/\n~ ellacharmed", "section": "Module 1: Introduction", "question": "Q: Using of OneHotEncoder instead of DictVectorizer" }, { "text": "Why didn't get_dummies in pandas library or OneHotEncoder in scikit-learn library be used for one-hot encoding? I know OneHotEncoder is the most common and useful. One-hot coding can also be done using the eye or identity components of the NumPy library.\nM.Sari\nOneHotEncoder has the option to output a row column tuple matrix. 
DictVectorizer is a one step method to encode and support row column tuple matrix output.\nHarinder(sudwalh@gmail.com)", "section": "Module 1: Introduction", "question": "Q: Why did we not use OneHotEncoder(sklearn) instead of DictVectorizer ?" }, { "text": "How to check that we removed the outliers?\nUse the pandas function describe() which can provide a report of the data distribution along with the statistics to describe the data. For example, after clipping the outliers using boolean expression, the min and max can be verified using\ndf[\u2018duration\u2019].describe()", "section": "Module 1: Introduction", "question": "Clipping outliers" }, { "text": "pd.get_dummies and DictVectorizer both create a one-hot encoding on string values. Therefore you need to convert the values in PUlocationID and DOlocationID to string.\nIf you convert the values in PUlocationID and DOlocationID from numeric to string, the NaN values get converted to the string \"nan\". With DictVectorizer the RMSE is the same whether you use \"nan\" or \"-1\" as string representation for the NaN values. Therefore the representation doesn't have to be \"-1\" specifically, it could also be some other string.", "section": "Module 1: Introduction", "question": "Replacing NaNs for pickup location and drop off location with -1 for One-Hot Encoding" }, { "text": "Problem: My LinearRegression RSME is very close to the answer but not exactly the same. Is this normal?\nAnswer: No, LinearRegression is an deterministic model, it should always output the same results when given the same inputs.\nAnswer:\nCheck if you have treated the outlier properly for both train and validation sets\nCheck if the one hot encoding has been done properly by looking at the shape of one hot encoded feature matrix. If it shows 2 features, there is wrong with one hot encoding. Hint: the drop off and pick up codes need to be converted to proper data format and then DictVectorizer is fitted.\nHarshit Lamba (hlamba19@gmail.com)", "section": "Module 1: Introduction", "question": "Slightly different RSME" }, { "text": "Problem: I\u2019m facing an extremely low RMSE score (eg: 4.3451e-6) - what shall I do?\nAnswer: Recheck your code to see if your model is learning the target prior to making the prediction. If the target variable is passed in as a parameter while fitting the model, chances are the model would score extremely low. However, that\u2019s not what you would want and would much like to have your model predict that. A good way to check that is to make sure your X_train doesn\u2019t contain any part of your y_train. The same stands for validation too.\nSnehangsu De (desnehangsu@gmail.com)", "section": "Module 1: Introduction", "question": "Extremely low RSME" }, { "text": "Problem: how to enable auto completion in jupyter notebook? Tab doesn\u2019t work for me\nSolution: !pip install --upgrade jedi==0.17.2\nChristopher R.J.(romanjaimesc@gmail.com)", "section": "Module 1: Introduction", "question": "Enabling Auto-completion in jupyter notebook" }, { "text": "Problem: While following the steps in the videos you may have problems trying to download with wget the files. 
Usually it is a 403 error type (Forbidden access).\nSolution: The links point to files on cloudfront.net, something like this:\nhttps://d37ci6vzurychx.cloudfront.net/trip+data/green_tripdata_2021-01.parquet\nI don\u2019t download the dataset directly; I use the dataset URL and read it directly in the file.\nUpdate(27-May-2023): Vikram\nI am able to download the data from the below link. This is from the official NYC trip record page (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Copy the link from the page directly, as the below url might change if the NYC decides to move away from this. Go to the page, right click and use copy link.\nwget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-01.parquet\n(Asif)\nCopy the link address and replace the cloudfront.net part with s3.amazonaws.com/nyc-tlc/, so it looks like this:\nhttps://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-01.parquet\nMario Tormo (mario@tormo-romero.eu)", "section": "Module 1: Introduction", "question": "Downloading the data from the NY Taxis datasets gives error : 403 Forbidden" }, { "text": "Problem: PyCharm (remote) doesn\u2019t see the conda execution path. So, I cannot use the conda env (which is located on a remote server).\nSolution: On the remote server, in the command line, write \u201cconda activate envname\u201d, then write \u201cwhich python\u201d - it gives you the python execution path. After that you can use this path when you add a new interpreter in PyCharm: add local interpreter -> system interpreter -> and put the path with python.\nSalimov Ilnaz (salimovilnaz777@gmail.com)", "section": "Module 1: Introduction", "question": "Using PyCharm & Conda env in remote development" }, { "text": "Problem: The output of DictVectorizer was taking up too much memory. So much so, that I couldn\u2019t even fit the linear regression model before running out of memory on my 16 GB machine.\nSolution: In the example for DictVectorizer on the scikit-learn website, they set the parameter \u201csparse\u201d to False. Although this helps with viewing the results, it results in a lot of memory usage. The solution is to either use \u201csparse=True\u201d instead, or leave it at the default, which is also True.\nAhmed Fahim (afahim03@yahoo.com)", "section": "Module 1: Introduction", "question": "Running out of memory" }, { "text": "Problem: For me, installing Anaconda didn\u2019t modify the .bashrc profile. That means the Anaconda env was not activated even after exiting and relaunching the unix shell.\nSolution:\nFor bash: Initiate conda again, which will add entries for anaconda to the .bashrc file.\n$ cd YOUR_PATH_ANACONDA/bin\n$ ./conda init bash\nThat will automatically edit your .bashrc.\nReload:\n$ source ~/.bashrc\nAhamed Irshad (daisyfuentesahamed@gmail.com)", "section": "Module 1: Introduction", "question": "Activating Anaconda env in .bashrc" }, { "text": "While working through HW1, you will realize that the training and the validation data set feature sizes are different. I was trying to figure out why and went down the entire rabbit hole, only to see that I was doing ```fit_transform``` on the premade dictionary vectorizer instead of ```transform```. 
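A tiny sketch of the difference (the location IDs below are made up):
```
from sklearn.feature_extraction import DictVectorizer

train_dicts = [{"PULocationID": "43", "DOLocationID": "151"},
               {"PULocationID": "166", "DOLocationID": "239"}]
val_dicts = [{"PULocationID": "43", "DOLocationID": "7"}]  # "7" never appeared in training

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)  # learns the feature mapping from the training data
X_val = dv.transform(val_dicts)          # reuses that mapping, so the column counts match

assert X_train.shape[1] == X_val.shape[1]
```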
You already have the dictionary vectorizer made so no need to execute the fit pipeline on the model.\nSam Lim(changhyeonlim@gmail.com)", "section": "Module 1: Introduction", "question": "The feature size is different for training set and validation set" }, { "text": "I found a good guide how to get acces to your machine again when you removed your public key.\nUsing the following link you can go to Session Manager and log in to your instance and create public key again. https://repost.aws/knowledge-center/ec2-linux-fix-permission-denied-errors\nThe main problem for me here was to get my old public key, so for doing this you should run the following command: ssh-keygen -y -f /path_to_key_pair/my-key-pair.pem\nFor more information: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/describe-keys.html#retrieving-the-public-key\nHanna Zhukavets (a.zhukovec1901@gmail.com)", "section": "Module 1: Introduction", "question": "Permission denied (publickey) Error (when you remove your public key on the AWS machine)" }, { "text": "Problem: The February dataset has been used as a validation/test dataset and been stripped of the outliers in a similar manner to the train dataset (taking only the rows for the duration between 1 and 60, inclusive). The RMSE obtained afterward is in the thousands.\nAnswer: The sparsematrix result from DictVectorizer shouldn\u2019t be turned into an ndarray. After removing that part of the code, I ended up receiving a correct result .\nTahina Mahatoky (tahinadanny@gmail.com)", "section": "Module 1: Introduction", "question": "Overfitting: Absurdly high RMSE on the validation dataset" }, { "text": "more specific error line:\nfrom sklearn.feature_extraction import DictVectorizer\nI had this issue and to solve it I did\n!pip install scikit-learn\nJoel Auccapuclla (auccapuclla 2013@gmail.com)", "section": "Module 2: Experiment tracking", "question": "Can\u2019t import sklearn" }, { "text": "Problem: Localhost:5000 Unavailable // Access to Localhost Denied // You don\u2019t have authorization to view this page (127.0.0.1:5000)\n\nSolution: If you are on an chrome browser you need to head to `chrome://net-internals/#sockets` and press \u201cFlush Socket Pools\u201d", "section": "Module 2: Experiment tracking", "question": "Access Denied at Localhost:5000 - Authorization Issue" }, { "text": "You have something running on the 5000 port. 
You need to stop it.\nAnswer: On terminal in mac .\nRun ps -A | grep gunicorn\nLook for the number process id which is the 1st number after running the command\nkill 13580\nwhere 13580 represents the process number.\nSource\nwarrie.warrieus@gmail.com\nOr by executing the following command it will kill all the processes using port 5000:\n>> sudo fuser -k 5000/tcp\nAnswered by Vaibhav Khandelwal\nJust execute in the command below in he command line to kill the running port\n->> kill -9 $(ps -A | grep python | awk '{print $1}')\nAnswered by kamaldeen (kamaldeen32@gmail.com)\nChange to different port (5001 in this case)\n>> mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5001\nAnswered by krishna (nellaikrishna@gmail.com)", "section": "Module 2: Experiment tracking", "question": "Connection in use: ('127.0.0.1', 5000)" }, { "text": "Running python register_model.py results in the following error:\nValueError: could not convert string to float: '0 int\\n1 float\\n2 hyperopt_param\\n3 Literal{n_estimators}\\n4 quniform\\n5 Literal{10}\\n6 Literal{50}\\n7 Literal{1}'\nFull Traceback:\nTraceback (most recent call last):\nFile \"/Users/name/Desktop/Programming/DataTalksClub/MLOps-Zoomcamp/2. Experiment tracking and model management/homework/scripts/register_model.py\", line 101, in <module>\nrun(args.data_path, args.top_n)\nFile \"/Users/name/Desktop/Programming/DataTalksClub/MLOps-Zoomcamp/2. Experiment tracking and model management/homework/scripts/register_model.py\", line 67, in run\ntrain_and_log_model(data_path=data_path, params=run.data.params)\nFile \"/Users/name/Desktop/Programming/DataTalksClub/MLOps-Zoomcamp/2. Experiment tracking and model management/homework/scripts/register_model.py\", line 41, in train_and_log_model\nparams = space_eval(SPACE, params)\nFile \"/Users/name/miniconda3/envs/mlops-zoomcamp/lib/python3.9/site-packages/hyperopt/fmin.py\", line 618, in space_eval\nrval = pyll.rec_eval(space, memo=memo)\nFile \"/Users/name/miniconda3/envs/mlops-zoomcamp/lib/python3.9/site-packages/hyperopt/pyll/base.py\", line 902, in rec_eval\nrval = scope._impls[node.name](*args, **kwargs)\nValueError: could not convert string to float: '0 int\\n1 float\\n2 hyperopt_param\\n3 Literal{n_estimators}\\n4 quniform\\n5 Literal{10}\\n6 Literal{50}\\n7 Literal{1}'\nSolution: There are two plausible errors to this. Both are in the hpo.py file where the hyper-parameter tuning is run. The objective function should look like this.\n\n def objective(params):\n# It's important to set the \"with\" statement and the \"log_params\" function here\n# in order to properly log all the runs and parameters.\nwith mlflow.start_run():\n# Log the parameters\nmlflow.log_params(params)\nrf = RandomForestRegressor(**params)\nrf.fit(X_train, y_train)\ny_pred = rf.predict(X_valid)\n# Calculate and log rmse\nrmse = mean_squared_error(y_valid, y_pred, squared=False)\nmlflow.log_metric('rmse', rmse)\nIf you add the with statement before this function, and just after the following line\nX_valid, y_valid = load_pickle(os.path.join(data_path, \"valid.pkl\"))\nand you log the parameters just after the search_space dictionary is defined, like this\nsearch_space = {....}\n# Log the parameters\nmlflow.log_params(search_space)\nThen there is a risk that the parameters will be logged in group. As a result, the\nparams = space_eval(SPACE, params)\nregister_model.py file will receive the parameters in group, while in fact it expects to receive them one by one. 
Thus, make sure that the objective function looks as above.\nAdded by Jakob Salomonsson", "section": "Module 2: Experiment tracking", "question": "Could not convert string to float - ValueError" }, { "text": "Make sure you launch the mlflow UI from the same directory as thec that is running the experiments (same directory that has the mlflow directory and the database that stores the experiments).\nOr navigate to the correct directory when specifying the tracking_uri.\nFor example:\nIf the mlflow.db is in a subdirectory called database, the tracking uri would be \u2018sqllite:///database/mlflow.db\u2019\nIf the mlflow.db is a directory above your current directory: the tracking uri would be \u2018sqlite:///../mlflow.db\u2019\nAnswered by Anna Vasylytsya\nAnother alternative is to use an absolute path to mlflow.db rather than relative path\nAnd yet another alternative is to launch the UI from the same notebook by executing the following code cell\nimport subprocess\nMLFLOW_TRACKING_URI = \"sqlite:///data/mlflow.db\"\nsubprocess.Popen([\"mlflow\", \"ui\", \"--backend-store-uri\", MLFLOW_TRACKING_URI])\nAnd then using the same MLFLOW_TRACKING_URI when initializing mlflow or the client\nclient = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)\nmlflow.set_tracking_uri(MLFLOW_TRACKING_URI)", "section": "Module 2: Experiment tracking", "question": "Experiment not visible in MLflow UI" }, { "text": "Problem:\nGetting\nERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE\nduring MLFlow's installation process, particularly while installing the Numpy package using pip\nWhen I installed mlflow using \u2018pip install mlflow\u2019 on 27th May 2022, I got the following error while numpy was getting installed through mlflow:\n\nCollecting numpy\nDownloading numpy-1.22.4-cp310-cp310-win_amd64.whl (14.7 MB)\n|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588 \t| 6.3 MB 107 kB/s eta 0:01:19\nERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE.\nIf you have updated the package versions, please update the hashes. 
Otherwise, examine the package contents carefully; someone may have tampered with them.\nnumpy from https://files.pythonhosted.org/packages/b5/50/d7978137464251c393df28fe0592fbb968110f752d66f60c7a53f7158076/numpy-1.22.4-cp310-cp310-win_amd64.whl#sha256=3e1ffa4748168e1cc8d3cde93f006fe92b5421396221a02f2274aab6ac83b077 (from mlflow):\nExpected sha256 3e1ffa4748168e1cc8d3cde93f006fe92b5421396221a02f2274aab6ac83b077\nGot \t15e691797dba353af05cf51233aefc4c654ea7ff194b3e7435e6eec321807e90\nSolution:\nThen when I install numpy separately (and not as part of mlflow), numpy gets installed (same version), and then when I do 'pip install mlflow', it also goes through.\nPlease note that the above may not be consistently simulatable, but please be aware of this issue that could occur during pip install of mlflow.\nAdded by Venkat Ramakrishnan", "section": "Module 2: Experiment tracking", "question": "Hash Mismatch Error with Package Installation" }, { "text": "After deleting an experiment from UI, the deleted experiment still persists in the database.\nSolution: To delete this experiment permanently, follow these steps.\nAssuming you are using sqlite database;\nInstall ipython sql using the following command: pip install ipython-sql\nIn your jupyter notebook, load the SQL magic scripts with this: %load_ext sql\nLoad the database with this: %sql sqlite:///nameofdatabase.db\nRun the following SQL script to delete the experiment permanently: check link", "section": "Module 2: Experiment tracking", "question": "How to Delete an Experiment Permanently from MLFlow UI" }, { "text": "Problem: I cloned the public repo, made edits, committed and pushed them to my own repo. Now I want to get the recent commits from the public repo without overwriting my own changes to my own repo. Which command(s) should I use?\nThis is what my config looks like (in case this might be useful):\n[core]\nrepositoryformatversion = 0\nfilemode = true\nbare = false\nlogallrefupdates = true\nignorecase = true\nprecomposeunicode = true\n[remote \"origin\"]\nurl = git@github.com:my_username/mlops-zoomcamp.git\nfetch = +refs/heads/*:refs/remotes/origin/*\n[branch \"main\"]\nremote = origin\nmerge = refs/heads/main\nSolution: You should fork DataClubsTak\u2019s repo instead of cloning it. On GitHub, click \u201cFetch and Merge\u201d under the menu \u201cFetch upstream\u201d at the main page of your own", "section": "Module 2: Experiment tracking", "question": "How to Update Git Public Repo Without Overwriting Changes" }, { "text": "This is caused by ```mlflow.xgboost.autolog()``` when version 1.6.1 of xgboost\nDowngrade to 1.6.0\n```pip install xgboost==1.6.0``` or update requirements file with xgboost==1.6.0 instead of xgboost\nAdded by Nakul Bajaj", "section": "Module 2: Experiment tracking", "question": "Image size of 460x93139 pixels is too large. It must be less than 2^16 in each direction." }, { "text": "Since the version 1.29 the list_experiments method was deprecated and then removed in the later version\nYou should use search_experiments instead\nAdded by Alex Litvinov", "section": "Module 2: Experiment tracking", "question": "MlflowClient object has no attribute 'list_experiments'" }, { "text": "Make sure `mlflow.autolog()` ( or framework-specific autolog ) written BEFORE `with mlflow.start_run()` not after.\nAlso make sure that all dependencies for the autologger are installed, including matplotlib. 
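For reference, a minimal ordering that works (a sketch using the sklearn flavour of autolog and a toy model; your own training code goes inside the run):
```
import mlflow
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # assumed local tracking backend
mlflow.sklearn.autolog()                        # enable autologging BEFORE starting the run

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
with mlflow.start_run():
    LinearRegression().fit(X, y)                # params, metrics and the model are logged automatically
```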
A warning about uninstalled dependencies will be raised.\nMohammed Ayoub Chettouh", "section": "Module 2: Experiment tracking", "question": "MLflow Autolog not working" }, { "text": "If you\u2019re running MLflow on a remote VM, you need to forward the port too like we did in Module 1 for Jupyter notebook port 8888. Simply connect your server to VS Code, as we did, and add 5000 to the PORT like in the screenshot:\nAdded by Sharon Ibejih\nIf you are running MLflow locally and 127.0.0.1:5000 shows a blank page navigate to localhost:5000 instead.", "section": "Module 2: Experiment tracking", "question": "MLflow URL (http://127.0.0.1:5000), doesn\u2019t open." }, { "text": "Got the same warning message as Warrie Warrie when using \u201cmlflow.xgboost.autolog()\u201d\nIt turned out that this was just a warning message and upon checking MLflow UI (making sure that no \u201ctag\u201d filters were included), the model was actually automatically tracked in the MLflow.\nAdded by Bengsoon Chuah, Asked by Warrie Warrie, Answered by Anna Vasylytsya & Ivan Starovit", "section": "Module 2: Experiment tracking", "question": "MLflow.xgboost Autolog Model Signature Failure" }, { "text": "mlflow.exceptions.MlflowException: Cannot set a deleted experiment 'cross-sell' as the active experiment. You can restore the experiment, or permanently delete the experiment to create a new one.\nThere are many options to solve in this link: https://stackoverflow.com/questions/60088889/how-do-you-permanently-delete-an-experiment-in-mlflow", "section": "Module 2: Experiment tracking", "question": "MlflowException: Unable to Set a Deleted Experiment" }, { "text": "You do not have enough disk space to install the requirements. You can either increase the base EBS volume by following this link or add an external disk to your instance and configure conda installation to happen on the external disk.\nAbinaya Mahendiran\nOn GCP: I added another disk to my vm and followed this guide to mount the disk. Confirm the mount by running df -H (disk free) command in bash shell. I also deleted Anaconda and instead used miniconda. I downloaded miniconda in the additional disk that I mounted and when installing miniconda, enter the path to the extra disk instead of the default disk, this way conda is installed on the extra disk.\nYang Cao", "section": "Module 2: Experiment tracking", "question": "No Space Left on Device - OSError[Errno 28]" }, { "text": "I was using an old version of sklearn due to which I got the wrong number of parameters because in the latest version min_impurity_split for randomforrestRegressor was deprecated. Had to upgrade to the latest version to get the correct number of params.", "section": "Module 2: Experiment tracking", "question": "Parameters Mismatch in Homework Q3" }, { "text": "Error: I installed all the libraries from the requirements.txt document in a new environment as follows:\npip install -r requirementes.txt\nThen when I run mlflow from my terminal like this:\nmlflow\nI get this error:\nSOLUTION: You need to downgrade the version of 'protobuf' module to 3.20.x or lower. Initially, it was version=4.21, I installed protobuf==3.20\npip install protobuf==3.20\nAfter which I was able to run mlflow from my terminal.\n-Submitted by Aashnna Soni", "section": "Module 2: Experiment tracking", "question": "Protobuf error when installing MLflow" }, { "text": "Please check your current directory while running the mlflow ui command. 
You need to run mlflow ui or mlflow server command in the right directory.", "section": "Module 2: Experiment tracking", "question": "Setting up Artifacts folders" }, { "text": "If you have problem with setting up MLflow for experiment tracking on GCP, you can check these two links:\nhttps://kargarisaac.github.io/blog/mlops/data%20engineering/2022/06/15/MLFlow-on-GCP.html\nhttps://kargarisaac.github.io/blog/mlops/2022/08/26/machine-learning-workflow-orchestration-zenml.html", "section": "Module 2: Experiment tracking", "question": "Setting up MLflow experiment tracker on GCP" }, { "text": "Solution: Downgrade setuptools (I downgraded 62.3.2 -> 49.1.0)", "section": "Module 2: Experiment tracking", "question": "Setuptools Replacing Distutils - MLflow Autolog Warning" }, { "text": "I can\u2019t sort runs in MLFlow\nMake sure you are in table view (not list view) in the MLflow UI.\nAdded and Answered by Anna Vasylytsya", "section": "Module 2: Experiment tracking", "question": "Sorting runs in MLflow UI" }, { "text": "Problem: When I ran `$ mlflow ui` on a remote server and try to open it in my local browser I got an exception and the page with mlflow ui wasn\u2019t loaded.\nSolution: You should `pip uninstall flask` on your remote server on conda env and after it install Flask `pip install Flask`. It is because the base conda env has ~flask<1.2, and when you clone it to your new work env, you are stuck with this old version.\nAdded by Salimov Ilnaz", "section": "Module 2: Experiment tracking", "question": "TypeError: send_file() unexpected keyword 'max_age' during MLflow UI Launch" }, { "text": "Problem: After successfully installing mlflow using pip install mlflow on my Windows system, I am trying to run the mlflow ui command but it throws the following error:\nFileNotFoundError: [WinError 2] The system cannot find the file specified\nSolution: Add C:\\Users\\{User_Name}\\AppData\\Roaming\\Python\\Python39\\Scripts to the PATH\nAdded by Alex Litvinov", "section": "Module 2: Experiment tracking", "question": "mlflow ui on Windows FileNotFoundError: [WinError 2] The system cannot find the file specified" }, { "text": "Running \u201cpython hpo.py --data_path=./your-path --max_evals=50\u201d for the homework leads to the following error: TypeError: unsupported operand type(s) for -: 'str' and 'int'\nFull Traceback:\nFile \"~/repos/mlops/02-experiment-tracking/homework/hpo.py\", line 73, in <module>\nrun(args.data_path, args.max_evals)\nFile \"~/repos/mlops/02-experiment-tracking/homework/hpo.py\", line 47, in run\nfmin(\nFile \"~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/fmin.py\", line 540, in fmin\nreturn trials.fmin(\nFile \"~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/base.py\", line 671, in fmin\nreturn fmin(\nFile \"~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/fmin.py\", line 586, in fmin\nrval.exhaust()\nFile \"~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/fmin.py\", line 364, in exhaust\nself.run(self.max_evals - n_done, block_until_done=self.asynchronous)\nTypeError: unsupported operand type(s) for -: 'str' and 'int'\nSolution:\nThe --max_evals argument in hpo.py has no defined datatype and will therefore implicitly be treated as string. It should be an integer, so that the script can work correctly. 
Add type=int to the argument definition:\nparser.add_argument(\n\"--max_evals\",\ntype=int,\ndefault=50,\nhelp=\"the number of parameter evaluations for the optimizer to explore.\"\n)", "section": "Module 2: Experiment tracking", "question": "Unsupported Operand Type Error in hpo.py" }, { "text": "Getting the following warning when running mlflow.sklearn:\n\n2022/05/28 04:36:36 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of sklearn. If you encounter errors during autologging, try upgrading / downgrading sklearn to a supported version, or try upgrading MLflow. [\u2026]\nSolution: use 0.22.1 <= scikit-learn <= 1.1.0\nReference: https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html", "section": "Module 2: Experiment tracking", "question": "Unsupported Scikit-Learn version" }, { "text": "Problem: CLI commands (mlflow experiments list) do not return experiments\nSolution description: need to set environment variable for the Tracking URI:\n$ export MLFLOW_TRACKING_URI=http://127.0.0.1:5000\nAdded and Answered by Dino Vitale", "section": "Module 2: Experiment tracking", "question": "Mlflow CLI does not return experiments" }, { "text": "Problem: After starting the tracking server, when we try to use the mlflow cli commands as listed here, most of them can\u2019t seem to find the experiments that have been run with the tracking server\nSolution: We need to set the environment variable MLFLOW_TRACKING_URI to the URI of the sqlite database. This is something like \u201cexport MLFLOW_TRACKING_URI=sqlite:///{path to sqlite database}\u201d . After this, we can view the experiments from the command line using commands like \u201cmlflow experiments search\u201d\nEven after this commands like \u201cmlflow gc\u201d doesn\u2019t seem to get the tracking uri, and they have to be passed explicitly as an argument every time the command is run.\nAhmed Fahim (afahim03@yahoo.com)", "section": "Module 2: Experiment tracking", "question": "Viewing MLflow Experiments using MLflow CLI" }, { "text": "All the experiment and other tracking information in mlflow are stored in sqllite database provided while initiating the mlflow ui command. This database can be inspected using Pycharm\u2019s Database tab by using the SQLLite database type. Once the connection is created as below, the tables can be queried and inspected using regular SQL. The same applies for any SQL backed database such as postgres as well.\nThis is very useful to understand the entity structure of the data being stored within mlflow and useful for any kind of systematic archiving of model tracking for longer periods.\nAdded by Senthilkumar Gopal", "section": "Module 2: Experiment tracking", "question": "Viewing SQLlite Data Raw & Deleting Experiments Manually" }, { "text": "Solution : It is another way to start it for remote hosting a mlflow server. For example, if you are multiple colleagues working together on something you most likely would not run mlflow on one laptop but rather everyone would connect to the same server running mlflow\nAnswer by Christoffer Added by Akshit Miglani (akshit.miglani09@gmail.com)", "section": "Module 2: Experiment tracking", "question": "What does launching the tracking server locally mean?" }, { "text": "Problem: parameter was not recognized during the model registry\nSolution: parameters should be added in previous to the model registry. 
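Roughly like this (a sketch; the params dictionary here is a made-up example of your model's hyperparameters):
```
import mlflow

params = {"max_depth": 10, "n_estimators": 50}  # hypothetical hyperparameters

with mlflow.start_run():
    mlflow.log_params(params)  # log them inside the training run, before registering the model
    # ... train, log and register the model here ...
```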
As the sketch shows, the parameters can be added with mlflow.log_params(params) so that the dictionary is attached directly to run.data.params.\nAdded and Answered by Sam Lim", "section": "Module 2: Experiment tracking", "question": "Parameter adding in case of max_depth not recognized" }, { "text": "Problem: Max_depth is not recognized even when I add mlflow.log_params\nSolution: mlflow.log_params(params) should be added to the hpo.py script, but if you run it as is, it will append the new model to the previous run that doesn\u2019t contain the parameters; you should either remove the previous experiment or change it\nPastor Soto", "section": "Module 2: Experiment tracking", "question": "Max_depth is not recognized even when I add the mlflow.log_params" }, { "text": "Problem: About the week_2 homework: the register_model.py script, when I copy it into a Jupyter notebook, fails and spits out the following error: AttributeError: 'tuple' object has no attribute 'tb_frame'\nSolution: remove the click decorators", "section": "Module 2: Experiment tracking", "question": "AttributeError: 'tuple' object has no attribute 'tb_frame'" }, { "text": "Problem: when running the preprocess_data.py file you get the following error:\n\nwandb: ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key])\nSolution: Go to your WandB profile (top RHS) \u2192 user settings \u2192 scroll down to \u201cDanger Zone\u201d and copy your API key.\n\nThen, before running preprocess_data.py, add and run the following cell in your notebook:\n\n%%bash\n\nwandb login <YOUR_API_KEY_HERE>\nAdded and Answered by James Gammerman (jgammerman@gmail.com)", "section": "Module 2: Experiment tracking", "question": "WandB API error" }, { "text": "Please make sure you follow the order below and enable autologging before constructing the dataset. If you still have this issue, check that your data is in a format compatible with XGBoost.\n# Enable MLflow autologging for XGBoost\nmlflow.xgboost.autolog()\n# Construct your dataset\nX_train, y_train = ...\n# Train your XGBoost model\nmodel = xgb.XGBRegressor(...)\nmodel.fit(X_train, y_train)\nAdded by Olga Rudakova", "section": "Module 2: Experiment tracking", "question": "WARNING mlflow.xgboost: Failed to infer model signature: could not sample data to infer model signature: please ensure that autologging is enabled before constructing the dataset." }, { "text": "Problem\nUsing the wget command to download data or python scripts on Windows: I am using the notebook provided by Visual Studio and, despite having a python virtual env, it did not recognize the pip command.\nSolution: Use python -m pip; the same goes for any other command, i.e. python -m wget\nAdded by Erick Calderin", "section": "Module 2: Experiment tracking", "question": "wget not working" }, { "text": "Problem: Open/run github notebook(.ipynb) directly in Google Colab\nSolution: Change the domain from 'github.com' to 'githubtocolab.com'. 
The notebook will open in Google Colab.\nOnly works with Public repo.\nAdded by Ming Jun\nNavigating in Wandb UI became difficult to me, I had to intuit some options until I found the correct one.\nSolution: Go to the official doc.\nAdded by Erick Calderin", "section": "Module 2: Experiment tracking", "question": "Open/run github notebook(.ipynb) directly in Google Colab" }, { "text": "Problem: Someone asked why we are using this type of split approach instead of just a random split.\nSolution: For example, I have some models at work that train on Jan 1 2020 \u2014 Aug 1 2021 time period, and then test on Aug 1 - Dec 31 2021, and finally validate on Jan - March or something\nWe do these \u201cout of time\u201d validations to do a few things:\nCheck for seasonality of our data\nWe know if the RMSE for Test is 5 say, and then RMSE for validation is 20, then there\u2019s serious seasonality to the data we are looking at, and now we might change to Time Series approaches\nIf I\u2019m predicting on Mar 30 2023 the outcomes for the next 3 months, the \u201crandom sample\u201d in our train/test would have caused data leakage, overfitting, and poor model performance in production. We mustn\u2019t take information about the future and apply it to the present when we are predicting in a model context.\nThese are two of, I think, the biggest points for why we are doing jan/feb/march. I wouldn\u2019t do it any other way.\nTrain: Jan\nTest: Feb\nValidate: March\nThe point of validation is to report out model metrics to leadership, regulators, auditors, and record the models performance to then later analyze target drift\nAdded by Sam LaFell\nProblem: If you get an error while trying to run the mlflow server on AWS CLI with S3 bucket and POSTGRES database:\nReproducible Command:\nmlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://<DB_USERNAME>:<DB_PASSWORD>@<DB_ENDPOINT>:<DB_PORT>/<DB_NAME> --default-artifact-root s3://<BUCKET_NAME>\nError:\n\"urllib3 v2.0 only supports OpenSSL 1.1.1+, currently \"\nImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'OpenSSL 1.0.2k-fips 26 Jan 2017'. See: https://github.com/urllib3/urllib3/issues/2168\nSolution: Upgrade mlflow using\nCode: pip3 install --upgrade mlflow\nResolution: It downgrades urllib3 2.0.3 to 1.26.16 which is compatible with mlflow and ssl 1.0.2\nInstalling collected packages: urllib3\nAttempting uninstall: urllib3\nFound existing installation: urllib3 2.0.3\nUninstalling urllib3-2.0.3:\nSuccessfully uninstalled urllib3-2.0.3\nSuccessfully installed urllib3-1.26.16\nAdded by Sarvesh Thakur", "section": "Module 3: Orchestration", "question": "Why do we use Jan/Feb/March for Train/Test/Validation Purposes?" }, { "text": "Problem description\nSolution description\n(optional) Added by Name", "section": "Module 3: Orchestration", "question": "Problem title" }, { "text": "Here", "section": "Module 4: Deployment", "question": "Where is the FAQ for Prefect questions?" }, { "text": "Windows with AWS CLI already installed\nAWS CLI version:\naws-cli/2.4.24 Python/3.8.8 Windows/10 exe/AMD64 prompt/off\nExecuting\n$(aws ecr get-login --no-include-email)\nshows error\naws.exe: error: argument operation: Invalid choice, valid choices are\u2026\nUse this command instead. 
More info here:\nhttps://docs.aws.amazon.com/cli/latest/reference/ecr/get-login-password.html\naws ecr get-login-password \\\n--region <region> \\\n| docker login \\\n--username AWS \\\n--password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com\nAdded by MarcosMJD", "section": "Module 4: Deployment", "question": "aws.exe: error: argument operation: Invalid choice \u2014 Docker can not login to ECR." }, { "text": "Use ` at the end of each line except the last. Note that a multiline string does not need the backtick.\nEscape \" as \\\" .\nUse $env: to create env vars (non-persistent). E.g.:\n$env:KINESIS_STREAM_INPUT=\"ride_events\"\naws kinesis put-record --cli-binary-format raw-in-base64-out `\n--stream-name $env:KINESIS_STREAM_INPUT `\n--partition-key 1 `\n--data '{\n\\\"ride\\\": {\n\\\"PULocationID\\\": 130,\n\\\"DOLocationID\\\": 205,\n\\\"trip_distance\\\": 3.66\n},\n\\\"ride_id\\\": 156\n}'\nAdded by MarcosMJD", "section": "Module 4: Deployment", "question": "Multiline commands in Windows Powershell" }, { "text": "If you get pipenv failures for the pipenv install command -\nAttributeError: module 'collections' has no attribute 'MutableMapping'\nIt happens because you are using the system Python (3.10) for pipenv.\nIf you previously installed pipenv with apt-get, remove it - sudo apt remove pipenv\nMake sure you have a non-system Python installed in your environment. The easiest way to do it is to install anaconda or miniconda.\nNext, install pipenv into your non-system Python. If you use the setup from the lectures, it\u2019s just this: pip install pipenv\nNow re-run pipenv install XXXX (relevant dependencies) - should work\nTested and worked on an AWS instance, similar to the config Alexey presented in class.\nAdded by Daniel Hen", "section": "Module 4: Deployment", "question": "Pipenv installation not working (AttributeError: module 'collections' has no attribute 'MutableMapping')" }, { "text": "First check if the SSL module is configured with the following command:\npython -m ssl\n\nIf the output of this is empty, there is no problem with the SSL configuration.\n\nThen you should upgrade the pipenv package in your current environment to resolve the problem.\nAdded by Kenan Arslanbay", "section": "Module 4: Deployment", "question": "module is not available (Can't connect to HTTPS URL)" }, { "text": "During scikit-learn installation via the command:\npipenv install scikit-learn==1.0.2\nThe following error is raised:\nModuleNotFoundError: No module named 'pip._vendor.six'\nThen, one should:\nsudo apt install python-six\npipenv --rm\npipenv install scikit-learn==1.0.2\nAdded by Giovanni Pecoraro", "section": "Module 4: Deployment", "question": "No module named 'pip._vendor.six'" }, { "text": "Problem description. How can we use Jupyter notebooks with the Pipenv environment?\nSolution: Refer to this stackoverflow question. Basically, install jupyter and ipykernel using pipenv, and then register the kernel with `python -m ipykernel install --user --name=my-virtualenv-name` inside the Pipenv shell. If you are using Jupyter notebooks in VS Code, doing this will also add the virtual environment to the list of kernels.\nAdded by Ron Medina", "section": "Module 4: Deployment", "question": "Pipenv with Jupyter" }, { "text": "Problem: I tried to run the starter notebook in a pipenv environment but had issues with no output from prints. 
\nI used scikit-learn==1.2.2 and python==3.10\nTornado version was 6.3.2\n\nSolution: The error you're encountering seems to be a bug related to Tornado, which is a Python web server and networking library. It's used by Jupyter under the hood to handle networking tasks.\nDowngrading to tornado==6.1 fixed the issue.\nhttps://stackoverflow.com/questions/54971836/no-output-jupyter-notebook", "section": "Module 4: Deployment", "question": "Pipenv with Jupyter no output" }, { "text": "Problem description: You might get an error \u2018Invalid base64\u2019 after running the \u2018aws kinesis put-record\u2019 command on your local machine. This might be the case if you are using AWS CLI version 2 (note that in video 4.4, around 57:42, you can see a warning since the instructor is using v1 of the CLI).\nSolution description: To get around this, pass the argument \u2018--cli-binary-format raw-in-base64-out\u2019. This will encode your data string into base64 before passing it to kinesis.\nAdded by M", "section": "Module 4: Deployment", "question": "\u2018Invalid base64\u2019 error after running `aws kinesis put-record`" }, { "text": "Problem description: Running starter.ipynb in the homework\u2019s Q1 throws this error.\nSolution description: Update pandas (the pandas version was actually already the latest, but several dependencies get updated).\nAdded by Marcos Jimenez", "section": "Module 4: Deployment", "question": "Error index 311297 is out of bounds for axis 0 with size 131483 when loading parquet file." }, { "text": "Use the command $ pipenv lock to force the creation of Pipfile.lock\nAdded by Bijay P.", "section": "Module 4: Deployment", "question": "Pipfile.lock was not created along with Pipfile" }, { "text": "This issue is usually due to the pythonfinder module in pipenv.\nThe solution to this involves manually changing the scripts as described here: python_finder_fix\nAdded by Ridwan Amure", "section": "Module 4: Deployment", "question": "Permission Denied using Pipenv" }, { "text": "When passing arguments to a script via the command line and converting one to a 4-digit number using f'{year:04d}', this error showed up.\nThis happens because all inputs from the command line are read as strings by the script. They need to be converted to numeric/integer before formatting them in the f-string.\nyear = int(sys.argv[1])\nf'{year:04d}'\nIf you use the click library, just edit the decorator:\n@click.command()\n@click.option( \"--year\", help=\"Year for evaluation\", type=int)\ndef your_function(year):\n<<Your code>>\nAdded by Taras Sh", "section": "Module 4: Deployment", "question": "Error while parsing arguments via CLI [ValueError: Unknown format code 'd' for object of type 'str']" }, { "text": "Ensure the correct base image is being used to derive from.\nCopy the data from local into the docker image using the COPY command, to a relative path. 
Using absolute paths within the image might be troublesome.\nUse paths starting from /app, and don\u2019t forget to do WORKDIR /app before actually executing the code.\nMost common commands:\nBuild the container using docker build -t mlops-learn .\nExecute the script using docker run -it --rm mlops-learn\n<mlops-learn> is just a name used for the image and has no special significance.", "section": "Module 4: Deployment", "question": "Dockerizing tips" }, { "text": "If you are trying to run Flask (gunicorn) & an MLFlow server from the same container, defining both in the Dockerfile with CMD will only run MLFlow & not Flask.\nSolution: Create a separate shell script with the server run command, e.g.:\n> \tscript1.sh\n#!/bin/bash\ngunicorn --bind=0.0.0.0:9696 predict:app\nAnother script with e.g. the MLFlow server:\n>\tscript2.sh\n#!/bin/bash\nmlflow server -h 0.0.0.0 -p 5000 --backend-store-uri=sqlite:///mlflow.db --default-artifact-root=g3://zc-bucket/mlruns/\nCreate a wrapper script to run the above 2 scripts:\n>\twrapper_script.sh\n#!/bin/bash\n# Start the first process\n./script1.sh &\n# Start the second process\n./script2.sh &\n# Wait for any process to exit\nwait -n\n# Exit with status of process that exited first\nexit $?\nGive executable permissions to all scripts:\nchmod +x *.sh\nNow we can define the last line of the Dockerfile as:\n> \tDockerfile\nCMD ./wrapper_script.sh\nDon\u2019t forget to expose all ports defined by the services!", "section": "Module 4: Deployment", "question": "Running multiple services in a Docker container" }, { "text": "Problem description: cannot generate Pipfile.lock; raises InstallationError (pip9.exceptions.InstallationError): Command \"python setup.py egg_info\" failed with error code 1\nSolution: you need to force-upgrade wheel and pipenv.\nJust run this command:\npip install --user --upgrade --upgrade-strategy eager pipenv wheel", "section": "Module 4: Deployment", "question": "Cannot generate pipfile.lock raise InstallationError( pip9.exceptions.InstallationError)" }, { "text": "Problem description. How can we connect an s3 bucket to MLflow?\nSolution: Use boto3 and the AWS CLI to store access keys. The access keys are what boto3 (AWS' Python API tool) uses to connect to the AWS servers. If there are no Access Keys, how can AWS make sure that you have the right to access this Bucket? Maybe you're a malicious actor (a hacker, for example). The keys must be present for boto3 to talk to the AWS servers, and they will provide access to the Bucket if you possess the right permissions. You can always set the Bucket as public so anyone can access it; then you don't need access keys because AWS won't care.\nRead more here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html\nAdded by Akshit Miglani", "section": "Module 4: Deployment", "question": "Connecting s3 bucket to MLFLOW" }, { "text": "This can happen even though the upload works using the aws cli and boto3 in a Jupyter notebook.\nSolution: set the AWS_PROFILE environment variable (the default profile is called default)", "section": "Module 4: Deployment", "question": "Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\"" }, { "text": "Problem description: lib_lightgbm.so Reason: image not found\nSolution description: Add \u201cRUN apt-get install libgomp1\u201d to your docker. 
(change installer command based on OS)\nAdded by Kazeem Hakeem", "section": "Module 4: Deployment", "question": "Dockerizing lightgbm" }, { "text": "When the request is processed in lambda function, mlflow library raises:\n2022/09/19 21:18:47 WARNING mlflow.pyfunc: Encountered an unexpected error (AttributeError(\"module 'dataclasses' has no attribute '__version__'\")) while detecting model dependency mismatches. Set logging level to DEBUG to see the full traceback.\nSolution: Increase the memory of the lambda function.\nAdded by MarcosMJD", "section": "Module 4: Deployment", "question": "Error raised when executing mlflow\u2019s pyfunc.load_model in lambda function." }, { "text": "Just a note if you are following the video but also using the repo\u2019s notebook The notebook is the end state of the video which eventually uses mlflow pipelines.\nJust watch the video and be patient. Everything will work :)\nAdded by Quinn Avila", "section": "Module 4: Deployment", "question": "4.3 FYI Notebook is end state of Video -" }, { "text": "Problem description: I was having issues because my python script was not reading AWS credentials from env vars, after building the image I was running it like this:\ndocker run -it homework-04 -e AWS_ACCESS_KEY_ID=xxxxxxxx -e AWS_SECRET_ACCESS_KEY=xxxxxx\nSolution 1:\n\nEnvironment Variables: \nYou can set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN (if you are using AWS STS) environment variables. You can set these in your shell, or you can include them in your Docker run command like this:\nI found out by myself that those variables must be passed before specifying the name of the image, as follow:\ndocker run -e AWS_ACCESS_KEY_ID=xxxxxxxx -e AWS_SECRET_ACCESS_KEY=xxxxxx -it homework-04\nAdded by Erick Cal\nSolution 2 (if AWS credentials were not found):\nAWS Configuration Files: \nThe AWS SDKs and CLI will check the ~/.aws/credentials and ~/.aws/config files for credentials if they exist. You can map these files into your Docker container using volumes:\n\ndocker run -it --rm -v ~/.aws:/root/.aws homework:v1", "section": "Module 4: Deployment", "question": "Passing envs to my docker image" }, { "text": "If anyone is troubleshooting or just interested in seeing the model listed on the image svizor/zoomcamp-model:mlops-3.10.0-slim.\nCreate a dockerfile. (yep thats all) and build \u201cdocker build -t zoomcamp_test .\u201d\nFROM svizor/zoomcamp-model:mlops-3.10.0-slim\nRun \u201cdocker run -it zoomcamp_test ls /app\u201d output -> model.bin\nThis will list the contents of the app directory and \u201cmodel.bin\u201d should output. With this you could just copy your files, for example \u201ccopy myfile .\u201d maybe a requirements file and this can be run for example \u201cdocker run -it myimage myscript arg1 arg2 \u201d. Of course keep in mind a build is needed everytime you change the Dockerfile.\nAnother variation is to have it run when you run the docker file.\n\u201c\u201d\u201d\nFROM svizor/zoomcamp-model:mlops-3.10.0-slim\nWORKDIR /app\nCMD ls\n\u201c\u201d\u201d\nJust keep in mind CMD is needed because the RUN commands are used for building the image and the CMD is used at container runtime. And in your example you probably want to run a script or should we say CMD a script.\nQuinn Avila", "section": "Module 4: Deployment", "question": "How to see the model in the docker container in app/?" 
}, { "text": "To resolve this make sure to build the docker image with the platform tag, like this:\n\u201cdocker build -t homework:v1 --platform=linux/arm64 .\u201d", "section": "Module 4: Deployment", "question": "WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested" }, { "text": "Solution: instead of input_file = f'https://s3.amazonaws.com/nyc-tlc/trip+data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet' use input_file = f'https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'\nIlnaz Salimov\nsalimovilnaz777@gmail.com", "section": "Module 4: Deployment", "question": "HTTPError: HTTP Error 403: Forbidden when call apply_model() in score.ipynb" }, { "text": "i'm getting this error ModuleNotFoundError: No module named 'pipenv.patched.pip._vendor.urllib3.response'\nand Resolved from this command pip install pipenv --force-reinstall\ngetting this errror site-packages\\pipenv\\patched\\pip\\_vendor\\urllib3\\connectionpool.py\"\nResolved from this command pip install -U pip and pip install requests\nAsif", "section": "Module 5: Monitoring", "question": "ModuleNotFoundError: No module named 'pipenv.patched.pip._vendor.urllib3.response'" }, { "text": "Problem description: When running docker-compose up as shown in the video 5.2 if you go to http://localhost:3000/ you get asked for a username and a password.\nSolution: for both of them the default is \u201cadmin\u201d. Then you can enter your new password. \nSee also here\nAdded by JaimeRV", "section": "Module 5: Monitoring", "question": "Login window in Grafana" }, { "text": "Problem Description : In Linux, when starting services using docker compose up --build as shown in video 5.2, the services won\u2019t start and instead we get message unknown flag: --build in command prompt.\nSolution : Since we install docker-compose separately in Linux, we have to run docker-compose up --build instead of docker compose up --build\nAdded by Ashish Lalchandani", "section": "Module 5: Monitoring", "question": "Error in starting monitoring services in Linux" }, { "text": "Problem: When running prepare.py getting KeyError: \u2018content-length\u2019\nSolution: From Emeli Dral:\nIt seems to me that the link we used in prepare.py to download taxi data does not work anymore. I substituted the instruction:\nurl = f\"https://nyc-tlc.s3.amazonaws.com/trip+data/{file}\nby the\nurl = f\"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}\"\nin the prepare.py and it worked for me. Hopefully, if you do the same you will be able to get those data.", "section": "Module 5: Monitoring", "question": "KeyError \u2018content-length\u2019 when running prepare.py" }, { "text": "Problem description\nWhen I run the command \u201cdocker-compose up \u2013build\u201d and send the data to the real-time prediction service. The service will return \u201cMax retries exceeded with url: /api\u201d.\nIn my case it because of my evidently service exit with code 2 due to the \u201capp.py\u201d in evidently service cannot import \u201cfrom pyarrow import parquet as pq\u201d.\nSolution description\nThe first solution is just install the pyarrow module \u201cpip install pyarrow\u201d\nThe second solution is restart your machine.\nThe third solution is if the first and second one didn\u2019t work with your machine. I found that \u201capp.py\u201d of evidently service didn\u2019t use that module. 
So comment the pyarrow module out and the problem was solved for me.\nAdded by Surawut Jirasaktavee", "section": "Module 5: Monitoring", "question": "Evidently service exit with code 2" }, { "text": "If you get this error when using evidently:\nYou probably forgot to add parentheses (); just add an opening and a closing one and you are good to go.\nQuinn Avila", "section": "Module 5: Monitoring", "question": "ValueError: Incorrect item instead of a metric or metric preset was passed to Report" }, { "text": "You will get an error if you didn\u2019t add a target=\u2019duration_min\u2019.\nIf you want to use RegressionQualityMetric() you need a target=\u2019duration_min\u2019, and you need it added to your current_data[\u2018duration_min\u2019]\nQuinn Avila", "section": "Module 5: Monitoring", "question": "For the report RegressionQualityMetric()" }, { "text": "Problem description\nValueError: Found array with 0 sample(s) (shape=(0, 6)) while a minimum of 1 is required by LinearRegression.\nSolution description\nThis happens because the generated data is based on an earlier date, therefore the training dataset would be empty.\nAdjust the following:\nbegin = datetime.datetime(202X, X, X, 0, 0)\nAdded by Luke", "section": "Module 5: Monitoring", "question": "Found array with 0 sample(s)" }, { "text": "Problem description\nGetting \u201ctarget columns\u201d / \u201cprediction columns\u201d not present errors after adding a metric\nSolution description\nMake sure to read through the documentation on what is required or optional when adding the metric. I added DatasetCorrelationsMetric, which doesn\u2019t require any parameters because the metric evaluates correlations among the features.\nSam Lim", "section": "Module 5: Monitoring", "question": "Adding additional metric" }, { "text": "When you try to log in to Grafana with the standard credentials (admin/admin) it throws an error.\nAfter running grafana-cli admin reset-admin-password admin in the Grafana container, the problem will be fixed.\nAdded by Artem Glazkov", "section": "Module 5: Monitoring", "question": "Standard login in Grafana does not work" }, { "text": "Problem description. While my metric generation script was still running, I noticed that the charts in Grafana don\u2019t get updated.\nSolution description. There are two things to pay attention to:\nRefresh interval: set it to a small value: 5-10-30 seconds\nUse your local timezone in the call to `pytz.timezone` \u2013 I couldn\u2019t get updates before changing this from the original value \u201cEurope/London\u201d to my own zone", "section": "Module 5: Monitoring", "question": "The chart in Grafana doesn\u2019t get updates" }, { "text": "Problem description. The Prefect server was not running locally; I ran the `prefect server start` command but it stopped immediately.\nSolution description. I used Prefect Cloud to run the script; however, I created an issue on the Prefect GitHub.\nBy Erick Calderin", "section": "Module 5: Monitoring", "question": "Prefect server was not running locally" }, { "text": "Solution. Using the docker CLI, run docker system prune to remove unused things (build cache, containers, images, etc.)\nAlso, to see what\u2019s taking space before pruning, you can run docker system df\nBy Alex Litvinov", "section": "Module 5: Monitoring", "question": "no disk space left error when doing docker compose up" }, { "text": "Problem: when running docker-compose up --build, you may see this error. 
To solve, add `command: php -S 0.0.0.0:8080 -t /var/www/html` in adminer block in yml file like:\nadminer:\ncommand: php -S 0.0.0.0:8080 -t /var/www/html\nimage: adminer\n\u2026\nIlnaz Salimov\nsalimovilnaz777@gmail.com", "section": "Module 5: Monitoring", "question": "Failed to listen on :::8080 (reason: php_network_getaddresses: getaddrinfo failed: Address family for hostname not supported)" }, { "text": "Problem: Can we generate charts like Evidently inside Grafana?\nSolution: In Grafana that would be a stat panel (just a number) and scatter plot panel (I believe it requires a plug-in). However, there is no native way to quickly recreate this exact Evidently dashboard. You'd need to make sure you have all the relevant information logged to your Grafana data source, and then design your own plots in Grafana.\nIf you want to recreate the Evidently visualizations externally, you can export the Evidently output in JSON with include_render=True\n(more details here https://docs.evidentlyai.com/user-guide/customization/json-dict-output) and then parse information from it for your external visualization layer. To include everything you need for non-aggregated visuals, you should also add \"raw_data\": True option (more details here https://docs.evidentlyai.com/user-guide/customization/report-data-aggregation).\nOverall, this specific plot with under- and over-performance segments is more useful during debugging, so might be easier to access it ad hoc using Evidently.\nAdded by Ming Jun, Asked by Luke, Answered by Elena Samuylova", "section": "Module 6: Best practices", "question": "Generate Evidently Chart in Grafana" }, { "text": "You may get an error \u2018{'errorMessage': 'Unable to locate credentials', \u2026\u2019 from the print statement in test_docker.py after running localstack with kinesis.\nTo fix this, in the docker-compose.yaml file, in addition to the environment variables like AWS_DEFAULT_REGION, add two other variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Their value is not important; anything like abc will suffice\nAdded by M\nOther possibility is just to run\naws --endpoint-url http://localhost:4566 configure\nAnd providing random values for AWS Access Key ID , AWS Secret Access Key, Default region name, and Default output format.\nAdded by M.A. 
Monjas", "section": "Module 6: Best practices", "question": "Get an error \u2018Unable to locate credentials\u2019 after running localstack with kinesis" }, { "text": "You may get an error while creating a bucket with localstack and the boto3 client:\nbotocore.exceptions.ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to.\nTo fix this, instead of creating a bucket via\ns3_client.create_bucket(Bucket='nyc-duration')\nCreate it with\ns3_client.create_bucket(Bucket='nyc-duration', CreateBucketConfiguration={\n'LocationConstraint': AWS_DEFAULT_REGION})\nyam\nAdded by M", "section": "Module 6: Best practices", "question": "Get an error \u2018 unspecified location constraint is incompatible \u2019" }, { "text": "When executing an AWS CLI command (e.g., aws s3 ls), you can get the error <botocore.awsrequest.AWSRequest object at 0x7fbaf2666280>.\nTo fix it, simply set the AWS CLI environment variables:\nexport AWS_DEFAULT_REGION=eu-west-1\nexport AWS_ACCESS_KEY_ID=foobar\nexport AWS_SECRET_ACCESS_KEY=foobar\nTheir value is not important; anything would be ok.\nAdded by Giovanni Pecoraro", "section": "Module 6: Best practices", "question": "Get an error \u201c<botocore.awsrequest.AWSRequest object at 0x7fbaf2666280>\u201d after running an AWS CLI command" }, { "text": "At every commit the above error is thrown and no pre-commit hooks are ran.\nMake sure the indentation in .pre-commit-config.yaml is correct. Especially the 4 spaces ahead of every `repo` statement\nAdded by M. Ayoub C.", "section": "Module 6: Best practices", "question": "Pre-commit triggers an error at every commit: \u201cmapping values are not allowed in this context\u201d" }, { "text": "No option to remove pytest test\nRemove .vscode folder located on the folder you previously used for testing, e.g. 
folder code (from week6-best-practices) was chosen for testing, so you may remove the .vscode inside that folder.\nAdded by Rizdi Aprilian", "section": "Module 6: Best practices", "question": "Could not reconfigure pytest from zero after getting done with previous folder" }, { "text": "Problem description\nFollowing video 6.3, at minute 11:23, the get-records command returns empty Records.\nSolution description\nAdd --no-sign-request to the Kinesis get-records call:\n aws --endpoint-url=http://localhost:4566 kinesis get-records --shard-iterator [\u2026] --no-sign-request", "section": "Module 6: Best practices", "question": "Empty Records in Kinesis Get Records with LocalStack" }, { "text": "Problem description\ngit commit -m 'Updated xxxxxx'\nAn error has occurred: InvalidConfigError:\n==> File .pre-commit-config.yaml\n=====> 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte\nSolution description\nSet utf-8 encoding when creating the pre-commit yaml file:\npre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\nAdded by MarcosMJD", "section": "Module 6: Best practices", "question": "In Powershell, Git commit raises utf-8 encoding error after creating pre-commit yaml file" }, { "text": "Problem description\ngit commit -m 'Updated xxxxxx'\n[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.\n[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.\n[INFO] Once installed this environment will be reused.\nAn unexpected error has occurred: CalledProcessError: command:\n\u2026\nreturn code: 1\nexpected return code: 0\nstdout:\nAttributeError: 'PythonInfo' object has no attribute 'version_nodot'\nSolution description\nClear the app-data of the virtualenv:\npython -m virtualenv api -vvv --reset-app-data\nAdded by MarcosMJD", "section": "Module 6: Best practices", "question": "Git commit with pre-commit hook raises error \u2018'PythonInfo' object has no attribute 'version_nodot'" }, { "text": "Problem description\nProject structure:\n/sources/production/model_service.py\n/sources/tests/unit_tests/test_model_service.py (\u201cfrom production.model_service import ModelService)\nWhen running python test_model_service.py from the sources directory, it works.\nWhen running pytest ./test/unit_tests, it fails: \u2018No module named \u2018production\u2019\u2019\nSolution description\nUse python -m pytest ./test/unit_tests\nExplanation: pytest does not add the directory where it is run to sys.path.\nYou can run python -m pytest, or alternatively export PYTHONPATH=. 
Before executing pytest\nAdded by MarcosMJD", "section": "Module 6: Best practices", "question": "Pytest error \u2018module not found\u2019 when if using custom packages in the source code" }, { "text": "Problem description\nProject structure:\n/sources/production/model_service.py\n/sources/tests/unit_tests/test_model_service.py (\u201cfrom production.model_service import ModelService)\ngit commit -t \u2018test\u2019 raises \u2018No module named \u2018production\u2019\u2019 when calling pytest hook\n- repo: local\nhooks:\n- id: pytest-check\nname: pytest-check\nentry: pytest\nlanguage: system\npass_filenames: false\nalways_run: true\nargs: [\n\"tests/\"\n]\nSolution description\nUse this hook instead:\n- repo: local\nhooks:\n- id: pytest-check\nname: pytest-check\nentry: \"./sources/tests/unit_tests/run.sh\"\nlanguage: system\ntypes: [python]\npass_filenames: false\nalways_run: true\nAnd make sure that run.sh sets the right directory and run pytest:\ncd \"$(dirname \"$0\")\"\ncd ../..\nexport PYTHONPATH=.\npipenv run pytest ./tests/unit_tests\nAdded by MarcosMJD", "section": "Module 6: Best practices", "question": "Pytest error \u2018module not found\u2019 when using pre-commit hooks if using custom packages in the source code" }, { "text": "Problem description\nThis is the step in the ci yml file definition:\n- name: Run Unit Tests\nworking-directory: \"sources\"\nrun: ./tests/unit_tests/run.sh\nWhen executing github ci action, error raises:\n\u2026/tests/unit_test/run.sh Permission error\nError: Process completed with error code 126\nSolution description\nAdd execution permission to the script and commit+push:\ngit update-index --chmod=+x .\\sources\\tests\\unit_tests\\run.sh\nAdded by MarcosMJD", "section": "Module 6: Best practices", "question": "Github actions: Permission denied error when executing script file" }, { "text": "Problem description\nWhen a docker-compose file contains a lot of containers, running the containers may take too much resource. There is a need to easily select only a group of containers while ignoring irrelevant containers during testing.\nSolution description\nAdd profiles: [\u201cprofile_name\u201d] in the service definition.\nWhen starting up the service, add `--profile profile_name` in the command.\nAdded by Ammar Chalifah", "section": "Module 6: Best practices", "question": "Managing Multiple Docker Containers with docker-compose profile" }, { "text": "Problem description\nIf you are having problems with the integration tests and kinesis double check that your aws regions match on the docker-compose and local config. Otherwise you will be creating a stream in the wrong region\nSolution description\nFor example set ~/.aws/config region = us-east-1 and the docker-compose.yaml - AWS_DEFAULT_REGION=us-east-1\nAdded by Quinn Avila", "section": "Module 6: Best practices", "question": "AWS regions need to match docker-compose" }, { "text": "Problem description\nPre-commit command was failing with isort repo.\nSolution description\nSet version to 5.12.0\nAdded by Erick Calderin", "section": "Module 6: Best practices", "question": "Isort Pre-commit" }, { "text": "Problem description\nInfrastructure created in AWS with CD-Deploy Action needs to be destroyed\nSolution description\nFrom local:\nterraform init -backend-config=\"key=mlops-zoomcamp-prod.tfstate\" --reconfigure\nterraform destroy --var-file vars/prod.tfvars\nAdded by Erick Calderin", "section": "Module 6: Best practices", "question": "How to destroy infrastructure created via GitHub Actions" } ] } ]