,doc_body,doc_description,doc_full_name,doc_status,article_id
3,"DEMO: DETECT MALFUNCTIONING IOT SENSORS WITH STREAMING ANALYTICS
IBM Analytics
Published on Nov 6, 2017
This video demonstrates a Streaming Analytics application written in Python running in the IBM Data Science Experience. The results of the analysis are displayed on a map using Plotly. The notebook demonstrated in this video is available for you to try: http://ibm.biz/WeatherNotebook
Visit Streamsdev for more articles and tips about Streams: https://developer.ibm.com/streamsdev
Python API Developer guide: http://ibmstreams.github.io/streamsx....
Streaming Analytics in Python course: https://developer.ibm.com/courses/all...
",Detect bad readings in real time using Python and Streaming Analytics.,Detect Malfunctioning IoT Sensors with Streaming Analytics,Live,0
5,"COMMUNICATING DATA SCIENCE: A GUIDE TO PRESENTING YOUR WORK
Megan Risdal | 06.29.2016

See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business objectives, we find ourselves bridging the gap between The Two Cultures: sciences and humanities. After we spend countless hours at the terminal devising a creative and elegant solution to a difficult problem, the insights and business applications are obvious in our minds. But how do you distill them into something you can communicate?

Qualifications and requirements for a senior data scientist position.

Presenting my work is one of the surprising challenges I faced in my recent transition from academia to life as a data analyst at a market research and strategy firm. When I was a linguistics PhD student at UCLA studying learnability theory in a classroom or measuring effects of an oral constriction on glottal vibration in a sound booth, my colleagues and I were comfortable speaking the same language. Now that I work with a much more diverse crowd of co-workers and clients with varied backgrounds and types of expertise, I need to work harder to ensure that the insights of my analyses are communicated effectively. In this second entry in the communicating data science series, I cover some essentials when it comes to presenting a thorough, comprehensible analysis for readers who want (or need) to know how to get their work noticed and read.

--------------------------------------------------------------------------------

GET YOUR HEAD IN THE GAME
Imagine you've just completed the so-called heavy lifting, whatever it may be, and you're ready to present your results and conclusions in a report. Well, step away from the word processor! There are two things you must first consider: your audience and your goals. This is your forest.

WHO IS YOUR AUDIENCE?
The matter of who you're speaking to will influence every detail of how you choose to present your analysis, from whether you use technical jargon to how much time you spend carefully defining your terms. The formality of the context may determine whether a short, fun tangent or personal anecdote will keep your audience happily engaged or elicit eye rolls … and worse. This is all important to consider because once you've envisioned your audience, you can take stock of what may and may not be shared knowledge and of how to manage their expectations. In your writing (and in everyday life), it's useful to be cognizant of Grice's principles of cooperative communication:
1. Maxim of quantity: be informative, without giving overwhelming amounts of extraneous detail.
2. Maxim of quality: be truthful. Enough said.
3. Maxim of relation: be relevant. I'll give you some tips on staying topical shortly!
4. Maxim of manner: be clear. Don't be ambiguous; be orderly.
So be cooperative! Know your audience and do what you can to anticipate their expectations. This will ensure that you cover all the ground in exactly as much detail as necessary in your report.

WHAT IS THE GOAL?
Also, before you put pen to paper, it's helpful to remind yourself again (and again) of what your goal is. If you're working in a professional environment, you're aware that it's important to be continually mindful of the goal or business problem and why you're tasked with solving it. Or perhaps it's a strategic initiative you're after: Did you set out to learn something new about some data (and the world)? Or have you been diligently working on a new skill you'd like to showcase? Do you want to test out some ideas and get feedback? It's okay to make it your goal to find out ""Can I do this?"" Maybe you want to share some of your expertise with the community on Kaggle Scripts. In that case, it's even more imperative that you have a buttoned-up analysis!

""If we can really understand the problem, the answer will come out of it, because the answer is not separate from the problem."" ― Jiddu Krishnamurti

If you've reached the point of having an analysis to report, you've more than likely familiarized yourself with the goals of the initiative, but you must also keep them at the forefront of your thoughts when presenting your results. Your work should be contextualized in terms of your understanding of the research objectives. Often in my own day job this means synthesizing many analyses I've performed into a few key pieces of evidence which support a story; this can't be done well, except by accident, without keeping in mind the ultimate objective at hand.

THE PREAMBLE
Now that you've got yourself in the right frame of mind―you can see the forest and you know the trees―you're ready to start thinking about the content of your report. However, before you start furiously spilling ink, first remind yourself of the three elements required to ask an askable question in science:
1. The question itself, along with some justification of how it addresses your objectives
2. A hypothesis
3. A feasible methodology for addressing your question
Much as I implore you to consider who your audience is and what your objectives are in order to get your mind in the right place, I'm recommending that you have the answers to these three things ready because they will dictate the content of your report. You don't want to throw everything and the kitchen sink into a report!

WHAT'S THE QUESTION?
On Kaggle, the competition hosts very generously provide their burning questions to the community. Outside of this environment, the challenge is to come up with one on your own or to work within the business objectives of your employer. At this point, make sure that you can appropriately state the question and how it relates to your objective(s). As an aside, if you need some exercise in the area of asking insightful questions (a skill unto its own), I hereby challenge you to scroll through some of Kagglers' most recent scripts, find and read one, and think of one new question you could ask the author. If you find that this is a stumbling block preventing you from proceeding with your analysis, many dataset publishers include a number of questions they'd like to see addressed. Or read the Script of the Week blogs and see what other ideas script authors would like to see explored in the same dataset.

WHAT'S THE ANSWER?
Now that you have your question, what do you think the answer will be? It's good practice, of course, to consider what the possible answers may be before you dig into the data, so hopefully you've already done that! Clearly delimiting the hypothesis space at this point will guide the evidence and arguments you use in the body of your report. It will be easier to evaluate what constitutes weak and strong support of your theory and what analyses may be absolutely irrelevant. Ultimately you will prevent yourself from attacking straw men in faux support of your theory.

Don't build straw men.

WHAT'S YOUR METHODOLOGY?
Let's say you're asking whether Twitter users with dense social networks in the How ISIS Uses Twitter dataset express greater negative sentiment than users with less dense networks. Your first step is to confirm that the data available is sufficient to address your research question. If major information is missing, you may want to rethink your question, revise your methodology, or even collect new data. If you're unsure of how to put language to a particular methodology, this is a good opportunity to flex your Googling skills. Search for “social network analysis in r” or “sentiment analysis in python.” Dive into some academic papers if it's appropriate and see how it's presented. Peruse the natural language processing tags on No Free Hunch and read the winners' interviews. Get inspiration from scripts on similar datasets on Kaggle. For example, a similar analysis was performed by Kaggle user Khomutov Nikita using the Hillary Clinton's Emails dataset.

Hillary Clinton's network graph. See the code here.

Even if you don't end up needing to share every nuance of your methodology with your given audience, you should always document your work thoroughly to the extent possible. Once you're ready to present your analysis, you'll be capable of determining how much is the right amount to share when discussing the nitty-gritty mechanics of your model. Similarly, I've been able to pleasantly surprise my boss many times because I have an answer ready at hand for immediate questions, thanks to keeping my exploratory analyses well documented. By the way, if you've felt overwhelmed by the task of putting together a solid methodology for tackling a question, it can't hurt to lob an idea and some code to the community for feedback. Especially once you have solid analysis-presentation skills! Be honest about where you feel you could use extra input and maybe a fellow Kaggler will come forth with a different angle on the problem.
PUTTING THE PIECES TOGETHER
Finally, you're ready to write. Keep in mind that a good analysis should facilitate its own interpretation as much as possible. Again, this requires anticipating what information your likely audience will be seeking and what knowledge they're coming in with already. One method which is both tried-and-true and friendly to the academic nature of the discipline is following a template for your analysis. With that, this section covers the structure which, when fleshed out, will help you tell the story in the data.

NOT SO ABSTRACT
Make it easy for your audience to quickly determine what they're about to digest. Use an abstract or introduction to recall your objectives and clearly state them for your readers. What is the problem that you've set out to solve? If you have a desired outcome or any expectations of your audience, say so, as this is the entire reason you're presenting them with your analysis. You then cover everything from your preamble in this section: the question you've been on a mission to answer, your hypothesis, and the methodology you've used. Finally, you will often provide a high-level summary of your results and key findings. Don't worry about spoiler alerts or boring your readers to death with the content that's about to follow. Trust that if they pay attention past the introduction, they are interested in how you achieve what you claim you have. Many people I've talked to have said that they often find it easier to write the abstract after having already completely documented the detailed findings of the analysis. I think that this is at least in part because, by doing so, you've familiarized yourself with your own work through the lens of your readership. Slowly but surely you're extracting yourself from the trees and the bigger picture becomes apparent.

THE CONTENT: BREAK OFF WHAT YOU CAN CHEW
This is where the good stuff lives. You've laid the foundation for your analysis such that your audience is prepared to read or listen intently to your story. I can't tell you the specifics of what goes here, but I can tell you how to structure it. Take your analysis in small bits by breaking your question into subparts. For a data-driven analysis, it can make sense to tackle each piece of evidence one by one. You may have a dissertation's worth of data to report on, but more likely than not you must pick and choose what will best support your analysis succinctly and effectively. Again, having the objectives and audience in mind will help you decide what's critical. Lay it all out before you and pair sub-questions with evidence until you have a story. Once you've presented the evidence, explain why it supports (or doesn't support) your hypothesis or your objectives. A good analysis also considers alternative hypotheses or interpretations. You've already surveyed the hypothesis space, so you should be ready-armed to handle contrary evidence. Doing so is also a way of anticipating the expectations of your audience and the skepticism they may harbor. It's at this point that it's most critical to keep in mind your objectives and the question you're addressing with your analysis. Ask how every piece of evidence you offer takes you one step closer to confirming or disproving your hypothesis.

OTHER TIPS AND TRICKS
Visualize the problem. Seeing is believing.
It sounds clichéd, as any statement asserting the value of data visualization does, but it's so incredibly true. This “trick” is so effective that I'm going to spend more time talking about it in a future post. If you can plainly “state” something with a graph or chart, go for it!
* Shail Jayesh Deliwala visualizes confusion matrices to evaluate and compare model performance. Read the full notebook here.
* Lj Miranda shows the steady rise of carbon emissions in the Philippines. Read the full notebook here.
* 33Vito uses polar coordinates to show the times during the day leveling and non-leveling characters play World of Warcraft. Read the full notebook here.
* Michael Griffiths uses color and variations in transparency to make this table of percentages more readily interpretable. Read the full notebook here.

Variety is the spice of life. And it can liven up your writing (and speaking) as well. For example, use a mix of short and sweet sentences interspersed among longer, more elaborate ones. Find where you accidentally used the word “didactic” four times on one page and change it up! Related to my first point, use effective variety in the types of visualizations you employ. Small things like this will keep your readers awake and interested.

Check your work. I don't like to emphasize this too much because I'm a descriptivist, but make sure your writing is grammatical, fluent, and free of typos. For better or worse, trivial mistakes can discredit you in the eyes of many. I find that it helps to read my writing aloud to catch disfluencies.

Gain muscle memory. If you really struggle with transforming your analysis into a form that can be shared more broadly, begin by writing anything until writing prose feels as natural as writing code. For example, I actually suggest sitting down and copying a report word for word. Or even any instance of persuasive writing. Not to be used as your own in any way (i.e., plagiarism), but to remove one more unknown from the equation: what it literally feels like to go through the motions of stringing words and sentences and paragraphs together to tell a story.

CONCLUSIONS & NEXT STEPS
A good analysis is repetitive. You know the intricacies of your work in and out, but your audience does not. You've told your readers in your abstract (or introduction, if you prefer) what you had ventured to do and even what you ended up finding, and the content lays this all out for them. In the conclusions section you hit them with it again. At this point, they've seen the relevant data you've carefully chosen to support your theory, so it's time to formally draw your conclusions. Your readers can decide if they agree or not. Speaking of being repetitive, after making your conclusions, you again remind your readers of the objective(s) of this report. Restate them and help your readers help you―what do you expect now? What feedback would you like?
What decision-making can happen now that your report is presented and the insights have been shared? In my work, I often collaborate with strategists to develop a set of recommendations for our clients. Typically I'll take a stab at it based on the expertise I've gained in working with the data, and a strategist will refine it using their business insights.

FIN
And this is exactly where the beauty of the analysis and your skillful presentation thereof meet. Because you've managed to package your approach in a fashion digestible to your audience, your readers, collaborators, and clients have comprehended and learned from your analysis and what its implications are without getting lost in the trees. They are equipped to react to the value in your work and participate in the next step of realizing its objectives.

--------------------------------------------------------------------------------

Thanks for reading the second entry in this series on communicating data science. I covered the basics of presenting an analysis at a very high level. I'd love to learn what your approach is, how you realize the value in your work, and how you collaborate with others to achieve business goals. Leave a comment or send me a note! If you missed my interview with Tyler Byers, a data scientist and storytelling expert, check it out here. Stay tuned to learn some data visualization fundamentals.

COMMENTS
* Liling Tan: Gricean maxims should be ""maxims"" of quantity, quality, relation and manner, not ""maximums"" =)
* Megan Risdal: Haha, wow! I don't know how I did that. Fixed. Thank you! 🙂
* Albert Camps: Very interesting, thanks!!! Trying to summarize it ended up being quite long anyway. A lot of distilled information. We work a bit differently. We include an executive summary + recommendations at the beginning of the presentation instead of putting them at the end, just after stating the question to answer. After that the audience knows what will come, and when the presentation is revisited it is a lot faster to check. If there's a need to dig deeper, all the analysis steps are still available. Hoping to see the next one soon! 😀
* Megan Risdal: Thanks! I actually often do the same thing re: executive summaries in my day job, too! That's a really good point. There's definitely no one-size-fits-all approach, which makes a high-level summarization misleading in certain ways. And now that I think of it, another strength in communicating data science is being able to be information dense & concise for times where you need to fit your work into a standalone one- or two-sheeter/executive summary. Hopefully more good stuff coming soon. 🙂
","See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business obj…",Communicating data science: A guide to presenting your work,Live,1
7,"THIS WEEK IN DATA SCIENCE (APRIL 18, 2017)
Posted on April 18, 2017 by Janice Darling

Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful!

INTERESTING DATA SCIENCE ARTICLES AND NEWS
* Top mistakes data scientists make when dealing with business people – A discussion of the three top mistakes data scientists make.
* 4 Trends in Artificial Intelligence that affect enterprises – Four AI trends that stand out in their effect on companies and enterprises.
* R Best Practices: R you writing the R way! – A list of programming practices that result in improved readability, consistency, and repeatability.
* The 5 Best Reasons To Choose MYSQL – and its 5 Biggest Challenges – Reasons to use MySQL and the common challenges associated with it.
* 7 types of job profiles that make you a Data Scientist – A discussion of the common skill sets of different data scientist profiles.
* Detecting Hackers & Impersonators with Machine Learning – Applying Machine Learning to detect phishing attacks faster.
* Some Lesser-Known Deep Learning Libraries – A list of lesser-known but useful Deep Learning libraries.
* In case you missed it: March 2017 roundup – Articles about R programming from Revolutions.
* Investing, Fast & Slow – Part 2: Investment for Data Scientists 101 – The second part in a discussion series on investing and data science from Dataconomy.
* 10 Free Must-Read Books for Machine Learning and Data Science – A list of interesting Machine Learning and Data Science reads.
* Integrate SparkR And R For Better Data Science Workflow – How to work with R and SparkR for wrangling large datasets.
* Can Watson, the Jeopardy champion, solve Parkinson's? – Toronto researchers are using Watson to help find a cure for Parkinson's.
* The Henry Ford to debut 'cognitive dress' using IBM Watson technology – The Henry Ford will display a dress created from a collaboration between Marchesa and IBM Watson.
* The Democratization of Machine Learning: What It Means for Tech Innovation – How accessible ML can further spur tech innovation.
* 3 reasons why data scientist remains the top job in America – A discussion of why the role of data scientist has remained the top job in America.

FEATURED COURSES FROM BDU
* SQL and Relational Databases 101 – Learn the basics of the database querying language, SQL.
* Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
* Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
* Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course.
* Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google's library to apply deep learning to different data types in order to solve real world problems.

UPCOMING DATA SCIENCE EVENTS
* Data Science: Classification Algorithms in Python (Hands-On) – April 25, 2017 @ 6 – 8:30 pm, Lighthouse Labs

COOL DATA SCIENCE VIDEOS
* Machine Learning With Python – Unsupervised Learning – Measuring the Distances Between Clusters – Using Single Linkage Clustering to measure the distance between clusters.
* Machine Learning With Python – Hierarchical Clustering Advantages & Disadvantages – A discussion of Hierarchical Clustering.
* Machine Learning With Python – Unsupervised Learning K Means Clustering Advantages & Disadvantages – A discussion of K-Means Clustering.
",Here's this week's news in Data Science and Big Data.,"This Week in Data Science (April 18, 2017)",Live,2
8,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCALE - BOOST THE PERFORMANCE OF YOUR DISTRIBUTED DATABASE
Published Dec 29, 2016

Learn how distributed DBs (Cassandra, MongoDB, RethinkDB, etc.) solve the problem of scaling persistent storage but introduce latency as data size increases and they become I/O bound. In single-server DBs, latency is solved by introducing caching.
In this talk, Akbar Ahmed shows you how to improve the performance of distributed DBs by using a distributed cache to move the data layer performance limitation from I/O bound to network bound.

Akbar is the CEO and founder of DynomiteDB, a framework for turning single-server data stores into linearly scalable, distributed databases. He is an Apache Cassandra certified developer and a Cassandra MVP; he enjoys the expressiveness of both SQL and alternative query languages, evaluates the entire database ecosystem every 6 months, and has an MBA in Information Systems.
",Learn how distributed DBs solve the problem of scaling persistent storage, but introduce latency as data size increases and become I/O bound.,DataLayer Conference: Boost the performance of your distributed database,Live,3
12,"DATA SCIENCE EXPERIENCE: ANALYZE NY RESTAURANT INSPECTIONS DATA
developerWorks TV
Published on Oct 3, 2017
Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
",This video demonstrates the power of IBM Data Science Experience using a simple New York State Restaurant Inspections data scenario. ,Analyze NY Restaurant data using Spark in DSX,Live,4
It's a complement to the extensive and complex client tools that exist for PostgreSQL which, once configured and mastered, can let you peer into every corner of the database.We'll be expanding the Compose browser's capabilities over time with that philosophy in mind, but first, let's take a look at the capabilities already available in the PostgreSQL browser.We'll begin our tour of the PostgreSQL browser from the top. To get to the browser, log into the Compose dashboard, select 'Browser' in the sidebar to see this...The ""top"" of the browser's view shows the databases created in the PostgreSQL instance. In the screenshot above you can see that there are two, the default compose and a dvdrental database. You can also see the on-disk size of each of the databases. The dvdrental is having data imported into it for a future demonstration so let's take a look inside there by clicking on its row to reveal:A number of tables. These are the tables in the database, each one displayed with an estimated row count. If selected, the Admin tab in the sidebar only offers the option to delete the current database, but we're more interested in looking at some of the data here. If we click into the film table, we get a better view of that data:This is the Query view of the film table. The default query reads the first 20 items from the table and displays their contents in a table below the query. This table will include all the fields so some horizontal scrolling could be involved. You can edit the query to adjust and LIMIT the number of items displayed, add a WHERE clause to include your own selection criteria, add an ORDER BY clause and sort according to a field or add an OFFSET to skip a number of returned results. Using all of these would look like this.That OFFSET value can also be changed, by the LIMIT value, using the Next and Last buttons at the bottom of the table view so you can page through the data.If there's a primary key on the table, then you'll also be able to get a better look at the data in a row by clicking anywhere on a row to get to the update row view. We're going to give you the whole view of one here, though if you have a field-rich table, expect to scroll:Here we see nearly all the fields of the row; the only one missing is the primary key field, which if you look just above the field list is being used in the WHERE clause to select this record. The rest of the fields are displayed with both the field name and, usually, the type of that field along with an editable field to allow for modifications.So, we can see, going down this page, a ""title"" field, defined as a ""varchar(255)"" with a text area to edit its contents, and below that, a ""description"" field, defined as ""text"" also with a text area. Field validation takes place on submitting the update to the database, so if you put too much text in the ""title"" field, it's at update time that you'll be told there's too much text.The same goes for validating the numeric fields, like the smallint and numeric types, and the date intervals like the year field. The observant will notice a field with the type ""mpaa_rating"" further down the table. 
That OFFSET value can also be changed, in steps of the LIMIT value, using the Next and Last buttons at the bottom of the table view so you can page through the data.

If there's a primary key on the table, then you'll also be able to get a better look at the data in a row by clicking anywhere on a row to get to the update row view. We're going to give you the whole view of one here, though if you have a field-rich table, expect to scroll. Here we see nearly all the fields of the row; the only one missing is the primary key field, which, if you look just above the field list, is being used in the WHERE clause to select this record. The rest of the fields are displayed with both the field name and, usually, the type of that field, along with an editable field to allow for modifications.

So we can see, going down this page, a ""title"" field, defined as a ""varchar(255)"" with a text area to edit its contents, and below that, a ""description"" field, defined as ""text"", also with a text area. Field validation takes place on submitting the update to the database, so if you put too much text in the ""title"" field, it's at update time that you'll be told there's too much text. The same goes for validating the numeric fields, like the smallint and numeric types, and the date intervals like the year field.

The observant will notice a field with the type ""mpaa_rating"" further down the table. This field has its type set to a user-defined enum type like this:

CREATE TYPE mpaa_rating AS ENUM ('G', 'PG', 'PG-13', 'R', 'NC-17');

Because the browser lets the database flag errors at update time, this field is also validated; enter a string value which doesn't match one of the enum values and, when you press update, you'll get an error and the update will be completely rolled back.

Back to the field types in the table. The ""lastupdate"" field is effectively read-only as it'll be overwritten during updating. The ""fulltext"" field is a tsvector field and is also not editable - in this database's case it is updated on insert or update by a trigger. There is one field you can edit, the text array that is ""specialfeatures"", which takes PostgreSQL syntax for an array literal – { ""string"",""string"",""string"",... }.

That covers editing, but you can also add new rows. You'll find the button for that in the query view in the top right, marked ""Insert row"". It'll bring up an unpopulated page similar to the edit row page. This form is more forgiving of validation errors than the ""Update Row"" page in that, if you do have a change which is rejected by the validation process, the fields you have entered are not cleared. Apart from that, it is functionally the same as the ""Update Row"" page.

If we go back to the top table view, there are two tabs we haven't mentioned at the top of the page. The Indexes tab shows the current indexes that apply to the table we are looking at. Here, for example, we can see the unique primary key on the film_id, a fulltext index on the tsvector fulltext field, a foreign key index on the language id, and a simple index on the movie title. There's also the option to drop any one of these indexes with the right side's drop button. As well as displaying indexes, you can create indexes, albeit, currently, only unique or non-unique btree indexes. Enter the fields you want indexed between the parentheses, click Unique if you want a unique index, and click name and enter an index name if you want to set a particular name for the index - then just click Create Index.

The Settings tab currently offers one option: dropping the table. The choice here is whether or not to drop any database objects that depend on the table when you drop it, using a CASCADE operator. Remember to check you have a working backup of your table, or whole database, before you drop the table, as there's no going back on the drop without doing a restore. The browser doesn't have a ""Create Table"" option yet, so you can't manually rebuild the table without reaching for the psql command-line tool.

We'll be enhancing the PostgreSQL browser in the future. As you can see, there's already a useful range of functionality for the database user on the go, and we aim to make it your first stop for PostgreSQL control on Compose.
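As a rough guide to the plain SQL those last two tabs stand in for, the statements below are a sketch against the dvdrental sample rather than anything the browser literally displays, so treat the names as assumptions:

  -- what the Indexes tab's Create Index form amounts to
  -- (put UNIQUE before INDEX if the Unique box is ticked)
  CREATE INDEX idx_film_title ON film (title);

  -- what the Settings tab's drop does with the dependent-objects option ticked
  DROP TABLE film CASCADE;

Either way the browser is issuing ordinary DDL, which is why a recent backup is the only way back from that drop.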
",Using Compose's PostgreSQL data browser.,Browsing PostgreSQL Data with Compose,Live,5
15,"UPGRADING YOUR POSTGRESQL TO 9.5
Published Apr 26, 2016

Upgrading your PostgreSQL deployment to version 9.5 is now possible through the Compose console. Working out how to perform this upgrade safely and reliably has been an interesting process, because going from version 9.4 to 9.5 is a PostgreSQL major upgrade.

""Wait"", you say, ""that's not a major upgrade.""

With PostgreSQL it is... ""A major release is numbered by increasing either the first or second part of the version number, e.g. 9.1 to 9.2.""

The important thing about PostgreSQL major updates is that they usually change how the data is stored internally. That's why, whenever a PostgreSQL database starts up, it checks what version of PostgreSQL created the data directory. If it isn't the same major version, it'll refuse to run. Upgrading is traditionally done by dumping the contents of the databases, updating the database software and then restoring the dump's contents to the freshly updated database. It's a bit hands-on and time-consuming, so we looked for an alternative. We'll be looking at how we came up with our approach to this in another article coming soon. Suffice to say, we looked at engineering an upgrade system with an eye on resilience and redundancy which performed quickly.

COMPOSE'S POSTGRESQL UPGRADE
Our major version upgrade process begins with a backup. We start there as it's a known point in the life of your data. You may wish to put your applications into maintenance mode and create an on-demand backup to ensure that you have the most recent data. Whichever backup you go with, it will be restored to a new PostgreSQL deployment where we may, or may not, run the pg_upgrade tool.

We call this process Deployment from backup, and it supports the ability to change database version while it runs. At the end of the process, you'll have a freshly provisioned database in a fraction of the time it would take to dump and restore it. Backups are made automatically on Compose so there's always a recent backup, but you can always make one on demand to be completely up to date. Once you have a backup, you can restore it to a new deployment by clicking on the restore icon.

That will take you to the Deployment from backup dialog. At the top are details about the backup you have selected to restore: which deployment it is from and when it was created. The rest of the page is about the deployment to be created to house this restored backup. You can enter a new deployment name (or accept the delightfully generated default). You can then say generally where you'd like the deployment created if you have a Compose Enterprise account; otherwise it defaults to ""Compose Hosted"". If ""Compose Hosted"" is selected, a range of data center locations is then available to create the deployment in.

Then we get to the Upgrade section. By default, this process will not upgrade your database and will select the matching major version. If you click Create Deployment at this point you will effectively clone your database. If, on the other hand, you select a different version, like say 9.5, when you click Create Deployment, something extra happens. Your data is restored into a new deployment but, rather than start up a database instance, pg_upgrade is run to upgrade the stored data to the selected version.

You'll now have your original PostgreSQL deployment and an upgraded PostgreSQL deployment running concurrently. Validate the upgraded PostgreSQL deployment and switch your applications to use that, then decommission the original PostgreSQL when you are ready. If you're unhappy with the upgraded database, you still have your original database to fall back to. We're not expecting anyone to have problems with the Deployment from backup process, but we like to build things which give you lots of room to recover with if anything does go astray.
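One quick sanity check while the original and upgraded deployments are running side by side: this is not a step the article itself prescribes, just the standard PostgreSQL way to confirm which major version each one is on before you switch applications over:

  -- run against both deployments and compare
  SELECT version();      -- full version string
  SHOW server_version;   -- just the version number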
",Upgrading your PostgreSQL deployment to version 9.5 is now possible through the Compose console. Working out how to perform this upgrade safely and reliably has been an interesting process because from version 9.4 to 9.5 is a PostgreSQL major upgrade.,Upgrading your PostgreSQL to 9.5,Live,6
17,"DATA WRANGLING AT SLACK
By Ronnie Chen and Diana Pojar
Dec 7
Research Data Management via janneke staaks, licensed under Creative Commons

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions: “Based on a team's activity within its first week, what is the probability that it will upgrade to a paid team?” or “What is the performance impact of the newest release of the desktop app?”

THE DREAM
We knew when we started building this system that we would need flexibility in choosing the tools to process and analyze our data. Sometimes the questions being asked involve a small amount of data and we want a fast, interactive way to explore the results. Other times we are running large aggregations across longer time series and we need a system that can handle the sheer quantity of data and help distribute the computation across a cluster. Each of our tools would be optimized for a specific use case, and they all needed to work together as an integrated system. We designed a system where all of our processing engines would have access to our data warehouse and be able to write back into it. Our plan seemed straightforward enough as long as we chose a shared data format, but as time went on we encountered more and more inconsistencies that challenged our assumptions.

THE SETUP
Our central data warehouse is hosted on Amazon S3, where data can be queried via three primary tools: Hive, Presto and Spark. To help us track all the metrics that we want, we collect data from our MySQL database, our servers, clients, and job queues and push them all to S3. We use an in-house tool called Sqooper to scrape our daily MySQL backups and export the tables to our data warehouse. All of our other data is sent to Kafka, a scalable, append-only message log, and then persisted on to S3 using a tool called Secor. For computation, we use Amazon's Elastic MapReduce (EMR) service to create ephemeral clusters that are preconfigured with all three of the services that we use.

Presto is a distributed SQL query engine optimized for interactive queries.
It's a fast way to answer ad-hoc questions, validate data assumptions, explore smaller datasets, create visualizations, and power some internal tools where we don't need very low latency. When dealing with larger datasets or longer time series data, we use Hive, because it implicitly converts SQL-like queries into MapReduce jobs. Hive can handle larger joins and is fault-tolerant to stage failures, and most of the jobs in our ETL pipelines are written this way. Spark is a data processing framework that allows us to write batch and aggregation jobs that are more efficient and robust, since we can use a more expressive language instead of SQL-like queries. Spark also allows us to cache data in memory to make computations more efficient. We write most of our Spark pipelines in Scala, and use them for data deduplication and for all core pipelines.

TYING IT ALL TOGETHER
How do we ensure that all of these tools can safely interact with each other? To bind all of these analytics engines together, we define our data using Thrift, which allows us to enforce a typed schema and have structured data. We store our files using Parquet, which formats and stores the data in a columnar format. All three of our processing engines support Parquet, and it provides many advantages around query and space efficiency. Since we process data in multiple places, we need to make sure that our systems are always aware of the latest schema, so we rely on the Hive Metastore to be our ground truth for our data and its schema.

CREATE TABLE IF NOT EXISTS server_logs (
  team_id BIGINT,
  user_id BIGINT,
  visitor_id STRING,
  user_agent MAP<STRING, STRING>,
  api_call_method STRING,
  api_call_ok BOOLEAN
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://data/server_logs'

Both Presto and Spark have Hive connectors that allow them to access the Hive Metastore to read tables, and our Spark pipelines dynamically add partitions and modify the schema as our data evolves. With a shared file format and a single source for table metadata, we should be able to pick any tool we want to read or write data from a common pool without any issues. In our dream, our data is well defined and structured, and we can evolve our schemas as our data needs evolve. Unfortunately, our reality was a lot more nuanced than that.
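For a sense of what that shared pool looks like in use, a query along the lines of the one below runs unchanged in Hive, Presto, or Spark SQL against the server_logs table defined above (give or take catalog and schema settings); the partition values and the aggregation are invented for illustration:

  -- the same statement works in any of the three engines,
  -- because they all resolve the table through the Hive Metastore
  SELECT api_call_method,
         count(*) AS calls,
         sum(CASE WHEN api_call_ok THEN 0 ELSE 1 END) AS failed_calls
  FROM server_logs
  WHERE year = 2016 AND month = 12 AND day = 7   -- prunes to a single day's partitions
  GROUP BY api_call_method;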
COMMUNICATION BREAKDOWN
All three processing engines that we use ship with libraries that enable them to read and write the Parquet format. Managing the interoperation of all three engines using a shared file format may sound relatively straightforward, but not everything handles Parquet the same way, and these tiny differences can make big trouble when trying to read your data. Under the hood, Hive, Spark, and Presto are actually using different versions of the Parquet library and patching different subsets of bugs, which does not necessarily keep backwards compatibility. One of our biggest struggles with EMR was that it shipped with a custom version of Hive that was forked from an older version and was missing important bug fixes. What this means in practice is that the data you write with one of the tools might not be readable by the other tools, or worse, you can write data which is read by another tool in the wrong way. Here are some sample issues that we encountered:

ABSENCE OF DATA
One of the biggest differences that we found between the different Parquet libraries was how each one handled the absence of data. In Hive 0.13, when you use Parquet, a null value in a field will throw a NullPointerException. But supporting optional fields is not the only issue. The way that data gets loaded can turn a block of nulls— harmless by themselves —into an error if no non-null values are also present (PARQUET-136). In Presto 0.147, it was complex structures that uncovered a different set of issues — we saw exceptions being thrown when the keys of a map or list are null. The issue was fixed in Hive, but not ported to the Presto dependency (HIVE-11625). To protect against these issues, we sanitize our data before writing to the Parquet files so that we can safely perform lookups.

SCHEMA EVOLUTION TROUBLES
Another major source of incompatibility is around schema and file format changes. The Parquet file format has a schema defined in each file based on the columns that are present. Each Hive table also has a schema, and each partition in that table has its own schema. In order for data to be read correctly, all three schemas need to be in agreement. This becomes an issue when we need to evolve custom data structures, because the old data files and partitions still have the original schema. Altering a data structure by adding or removing fields will cause old and new data partitions to have their columns appear at different offsets, resulting in an error being thrown. Doing a complete update would require re-serializing all of the old data files and updating all of the old partitions. To get around the time and computation costs of doing a complete rewrite for every schema update, we moved to a flattened data structure where new fields are appended to the end of the schema as individual columns.

These errors that will kill a running job are not as dangerous as invisible failures like data showing up in incorrect columns. By default, Presto settings use column location to access data in Parquet files, while Hive uses column names. This means that Hive supports the creation of tables where the Parquet file schema and the table schema columns are in a different order, but Presto will read those tables with the data appearing in different columns!

File schema:
  ""fields"": [{""name"":""user_id"",""type"":""long""},
             {""name"":""server_name"",""type"":""string""},
             {""name"":""experiment_name"", ""type"":""string""}]
Table schema:
  (user_id BIGINT, experiment_name STRING, server_name STRING)

----------------- Hive ------------------
user_id   experiment_name   server_name
1         test1             slack-1
2         test1             slack-2

---------------- Presto -----------------
user_id   experiment_name   server_name
1         slack-1           test1
2         slack-2           test1

It's a simple enough problem to avoid or fix with a configuration change, but easily something that can slip through undetected if not checked for.
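In Hive DDL, the append-only evolution described above comes down to only ever adding columns at the tail of the table schema, roughly as follows; the column name is invented for illustration, and exactly how already-written partitions behave depends on your Hive version and settings:

  -- append new fields at the end of the schema; never reorder or remove
  ALTER TABLE server_logs ADD COLUMNS (client_version STRING);
  -- older files simply lack the column, so reads of old data return NULL for it;
  -- newer Hive versions also accept ADD COLUMNS ... CASCADE to push the change
  -- into the metadata of existing partitions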
This leads to us being locked into certain versions until we implement workarounds for all of the compatibility issues, which makes cluster upgrades a very scary proposition. Even worse, when upgrades render our old workarounds unnecessary, we still have a difficult decision to make. For every workaround we remove, we have to decide whether it’s more effective to backfill our data to remove the hack or to perpetuate it to maintain backwards compatibility. How can we make that process easier?

A COMMON LANGUAGE

To solve some of these issues and to enable us to safely perform upgrades, we wrote our own Hive InputFormat and Parquet OutputFormat to pin our encoding and decoding of files to a specific version. By bringing control of our serialization and deserialization in house, we can safely use out-of-the-box clusters to run our tooling without worrying about being unable to read our own data. These formats are essentially forks of the official versions which bring in the bug fixes from various builds.

FINAL THOUGHTS

Because the various analytics engines we use have subtly different requirements about serialization and deserialization of values, the data that we write has to fit all of those requirements in order for us to read and process it. To preserve the ability to use all of those tools, we ended up limiting ourselves and building only for the shared subset of features. Shifting control of these libraries into a package that we own and maintain allows us to eliminate many of the read/write errors, but it’s still important to consider all of the common and uncommon ways that our files and schemas can evolve over time. Most of our biggest challenges on the data engineering team were not centered around writing code, but around understanding the discrepancies between the systems that we use. As you can see, those seemingly small differences can cause big headaches when it comes to interoperability. Our job on the data team is to build a deeper understanding of how our tools interact with each other, so we can better predict how to build for, test, and evolve our data pipelines.

--------------------------------------------------------------------------------

If you want to help us make Slack a little bit better every day, please check out our job openings page and apply. Thanks to Diana Pojar and Ross Harmes.

Big Data Analytics

RONNIE: I engineer data at @SlackHQ. Professional business dog.

SEVERAL PEOPLE ARE CODING: The Slack Engineering Blog","For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to…",Data Wrangling at Slack,Live,7 21,"
$1,000,000 • 655 TEAMS DATA SCIENCE BOWL 2017

CAN YOU IMPROVE LUNG CANCER DETECTION? In the United States, lung cancer strikes 225,000 people every year, and accounts for $12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival. One year ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade's worth of progress in cancer prevention, diagnosis, and treatment in just 5 years. In 2017, the Data Science Bowl will be a critical milestone in support of the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms. Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, participants will develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients. This year, the Data Science Bowl will award $1 million in prizes to those who observe the right patterns, ask the right questions, and in turn, create unprecedented impact around cancer screening care and prevention. The funds for the prize purse will be provided by the Laura and John Arnold Foundation. Visit DataScienceBowl.com to: • Sign up to receive news about the competition • Learn about the history of the Data Science Bowl and past competitions • Read our latest insights on emerging analytics techniques

ACKNOWLEDGMENTS The Data Science Bowl is presented by

COMPETITION SPONSORS Laura and John Arnold Foundation The Cancer Imaging Program of NCI American College of Radiology Amazon Web Services NVIDIA

DATA SUPPORT PROVIDERS National Lung Screening Trial The Cancer Imaging Archive Dr. Bram van Ginneken, Professor of Functional Image Analysis and his team at Radboud University Medical Center in Nijmegen Lahey Hospital & Medical Center University of Copenhagen Nicholas Petrick, Ph.D., Acting Director Division of Imaging, Diagnostics and Software Reliability Office of Science and Engineering Laboratories Center for Devices and Radiological Health U.S. Food and Drug Administration

SUPPORTING ORGANIZATIONS Bayes Impact Black Data Processng Associates Code the Change Data Community DC DataKind Galvanize Great Minds in STEM Hortonworks INFORMS Lesbians Who Tech NSBE Society of Asian Scientists & Engineers Society of Women Engineers University of Texas Austin, Business Analytics Program, McCombs School of Business US Dept.
of Health and Human Services US Food and Drug Administration Women in Technology Women of Cyberjutsu Started: 2:00 pm, Thursday 12 January 2017 UTC Ends: 11:59 pm, Wednesday 12 April 2017 UTC (90 total days) Points: this competition awards standard ranking points Tiers: this competition counts towards tiers","Kaggle is your home for data science. Learn new skills, build your career, collaborate with other data scientists, and compete in world-class machine learning challenges.",Data Science Bowl 2017,Live,8 28,"THE GRADIENT FLOW DATA / TECHNOLOGY / CULTURE

USING APACHE SPARK TO PREDICT ATTACK VECTORS AMONG BILLIONS OF USERS AND TRILLIONS OF EVENTS

[A version of this post appears on the O’Reilly Radar .] THE O’REILLY DATA SHOW PODCAST: FANG YU ON DATA SCIENCE IN SECURITY, UNSUPERVISED LEARNING, AND APACHE SPARK. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher , TuneIn , iTunes , SoundCloud , RSS . In this episode of the O’Reilly Data Show, I spoke with Fang Yu , co-founder and CTO of DataVisor . We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain. DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft , the startup has developed large-scale unsupervised algorithms on top of Apache Spark, to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.” Several years ago, I found myself immersed in the security space and at that time tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices. Below are some highlights from our conversation:

UNSUPERVISED LEARNING FOR DETECTING FRAUDULENT USERS AND BEHAVIOR

Let me step back a little bit and explain how traditional solutions identify bad accounts or bad behavior. Traditionally, the typical solution is rule-based. For example, a user may not be allowed to just register, and immediately start to transfer money or immediately starting send a lot of email. That behavior is bad, so you write a rule based on that. But a rule-based solution is very reactive. You need to observe what attackers are doing and then based on that, you derive expert rules. Rule-based systems are hard to maintain and are always late because a human needs to observe the bad behavior and start to write the rules. Nowadays, a rule-based system is one solution, but a lot of online services are moving to a machine learning-based solution. They have some bad labels and then they train a model. Discover unknown attacks without requiring labels or training data. Source: Fang Yu, used with permission. In DataVisor, we developed a brand new solution, which is unsupervised. We do not require clients to give us labeled data. In our approach, we do not only look at a single user’s behavior.
We put all the users together and study correlations between the users and how users link to each other, how similar are the users’ actions. Nowadays, bad attackers do not have a single bad account. They usually have tens of accounts, hundreds, even millions of accounts. Using these accounts, they can do spam, they can do “likes,” they do transactions. These accounts usually have high correlations among them because they’re controlled by robots or controlled by trained people. For us, we look at the user-user correlation.

AN ECOSYSTEM THAT SUPPORTS ATTACKS ACROSS DIFFERENT INDUSTRIAL SECTORS

Because we look at the account level and how users behave, our engine is quite general to different sectors. We have clients in social media, mobile gaming, and we’re also working with a client in financial services. The reason that our engine can work across different sectors is that we look at the notion of accounts and the underground ecosystem that supports massive attacks to different services [and which can] have the same set of people. Some people specialize in registering bad accounts, some people specialize in stealing credit cards, and some people specialize in writing templates, etc. So, there is an underground ecosystem in the tools they use, the data centers that they use, the VPNs they use. There are a lot of commonalities across different sectors.

APACHE SPARK

We have clients that send us billions of events per day, so it’s a huge amount of data, and you want to find a small amount of bad users. It’s like finding a needle in a haystack without any labels. It’s very challenging. There are also a lot of the social network elements associated with security. Some attackers want to actively friend because the more they friend, the more they can spam them, etc. The resulting graphs can be massive. One of our founding members also came from Berkeley and he used Spark before; when we wanted to scale the system, Spark was a very natural choice. We have had a very positive experience. Spark is very easy to use and it has a great community; it helped us scale our system pretty well.

Note: Fang Yu’s frequent collaborator and DataVisor co-founder Yinglian Xie will speak about Leveraging Apache Spark to analyze billions of user actions to reveal hidden fraudsters at Strata + Hadoop World in San Jose this March .

Related resources:
* Scalable Machine Learning (video)
* Secure Because Math? Challenges on Applying Machine Learning to Security (video)
* The Security Data Lake (free report)

02/25/2016 Ben Lorica | data show, podcast, security, spark
","[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark. Subscribe to the O’Reilly…",Using Apache Spark to predict attack vectors among billions of users and trillions of events,Live,9 30,"OFFLINE-FIRST IOS APPS WITH SWIFT & CLOUDANT SYNC; PART 1: THE DATASTORE

Jason H. Smith / January 25, 2016

This walk-through is a sequel to Apple’s well-known iOS programming introduction, Start Developing iOS Apps (Swift). Apple’s introduction walks us through the process of building the UI, data, and logic of an example food tracker app, culminating with a section on data persistence: storing the app data as files in the iOS device. This series picks up where that document leaves off: syncing data between devices, through the cloud, with an offline-first design. You will achieve this using open source tools and the free IBM Cloudant service. This document is the first in the series, showing you how to use the Cloudant Sync datastore, CDTDatastore, for FoodTracker on the iOS device. Subsequent posts will cover syncing to the cloud and other advanced features such as accounts and data management.

TABLE OF CONTENTS 1. Getting Started 2. CocoaPods 1. Learning Objectives 2. Install CocoaPods on your Mac 3. Install Cloudant Sync using CocoaPods 4. Change from a Project to a Workspace 3. Compile with Cloudant Sync 1. Learning Objectives 2. Create the CDTDatastore Bridging Header 3. Check the Build 4. Store Data Locally with Cloudant Sync 1. Offline First 2. Learning Objectives 3. The Cloudant Document Model 4. Design Plan 5. Remove NSCoding 6. Initialize the Cloudant Sync Datastore 7. Side Note: Deleting the Datastore in the iOS Simulator 8. Implement Storing and Querying Meals 9. Create Sample Meals in the Datastore 5. Conclusion 6.
Download This ProjectGETTING STARTEDThe FoodTracker main screenThese lessons assume that you have completed the FoodTracker app from Apple’s walk-through. First, complete that walk-through. It will teach youthe process of beginning an iOS app and it will end with the chapter, Persist Data . Download the sample project from the final lesson (the “Download File” linkat the bottom of the page).Extract the zip file, Start-Dev-iOS-Apps-10.zip , browse into its folder with Finder, and double-click FoodTracker.xcodeproj . That will open the project in Xcode. Run the app (Command-R) and confirm thatit works correctly. If everything is in order, proceed with this document.COCOAPODSThe first step is to install CocoaPods which will allow you to quickly and easily use open source packages in your iOSapps. You will use the CocoaPods repository to integrate the Cloudant Sync Datastore library, called CDTDatastore .LEARNING OBJECTIVESAt the end of the lesson, you’ll be able to: 1. Install CocoaPods on your Mac 2. Use CocoaPods to download and integrate CDTDatastore with FoodTrackerINSTALL COCOAPODS ON YOUR MACThe CocoaPods web site has an excellent page, Getting Started , which covers installing and upgrading. For your purposes, you will use themost simple approach to installation, the command-line gem program.To install CocoaPods 1. Open the Terminal application 1. Click the Spotlight icon (a magnifying glass) in the Mac OS task bar 2. Type “terminal” in the Spotlight prompt, and press return 2. In Terminal, type this command:gem install cocoapods Note , if you receive an error message and the CocoaPods gem does not install, try this instead: sudo gem install cocoapods 3. Confirm that CocoaPods is installed with this command: pod --version You should see the CocoaPods version displayed in Terminal: 0.39.0 INSTALL CLOUDANT SYNC USING COCOAPODSTo install CDTDatastore as a dependency, create a Podfile , a simple configuration files which tell CocoaPods which packages this projectneeds.To create a Podfile 1. Choose File > New > File… (or press Command-N) 2. On the left side of the dialog that appears, under “iOS”, select Other. 3. Select Empty, and click Next. 4. In the Save As field, type Podfile . 5. The save location (“Where”) defaults to your project directory.The Group option defaults to your app name, FoodTracker. In the Targets section, make sure both your app and the tests for your app are not selected. 6. Click Create. Xcode will create a file called Podfile which is open in the Xcode editor.Next, configure CDTDatastore in the Podfile.To configure the Podfile 1. In Podfile , add the following codeplatform :ios, '9.1' pod ""CDTDatastore"", '~> 1.0.0' 2. Choose File > Save (or press Command-S)With your Podfile in place, you can now use CocoaPods to install theCDTDatastore pod.To install CDTDatastore 1. Open Terminal 2. Change to your project directory, the directory containing your new Podfile. For example,# Your 'cd' change to the folder you use. cd ""FoodTracker - Persist Data"" 3. Type this command. Note, *this may take a few minutes to complete . pod install --verbose You will see colorful output from CocoaPods in the terminal.CHANGE FROM A PROJECT TO A WORKSPACEBecause you are now integrating FoodTracker with the third-party CDTDatastorelibrary, your project is now a group of projects combined into one useful whole. 
Xcode supports this, and CocoaPods has already prepared you for this transition by creating FoodTracker.xcworkspace for you—a workspace encompassing both FoodTracker and CDTDatastore.

To change to your project workspace 1. Choose File > Close Window (or press Command-W). 2. Choose File > Open (or press Command-O). 3. Select FoodTracker.xcworkspace and click Open.

You will see a similar Xcode view as before, but notice that you now have two projects. Note, when you build or run the app, you may see compiler warnings from CDTDatastore code and its dependencies. You can safely ignore these warnings.

Checkpoint: Run your app. The app should behave exactly as before. Now you know that everything is in its place and working correctly.

COMPILE WITH CLOUDANT SYNC

Your next step is to compile FoodTracker along with CDTDatastore. You will not change any major FoodTracker code yet; however, this will confirm that CDTDatastore and FoodTracker integrate and compile correctly.

LEARNING OBJECTIVES

At the end of the lesson, you’ll be able to create a bridging header to link Swift and Objective-C code.

CREATE THE CDTDATASTORE BRIDGING HEADER

CDTDatastore is written in Objective-C. FoodTracker is a Swift project. Currently, the best way to integrate these projects together is with a bridging header. The bridging header, CloudantSync-Bridging-Header.h, will tell Xcode to compile CDTDatastore into the final app.

To create a header file 1. Choose File > New > File (or press Command-N) 2. On the left side of the dialog that appears, under “iOS”, select Source. 3. Select Header File, and click Next. 4. In the Save As field, type CloudantSync-Bridging-Header . 5. Click the down-arrow expander button to the right of the “Save As” field. This will display the file system tree of the project. 6. Click the FoodTracker folder. 7. Confirm that the Group option defaults to your app name, FoodTracker. 8. In the Targets section, check the FoodTracker target. 9. Click Create. Xcode will create and open a file called CloudantSync-Bridging-Header.h . 10. Under the line which says #define CloudantSync_Bridging_Header_h , insert the following code (the umbrella header name here assumes the standard CDTDatastore pod layout): #import <CloudantSync.h> 11. Choose File > Save (or press Command-S)

The header file contents are done. But, despite its name, this file is not yet a bridging header as far as Xcode knows. The final step is to tell Xcode that this file will serve as the Objective-C bridging header.

To assign a project bridging header 1. Enter the Project Navigator view by clicking the upper-left folder icon (or press Command-1). 2. Select the FoodTracker project in the Navigator. 3. Under Project, select the FoodTracker project. (It has a blue icon). 4. Click “Build Settings”. 5. Click All to show all build settings. 6. In the search bar, type “bridging header.” You should see Swift Compiler – Code Generation and inside it, Objective-C Bridging Header . 7. Double-click the empty space in the “FoodTracker” column, in the row “Objective-C Bridging Header”. 8. A prompt window will pop up. Input the following: FoodTracker/CloudantSync-Bridging-Header.h 9. Press return.

Your bridging header is done! Xcode should look like this:

CHECK THE BUILD

Checkpoint: Run your app. This will confirm that the code compiles and runs. While you have not changed any user-facing app code, you have begun the first step to Cloudant Sync by compiling CDTDatastore into your project.

STORE DATA LOCALLY WITH CLOUDANT SYNC

With CDTDatastore compiled and connected to FoodTracker, the next step is to replace the NSCoder persistence system with CDTDatastore.
Currently, in MealTableViewController.swift , during initialization, the encoded array of meals is loaded from localstorage. When you add or change a meal, the entire meals array is encoded and stored on disk.You will replace that system with a document-based architecture—in other words,each meal will be one record (called a “document” or simply “doc”) in theCloudant Sync datastore.Keep in mind, this first step of using Cloudant Sync does not use the Internet at all . The first goal is simply to store app data locally, in CDTDatastore. Afterthat works correctly, you will add the ability to sync with Cloudant.OFFLINE FIRSTThis is the offline-first architecture , with Internet access being optional to use the app. All data operations areon the local device. If the device has an Internet connection, then the app willsync its data with Cloudant—covered in future posts in this series.LEARNING OBJECTIVESAt the end of the lesson, you’ll be able to: 1. Understand the Cloudant document model: 1. Key-value storage for simple data types 2. Attachment storage for binary data 3. The document ID and revision ID 2. Store meals in the Cloudant Sync datastore 3. Query for meals in chronological order, from the datastoreTHE CLOUDANT DOCUMENT MODELLet’s begin with a discussion of Cloudant basics. The document is the primary data model of the Cloudant database, not only CDTDatastore foriOS, but also for Android, the Cloudant hosted database, and even the opensource Apache CouchDB database.A document, often called a doc , is a set of key-value data. Do not think, “Microsoft Office document”; think“JSON object.” A document is a JSON object: keys (strings) can have values:Ints, Doubles, Bools, Strings, as well as nested Arrays and Dictionaries.Documents can also contain binary blobs, called attachments . You can add, change, or remove attachments in a very similar way as you wouldadd, change, or remove key-value data in a doc.All documents always have two pieces of metadata used to manage them. The document ID (sometimes called _id or simply id ) is a unique string identifying the doc. You use the ID to read, and write aspecific document. When you create a document, you may omit the _id value, in which case Cloudant will automatically generate a unique ID for thedocument.The revision ID (sometimes called _rev or revision ) is a string generated by the datastore which tracks when the doc changes. Therevision ID is mostly used internally by the datastore, especially to facilitatereplication. In practice, you need to remember the basics about revisions : * The revision ID changes every time you update a document. * When you update a document, you provide the current revision ID to the datastore, and the datastore will return to you the new revision ID of the new document. * When you create a document, you do not provide a revision ID, since there is no such “current” document.Finally, note that deleting a document is actually an update, with metadata setto indicate deletion, called a tombstone . Since a delete is an update just like any other, the deleted document willhave its own revision ID. The tombstones are necessary for replication:replicating a tombstone from one database to another will cause doc to bedeleted in both databases. As far as your app is concerned, it can consider thedocument deleted).DESIGN PLANWith this in mind, consider: how will the sample meals that are pre-loaded intothe app work? At first, you might think to create meal documents whenFoodTracker starts. 
That will work correctly the first time the user runs theapp; however, if the user changes or deletes the sample meals, those changes must persist . For example, if the user deletes the sample meals and then restarts the applater, those meals must remain deleted.To support this requirement, you will use document tombstones . This will be the basic design: * Each meal will be represented by a single document. User-created meals will have an automatically-generated document ID; but sample meals will have hard-coded document IDs: “meal1”, “meal2”, and “meal3”.// An example meal document: { ""_id"": ""meal1"", ""name"": ""Caprese Salad"", ""rating"": 4, ""created_at"": ""2016-01-03T02:15:49.727Z"" } * Sample meals have a hard-coded docId . Just before creating a sample meal, first try to fetch the meal by ID. * If CDTDatastore returns a meal doc, that means it has already been created. Do nothing . * If CDTDatastore returns a ""not_found"" error, that means the meal has never been created. Proceed with doc creation . * If CDTDatastore returns a different error, that means the meal has been created and then deleted. Do nothing . Now, you can put this understanding into practice by transitioning to CloudantSync for local app data storage.REMOVE NSCODINGBegin cleanly by removing the current NSCoding system from the model and thetable view controller.To remove NSCoding from the model 1. Open Meal.swift 2. Find the class declaration, which saysclassMeal: NSObject, NSCoding{ 3. Remove the word NSCoding and also the comma before it, making the new class declaration look like this: classMeal: NSObject{ 4. Delete the comment line, // MARK: NSCoding . 5. Delete the method below that, encodeWithCoder(_:) . 6. Delete the method below that, init?(_:) .Next, remove NSCoding from the table view controller.To remove NSCoding from the table view controller 1. Open MealTableViewController.swift 2. Find the method viewDidLoad() , and delete the comment beginning // Load any saved meals and also the if/else code below it:// Load any saved meals, otherwise load sample data.iflet savedMeals = loadMeals() { meals += savedMeals } else { // Load the sample data. loadSampleMeals() } 3. Delete the method loadSampleMeals() , which is immediately beneath the viewDidLoad() method. 4. Find the method tableView(_:commitEditingStyle:forRowAtIndexPath:) and delete the line of code saveMeals() . 5. Find the method unwindToMealList(_:) and delete its last two lines of code: a comment, and a call to saveMeals() . // Save the meals. saveMeals() 6. Delete the comment line, // MARK: NSCoding 7. Delete the method below that, saveMeals() . 8. Delete the method below that, loadMeals() .Checkpoint: Run your app. The app will obviously lose some functionality: loading stored meals, andcreating the first three sample meals; although you can still create, edit, andremove meals (but they will not persist if you quit the app). That is okay. Inthe next step, you will restore these functions using Cloudant Sync instead.INITIALIZE THE CLOUDANT SYNC DATASTORENow you will add loading and saving back to the app, using the Cloudant Syncdatastore. A meal will be a document, with its name and rating stored askey-value data, and its photo stored as an attachment. Additionally, you willstore a creation timestamp, so that you can later sort the meals in the orderthey were created.Begin with the Meal model, the file Meal.swift . You will add a new initialization method which can create a Meal object froma document. 
In other words, the init() method will set the meal name and rating from the document key-value data; andit will set the meal photo from the document attachment.Representing a Meal as a Cloudant document requires few changes besides theinitialization function. The only change to the the actual model is to addvariables for the underlying document ID, and the creation time. By rememberinga meal’s document ID, you will be able to change that doc when the user changesthe meal (e.g. by changing its rating, its name, or its photo). And by storingits creation time, you can later query the database for meals in the order thatthe user created them.To add Cloudant Sync datastore support 1. Open Meal.swift 2. In Meal.swift , in the section MARK: Properties , append these lines so that the variable declarations look like this:// MARK: Propertiesvar name: Stringvar photo: UIImage? var rating: Int// Data for Cloudant Syncvar docId: String? var createdAt: NSDate 3. In Meal.swift , edit the init?(_:photo:rating:) method to accept docId as a final argument, and to set the docId and createdAt properties . When you are finished, the method will look like this: init?(name: String, photo: UIImage?, rating: Int, docId: String?) { // Initialize stored properties.self.name = name self.photo = photo self.rating = rating self.docId = docId self.createdAt = NSDate() super.init() // Initialization should fail if there is no name or if the// rating is negative.if name.isEmpty || rating < 0 { returnnil } } Now add a convenience initializer. This initializer will use a givenCDTDatastore document to create a Meal object.To create a convenience initializer 1. Open Meal.swift 2. In Meal.swift, below the method init?(_:photo:rating:docId:) , add the following code:requiredconvenienceinit?(aDoc doc:CDTDocumentRevision) { iflet body = doc.body { let name = body[""name""] as! Stringlet rating = body[""rating""] as! Intvar photo: UIImage? = niliflet photoAttachment = doc.attachments[""photo.jpg""] { photo = UIImage( data: photoAttachment.dataFromAttachmentContent()) } self.init(name:name, photo:photo, rating:rating, docId:doc.docId) } else { print(""Error initializing meal from document: \(doc)"") returnnil } } That’s it for the model. The Meal class now tracks its underlying document IDand creation time; and it supports convenient initialization directly from ameal document.Since the Meal model initializer has a new docId: String? parameter, you will need to update the one bit of code which initializes Mealobjects, in the Meal view controller.To update the meal view controller 1. Open MealViewController.swift 2. In MealViewController.swift , find the function prepareForSegue(_:sender:) and change the last section of code to (dd , docId: docId ):// Set the meal to be passed to MealTableViewController after the// unwind segue.let docId = meal?.docId meal = Meal(name: name, photo: photo, rating: rating, docId: docId) Now the model has been updated to work from Cloudant Sync documents.Checkpoint: Run your app. The app should build successfully. This will confirm that all changes areworking together harmoniously. Of course, the app behavior is obviouslyincomplete, which you will correct in the next steps.All that remains is to use the datastore from the Meal table view controller.Begin by initializing the datastore and data.To initialize the datastore 1. Open MealTableViewController.swift 2. 
In MealTableViewController.swift , in the section MARK: Properties , append these lines so that the variable declarations look like this:// MARK: Propertiesvar meals = [Meal]() var datastoreManager: CDTDatastoreManager? var datastore: CDTDatastore? 3. In MealTableViewController.swift , append the following code at the end of the method viewDidLoad() : // Initialize the Cloudant Sync local datastore. initDatastore() Now write the initialization function. Begin by creating a code marker for thenew Cloudant Sync datastore methods.To create a code marker for your code 1. Open MealTableViewController.swift 2. In MealTableViewController.swift , find the last method in the class, unwindToMealList(_:) 3. Below that method, add the following:// MARK: Datastore This will be the section of the code where you implement all Cloudant Syncdatastore functionality.To implement datastore initialization , in MealTableViewController.swift , append the following code in the section MARK: Datastore :funcinitDatastore() { let fileManager = NSFileManager.defaultManager() let documentsDir = fileManager.URLsForDirectory(.DocumentDirectory, inDomains: .UserDomainMask).last! let storeURL = documentsDir.URLByAppendingPathComponent(""foodtracker-meals"") let path = storeURL.path do { datastoreManager = tryCDTDatastoreManager(directory: path) datastore = try datastoreManager!.datastoreNamed(""meals"") } catch { fatalError(""Failed to initialize datastore: \(error)"") }}SIDE NOTE: DELETING THE DATASTORE IN THE IOS SIMULATORSometimes during development, you may want to delete the datastore and startover. There are several ways to do this, for example, by deleting the app fromthe simulated device.However, here is a quick command you can paste into the terminal. It will removethe Cloudant Sync database. When you restart the app, the app will initialize anew datastore and behave as if this was its first time to run. For example, itwill re-create the sample meals again.To delete the datastore from the iOS Simulatorrm -i -rv $HOME/Library/Developer/CoreSimulator/Devices/*/data/Containers/Data/Application/*/Documents/foodtracker-mealsThis command will prompt you to remove the files. If you are confident that thecommand is working correct, you can omit the -i option.IMPLEMENT STORING AND QUERYING MEALSWith the datastore initialized, you need to write methods to store and retrievemeal documents. This is the cornerstone of your project. With a few methods tointeract with the datastore, you will enjoy all the benefits the Cloudant Syncdatastore brings: offline-first operation and cloud syncing.For FoodTracker, you will have two primary ways of persisting meals in thedatastore: creating meals and updating meals. Each of these will have its ownmethod, but the methods will share some common code to populate a meal documentwith the correct data. Begin by writing this method. Given a Meal object and aCloudant document, it will copy all of the meal data to the document, so thatthe latter can be created or updated as needed.To implement populating a meal document 1. Open MealTableViewController.swift 2. In MealTableViewController.swift , in the section MARK: Datastore , append a new method:funcpopulateRevision(meal: Meal, revision: CDTDocumentRevision?) { // Populate a document revision from a Meal.let rev: CDTDocumentRevision = revision ?? 
CDTDocumentRevision(docId: meal.docId) rev.body[""name""] = meal.name rev.body[""rating""] = meal.rating // Set created_at as an ISO 8601-formatted string.let dateFormatter = NSDateFormatter() dateFormatter.locale = NSLocale(localeIdentifier: ""en_US_POSIX"") dateFormatter.timeZone = NSTimeZone(abbreviation: ""GMT"") dateFormatter.dateFormat = ""yyyy-MM-dd'T'HH:mm:ss.SSS'Z'""let createdAtISO = dateFormatter.stringFromDate(meal.createdAt) rev.body[""created_at""] = createdAtISO iflet data = UIImagePNGRepresentation(meal.photo!) { let attachment = CDTUnsavedDataAttachment(data: data, name: ""photo.jpg"", type: ""image/jpg"") rev.attachments[attachment.name] = attachment } } Next, implement the method to create new meal documents. Note that sample mealswill have hard-coded document IDs, so that you can detect if they have alreadybeen created or not. User-created meals will have no particular doc ID.To implement meal document creation 1. In MealTableViewController.swift , in the section MARK: Datastore , append a new method:// Create a meal. Return true if the meal was created, or false if// creation was unnecessary.funccreateMeal(meal: Meal) -> Bool { // User-created meals will have docId == nil. Sample meals have a// string docId. For sample meals, look up the existing doc, with// three possible outcomes:// 1. No exception; the doc is already present. Do nothing.// 2. The doc was created, then deleted. Do nothing.// 3. The doc has never been created. Create it.iflet docId = meal.docId { do { try datastore!.getDocumentWithId(docId) print(""Skip \(docId) creation: already exists"") returnfalse } catchlet error asNSError { if (error.userInfo[""NSLocalizedFailureReason""] as? String != ""not_found"") { print(""Skip \(docId) creation: already deleted by user"") returnfalse } print(""Create sample meal: \(docId)"") } } let rev = CDTDocumentRevision(docId: meal.docId) populateRevision(meal, revision: rev) do { let result = try datastore!.createDocumentFromRevision(rev) print(""Created \(result.docId)\(result.revId)"") } catch { print(""Error creating meal: \(error)"") } returntrue } Now you are ready to write the update method. Note that “deleting” a Cloudantdocument is in fact a type of update . The update method will accept a Bool parameter indicating whether to deletethe document or not. However, to keep the rest of the code simple, you willwrite one-line convenience methods deleteMeal(_:) and updateMeal(_:) to set the deletion flag automatically.To implement deleting and updating meal documents 1. In MealTableViewController.swift , in the section MARK: Datastore , append the two convenience methods and then the full implementation.funcdeleteMeal(meal: Meal) { updateMeal(meal, isDelete: true) } funcupdateMeal(meal: Meal) { updateMeal(meal, isDelete: false) } funcupdateMeal(meal: Meal, isDelete: Bool) { guardlet docId = meal.docId else { print(""Cannot update a meal with no document ID"") return } let label = isDelete ? 
""Delete"" : ""Update""print(""\(label)\(docId): begin"") // First, fetch the current document revision from the DB.var rev: CDTDocumentRevisiondo { rev = try datastore!.getDocumentWithId(docId) populateRevision(meal, revision: rev) } catch { print(""Error loading meal \(docId): \(error)"") return } do { var result: CDTDocumentRevisionif (isDelete) { result = try datastore!.deleteDocumentFromRevision(rev) } else { result = try datastore!.updateDocumentFromRevision(rev) } print(""\(label)\(docId) ok: \(result.revId)"") } catch { print(""Error updating \(docId): \(error)"") return } } Your app can now create, update, and delete meal docs. To complete this feature,these methods must be integrated with UI. When the user saves or deletes a meal,the controller must run these methods.To create and update meals 1. In MealTableViewController.swift , in the method unwindToMealList(_:) , modify the method body so that it calls updateMeal() or createMeal() as appropriate. The code will look as follows:iflet selectedIndexPath = tableView.indexPathForSelectedRow { // Update an existing meal. meals[selectedIndexPath.row] = meal tableView.reloadRowsAtIndexPaths([selectedIndexPath], withRowAnimation: .None) updateMeal(meal) } else { // Add a new meal.let newIndexPath = NSIndexPath(forRow: meals.count, inSection: 0) meals.append(meal) tableView.insertRowsAtIndexPaths([newIndexPath], withRowAnimation: .Bottom) createMeal(meal) } 2. In the method tableView(_:commitEditingStyle:forRowAtIndexPath) , insert a call to deleteMeal(_:) for the .Delete editing event. The code will look as follows. if editingStyle == .Delete { // Delete the row from the data sourcelet meal = meals[indexPath.row] deleteMeal(meal) meals.removeAtIndex(indexPath.row) tableView.deleteRowsAtIndexPaths([indexPath], withRowAnimation: .Fade) The final thing to write is the code to query for meals in the datastore. Thiscode has two parts: initializing an index during app startup (to query bytimestamp), and of course the code to query that index.To support querying meals by timestamp 1. In MealTableViewController.swift , in the method initDatastore() , append this code:datastore?.ensureIndexed([""created_at""], withName: ""timestamps"") // Everything is ready. Load all meals from the datastore. loadMealsFromDatastore() 2. In MealTableViewController.swift , in the section MARK: Datastore , append this method: funcloadMealsFromDatastore() { let query = [""created_at"": [""$gt"":""""]] let result = datastore?.find(query, skip: 0, limit: 0, fields:nil, sort: [[""created_at"":""asc""]]) guard result != nilelse { print(""Failed to query for meals"") return } meals.removeAll() result!.enumerateObjectsUsingBlock({ (doc, idx, stop) -> Voidiniflet meal = Meal(aDoc: doc) { self.meals.append(meal) } }) } That’s it! The most intricate part of your code is finished.CREATE SAMPLE MEALS IN THE DATASTORENow is time to create sample meal documents during app startup. This method willrun every time the app initializes. For each sample meal, it will call createMeal(_:) which will either create the documents or no-op, as needed.To create sample meals during app startup 1. In MealTableViewController.swift , in the section MARK: Datastore , add a new method:funcstoreSampleMeals() { let photo1 = UIImage(named: ""meal1"")! let photo2 = UIImage(named: ""meal2"")! let photo3 = UIImage(named: ""meal3"")! let meal1 = Meal(name: ""Caprese Salad"", photo: photo1, rating: 4, docId: ""sample-1"")! 
let meal2 = Meal(name: ""Chicken and Potatoes"", photo: photo2, rating: 5, docId: ""sample-2"")!
let meal3 = Meal(name: ""Pasta with Meatballs"", photo: photo3, rating: 3, docId: ""sample-3"")!

// Hard-code the createdAt property to get consistent revision IDs. That way, devices that share
// a common cloud database will not generate conflicts as they sync their own sample meals.
let comps = NSDateComponents()
comps.day = 1
comps.month = 1
comps.year = 2016
comps.timeZone = NSTimeZone(abbreviation: ""GMT"")
let newYear = NSCalendar.currentCalendar().dateFromComponents(comps)!

meal1.createdAt = newYear
meal2.createdAt = newYear
meal3.createdAt = newYear

createMeal(meal1)
createMeal(meal2)
createMeal(meal3)
}

2. In MealTableViewController.swift , in the method initDatastore() , insert a call to storeSampleMeals() before the code initializing the index. The final lines of the method will look as follows:

storeSampleMeals()
datastore?.ensureIndexed([""created_at""], withName: ""timestamps"")

// Everything is ready. Load all meals from the datastore.
loadMealsFromDatastore()
}

Checkpoint: Run your app. The app should behave exactly as it did at the beginning of this project.

CONCLUSION

Congratulations! While the app remains unchanged superficially, you have made a very powerful upgrade to FoodTracker’s most important aspect: its data. You have transformed the data layer from a minimal, unexceptional side note to become a flexible, powerful database. This database can be queried, searched, scaled, and replicated between devices and through the cloud. The next update of this series will cover replicating this data to the cloud using IBM Cloudant. Indeed, implementing cloud syncing is much simpler than the work from this lesson. You have completed laying the foundation!

DOWNLOAD THIS PROJECT

To see the completed sample project for this lesson, download the file and view it in Xcode. Download File

* Tagged: cloudant / iOS / Mobile / swift","Apple's sample app, Food Tracker, taught you iOS. Now, take it further and sync data between devices, through the cloud, with an offline-first design.",Offline-First iOS Apps with Swift & Cloudant Sync; Part 1: The Datastore,Live,10 33,"Warehousing data from Cloudant to dashDB greatly enhances your options to analyze that data. Now, we have extended this capability to include GeoJSON documents.
An ever increasing number of mobile and internet-of-things (IoT) applications capture and store geospatial data in NoSQL databases such as those provided by Cloudant. GeoJSON is the de-facto standard for such data. This new capability enables your data analysis to reflect geospatial aspects. For example, you can gain even more data insight by combining existing data with geospatial data from other sources, such as weather services, which become simple to integrate under IBM’s new partnership with The Weather Company. Or you can use the power of the statistical R language to do the geospatial analysis. In this post my co-author, Holger Kache, and I briefly describe how GeoJSON documents are warehoused. Then we will illustrate the new capabilities with some examples.

As a Cloudant user interested in capturing geospatial information, GeoJSON is probably familiar to you. Cloudant already offers ways to store your spatial data in GeoJSON and the ability to do basic analysis with spatial functions. Anyway, let’s briefly touch on how GeoJSON documents are structured. In essence, GeoJSON documents usually come in one of three form factors:

Atomic geometries, with a geometry type (like Point, LineString, Polygon, etc.) and coordinates, like this LineString:

{ ""type"": ""LineString"", ""coordinates"": [[-71.06, 42.36], [-71.05, 42.37]] }

Feature types, which are atomic geometries along with some text properties. For example, a name:

{ ""type"": ""Feature"", ""geometry"": { ""type"": ""Point"", ""coordinates"": [-71.06, 42.36] }, ""properties"": { ""name"": ""Boston"" } }

FeatureCollection types, which combine lists of Feature types under a common roof:

{ ""type"": ""FeatureCollection"", ""features"": [ { ""type"": ""Feature"", ""geometry"": { ... }, ""properties"": { ... } } ] }

Because the three form factors have different structures, they would result in different table structures under the standard (non-GeoJSON) warehousing. For GeoJSON, in contrast, we don’t want that. We want one consistent table structure for all three form factors. Cloudant achieves this consistency by internally converting each document to a FeatureCollection type first. Let’s look at this in more detail.

Suppose that we created a Cloudant database called geojson_demo that contains the three example documents from above. Let’s take a look at how these documents are warehoused into dashDB tables. The process of scheduling a warehouse is unchanged from what was described in this earlier post. So we will jump directly to the newly created tables in the dashDB warehouse, as seen in the dashDB Tables page: As you can see, the warehousing process created three tables: a base table called GEOJSON_DEMO, a feature table called GEOJSON_DEMO_FEATURES, and an overflow table GEOJSON_DEMO_OVERFLOW that contains potential issues encountered during the warehousing. Here is the base table GEOJSON_DEMO: Each document results in one row. You can see that they were all converted to the FeatureCollection type. The most interesting table is the GEOJSON_DEMO_FEATURES table because it contains the geometries in the GEOMETRY column along with the property name in the PROPERTIES_NAME column: In this table each geometry occupies one row. Again, all geometries appear as the Feature type in this table regardless of their initial structure. This makes it easy for you to access the different geometries in dashDB: you can always expect them to show up in the GEOMETRY column of the table, along with the properties in columns named PROPERTIES_<property name>. The geometries are stored in one of dashDB’s geospatial data types such as ST_Point, ST_LineString, ST_Polygon, or the like. You might have noticed another difference from standard warehousing: all table and column names are now uppercase instead of mixed case. The reason for this is that many database access tools like ESRI’s ArcMap expect uppercase database objects.
Again, we made it as easy as possible for you to access the warehousing results. In this first version of the GeoJSON support, there are some restrictions that you should be aware of:

First, we expect a homogeneous database with respect to geometry types. Consequently, if you mix different geometry types like Points, LineStrings, or Polygons in one database, we will warehouse only the documents that contain the most frequently occurring geometry type and reject all others. You can find information about the rejected documents in the EXCEPTION column of the overflow table.

We only support the default Coordinate Reference System (CRS), WGS84. If you specify a different CRS, you will get a warning in the WARNING column of the overflow table like this one:

We do not support the geometry type GeometryCollection because there is no suitable geometry type available in dashDB. So if your data has the GeometryCollection type, restructure it to the FeatureCollection type instead.

The GeoJSON bounding box member bbox is ignored because in dashDB we internally calculate the bounding box of each geometry at loading time.

Now that your GeoJSON data is in the warehouse, it is time to kick off some spatial analysis. Suppose you have warehoused the Boston criminal incidence reports (which you can find as a GeoJSON database under http://opendata.cloudant.com/crimes). Let’s take a look at it by using ESRI’s ArcMap tool: Or, by using ArcMap’s kernel density tool, you can directly highlight the critical areas. Because the tool will import all the data first, this process is rather time consuming. But there is a way to speed things up. Look at this example where we join the crimes data with the neighborhood districts to highlight critical districts: For this analysis, we imported the neighborhood districts, available in shape format, into dashDB and joined them with the point geometries of the warehoused crimes database. This join is fast because it is done at the database level, without exporting the data to the tool first. The underlying SQL looks like this:

1 SELECT CN.NAME, CN.NUM_CRIMES / BN.""Acres"" AS CDENSITY, BN.GEO_DATA
2 FROM
3 (SELECT N.""Name"" AS NAME, COUNT(C.""_ID"") AS NUM_CRIMES
4 FROM
5 CRIMES_FEATURES AS C,
6 BOSTON_NEIGHBORHOODS AS N
7 WHERE
8 DB2GSE.ST_CONTAINS(N.GEO_DATA, C.GEOMETRY) = 1
9 GROUP BY N.""Name"") AS CN,
10 BOSTON_NEIGHBORHOODS AS BN
11 WHERE
12 CN.NAME = BN.""Name"";

As you can see, the neighborhood polygons N.GEO_DATA are joined with the points of the crimes table C.GEOMETRY (line 8). The crime density CDENSITY is calculated as the count of crimes per district divided by the area in acres (line 1).

Once your data is in dashDB you also have the full power of the statistical R language at hand. As one example, you might be interested in how many crimes happen over the course of a day. Here is the answer: To produce this graph, within the dashDB console we chose the menu item Analytics > R Scripts, pasted in the following script, and then clicked Submit. Again, this will be fast because we are using the in-database analytics functions of R.

# Init
library(ggplot2)
# Connect to the database and read in the data frame
idaInit(idaConnect(""BLUDB"","""",""""))

And finally
There are lots of other ways that you can exploit geospatial data by using the Cloudant to dashDB integration: use other data sources, take advantage of the geospatial capabilities of R, use it in a mobile scenario… you name it. So go ahead and try out what works for you – and let us know.
Also, if you encounter any shortcomings, leave us a comment or email.

References and some more links
* The GeoJSON Format Specification
* Cloudant blog: Introducing Data Warehousing and Analytics with Cloudant and dashDB
* More on IBM dashDB
* More on geospatial processing with dashDB
* Holger’s blog on Warehouse style analytics for the cloud
* YouTube Video on Analyzing Geospatial Data with IBM dashDB and Esri ArcGIS for Desktop","Replicating data to a relational dashDB database greatly enhances your options to analyze that data. In addition to the ability to query the warehouse with SQL, you can use the power of the statistical R language to do the analysis. Now, we have extended Cloudant’s warehousing capability to include GeoJSON documents. An ever increasing number of mobile and internet-of-things (IOT) applications capture and store geospatial data in NoSQL databases such as those provided by Cloudant. GeoJSON is the de-facto standard for such data. This new capability enables your data analysis to reflect geospatial aspects.",Warehousing GeoJSON documents,Live,11 36,"Recipes@IoTF

TIMESERIES DATA ANALYSIS OF IOT EVENTS BY USING JUPYTER NOTEBOOK

THIS RECIPE SHOWCASES HOW ONE CAN ANALYZE THE HISTORICAL TIME SERIES DATA, CAPTURED ON THE IBM WATSON IOT PLATFORM, IN A JUPYTER NOTEBOOK USING SPARK SQL AND PANDAS DATAFRAMES. ALSO, USE THE PRE-INSTALLED MATPLOTLIB LIBRARY TO VISUALIZE RESULTS.

REQUIREMENTS * IBM Bluemix account * Git (Optional) * Maven (Optional)

SKILL LEVEL INTERMEDIATE Basic knowledge of 1. IBM Watson IoT Platform 2. Apache Spark 3. Cloudant NoSQL 4. Pandas for Data Manipulation

RECIPES TO ENHANCE ANALYTICS IN IBM WATSON IOT PLATFORM Before you proceed, evaluate the following analytical recipes that suit your needs.

INTRODUCTION In the previous recipe “ Engage Machine Learning for detecting anomalous behaviors of things ”, we saw how one can integrate IBM Watson IoT, Apache Spark service, Predictive Analysis service and Real-Time Insights to take timely action before an (unacceptable) event occurs. And in this recipe, we will make use of the data (historical data) produced by the previous recipe to discover the hidden patterns and the temperature trend over days, months and years using Apache Spark SQL, Pandas DataFrames and Jupyter Notebook.

What is Spark SQL and DataFrames? Apache Spark SQL is a Spark module for structured data processing.
Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL, the DataFrames API and the Datasets API. And in this recipe we will be using DataFrams to analyze and visualize the temperature data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs . To make things simple, this recipe does not alter the previous recipe setup, rather it just adds a Node-RED application and Cloudant NoSQL DB service on top of part 1 of the recipe as shown below, In case if you want to look at the overall architecture that shows the components of part1 and part2, take a look at this link . As shown, the Node-RED application will subscribe to the results (which contains the actual temperature, forecasted temperature, zscore and wzscore values) from the Watson IoT Platform and store them in a Cloudant NoSQL DB. This Cloundat NoSQL DB will act as a historical data storage. Once the Cloudant NoSQL DB is filled with enough data, this recipe will use the Jupyter Notebook to load the data into the Spark engine and use Spark SQL, Panda DataFrames, other graphical libraries to analyze the data and show the results in charts or graphs. Also, One can use the sample application present in the github to generate the historical data without running the previous recipe code. The steps are detailed in the following section. CREATE A NODE-RED APPLICATION In this step, we will create a Node-RED application which will store the results into Cloudant DB. Create Node-RED application 1. Open your favorite browser and go to Bluemix . If you are an existing Bluemix user, log in as usual. If you are new to Bluemix you can sign up for a free 30 day trial. 2. Once you signed up to Bluemix, click this link to create the Node-RED starter application in Bluemix. 3. Type a name for your application and click the Create button. 4. Wait for the Bluemix to create the application. Note that the Cloudant service is created along with the Node-RED application, so no need to create the Cloudant service separately. Create Node-RED flow 1. Once the application is created, Click on the application URL to open the Node-RED landing page. ( Note : Your application must be running for this to work, If your application has stopped for any reason, select the Restart button and wait for it to successfully restart). 2. Select “ Go to your Node-RED flow editor ” button to enter into the Node-RED flow editor. 3. Navigate to the menu at the top right of the screen and select Import from Clipboard. Copy the JSON string from the text area below and paste it into the dialog box in Node-RED and select OK. 
If there any issues, copy the contents from github .[{""id"":""f2818749.4a8ad8"",""type"":""ibmiot"",""z"":""519983d4.82448c"",""name"":""coi0nz""},{""id"":""f396adc3.b4e0d8"",""type"":""ibmiot in"",""z"":""519983d4.82448c"",""authentication"":""apiKey"",""apiKey"":""f2818749.4a8ad8"",""inputType"":""evt"",""deviceId"":"""",""applicationId"":"""",""deviceType"":""+"",""eventType"":""result"",""commandType"":"""",""format"":""json"",""name"":""IBM IoT"",""service"":""registered"",""allDevices"":true,""allApplications"":"""",""allDeviceTypes"":true,""allEvents"":false,""allCommands"":"""",""allFormats"":"""",""x"":206.1999969482422,""y"":144.1999969482422,""wires"":[[""a98be6b0.04c158"",""f42ea830.63aee8""]]},{""id"":""f42ea830.63aee8"",""type"":""cloudant out"",""z"":""519983d4.82448c"",""service"":""gateway-sample-cloudantNoSQLDB"",""cloudant"":"""",""name"":""Cloudant Store"",""database"":""recipedb"",""payonly"":true,""operation"":""insert"",""x"":460.1999969482422,""y"":231.1999969482422,""wires"":[]},{""id"":""a98be6b0.04c158"",""type"":""debug"",""z"":""519983d4.82448c"",""name"":""Debug"",""active"":true,""console"":""false"",""complete"":""payload"",""x"":382.49998474121094,""y"":144,""wires"":[]}] 4. This imported flow has 3 nodes * IBM IoT In node – This node subscribes to all ‘result’ events published by any device in the same organization * Debug node – This node displays the above events in the debug tab of the Node-RED flow * Cloudant Store – This node persists the event in the ‘ recipedb ‘ store in Cloudant DBNote that this imported flow is not complete. Carry out the following steps to complete the flow. 5. Double click on the IBM IoT node and enter the IoT credentials, such as API Key and API Token as shown below and leave the other fields as is, 6. If the previous recipe is already running, then you should observe the result events in the debug window of the Node-RED and also in the Cloudant NoSQL DB. (In case if you want to quickly load the Cloudant NoSQL DB without running the previous recipe code, carry out the steps mentioned in the last sub-section of this section) View the events in Cloudant NoSQL DB 1. Go to Bluemix Dashboard, 2. Click the Node-RED application that you created in this step. 3. Observe that a Cloudant NoSQL DB service present as part of the application. Click on the service and then the Launch button. 4. Observe that a Databased called “ recipedb ” is created where all the result events are stored. 5. Click recipedb to enter inside the database and click on any document to view the events. Retrieve the credentials of Cloudant DB to load the events in Spark 1. Go back to Bluemix Dashboard, 2. Click the Node-RED application that you created in this step. 3. Click on the Show Credentials tab as shown below and note down the username & password. This will be required to load the events into the Spark engine. Load Cloudant DB with sample data – Required only if you want to bypass the previous recipe and quickly generate the historical data, 1. Download and install Maven and Git if not installed already. 2. Clone the iot-predictive-analytics repository as follows:git clone https://github.com/ibm-messaging/iot-predictive-analytics-samples.git 3. Navigate to the DeviceDataGenerator project and build the project using maven,mvn clean package  (This will download all required dependencies and starts the building process. Once built, the sample can be located in the target directory, with the filename IoTDataGenerator-1.0.0-SNAPSHOT.jar) 4. 
Run the Historical generator sample using the following command: mvn exec:java -Dexec.mainClass=""com.ibm.iot.iotdatagenerator.HistoricalDataGenerator"" -Dexec.args="" ""  5. Observe that the application connects to the Cloudant NoSQL DB service and stores the simulated resultant events as documents. Observe that the timestamp of the first document is January 18th and the interval between 2 records are 2 minutes. One can modify the code HistoricalDataGenerator.java to control the timestamp. In this step, we have successfully created a Node-RED application to store the results into the Cloudant NoSQL DB. CREATE A SPARK SQL DATAFRAME In this step, we will create the Notebook application and load the Cloudant data into Apache Spark service. What is Jupyter Notebook? The Jupyter Notebook is a web application that allows one to create and share documents that contain executable code, mathematical formulae, graphics/visualization (matplotlib) and explanatory text. Its primary use includes: 1. Data cleaning and transformation, 2. Numerical simulation, 3. Statistical modeling, 4. Machine learning and much more. Create a Notebook 1. While the first notebook is running, go back to the Bluemix Catalog and open the same Apache Spark service that you created as part of recipe 1 . 2. Click NOTEBOOKS button to show existing Notebooks. Click on NEW NOTEBOOK button. 3. Enter a Name, under Language select Python and click CREATE NOTEBOOK button to create a new notebook. Load data into Spark and perform basic operations 1. Go to the notebook, In the first cell (next to In [ ] ), enter the following command that creates the SQLContext and click Run. The SQLContext is the main entry point into all functionality in Spark SQL and is necessary to create the DataFrames.sqlContext=SQLContext(sc) 2. Enter the following statements into the second cell, and then click Run . Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. This command reads the recipedb database from the Cloudant account and assigns it to the cloudantdata variable.cloudantdata=sqlContext.read.format(""com.cloudant.spark""). option(""cloudant.host"",""hostname""). option(""cloudant.username"", ""username""). option(""cloudant.password"", ""password""). load(""recipedb"") 3. Enter the following statement into the third cell, and then click Run . This command will return the schema as shown below,cloudantdata.printSchema() out[3]:root |– _id: string (nullable = true) |– _rev: string (nullable = true) |– forecast: double (nullable = true) |– name: string (nullable = true) |– temperature: double (nullable = true) |– timestamp: string (nullable = true) |– wzscore: double(nullable = true) |– zscore: double (nullable = true) 4. Enter the following command in the next cell to look at one record (document) and click Run ,cloudantdata.take(1) [Row(_id=u’0001683791e04032a4ca0955b70b12f8′, _rev=u’1-e1fac6e387132edcb0450c7b2f26d35b’, forecast=17.530496368661147, name=u’datacenter’, temperature=17.53, timestamp=u’2016-Mar-14 14:28:00′, wzscore=-0.3973792127656496, zscore=-0.062204879122082946)] 5. Enter the following command in the next cell to get the number of rows in the Cloundant NoSQL DB and click Run ,cloudantdata.count() 33980 6. Enter the following command in the next cell to get only the temperature values and click Run, (Note that it will return only the top 20 rows ), cloudantdata.select(""temperature"").show() +—————+ |temperature| +—————+ | 18.66| | 18.5| | 18.53| | 18.56| | 17.595| …………. 
| 17.5| +—————+ only showing top 20 rows In this step we have successfully loaded the historical (Cloudant NoSQL DB) data into the Spark Service and explored the schema. CREATE A PANDAS DATAFRAME In this step we will convert the Spark SQL DataFrame into Pandas timeseries Dataframe and perform basic operations. The Python Data Analysis Library (a.k.a. pandas) provides high-performance, easy-to-use data structures and data analysis tools that are designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Create a Pandas DataFrame 1. Enter the following commands in the next cell to create a Pandas DataFrame from the Spark SQL DataFrame and click Run. This line prints the schema of the newly created Pandas DataFrame which will be same as the Spark SQL DataFrame,import pprint import pandas as pd pandaDF = cloudantdata.toPandas() #Fill NA/NaN values to 0 pandaDF.fillna(0, inplace=True) pandaDF.columns Index([u’_id’, u’_rev’, u’forecast’, u’name’, u’temperature’, u’timestamp’, u’wzscore’, u’zscore’], dtype=’object’) 2. Using len on a DataFrame will give the number of rows as shown below,len(pandaDF) 33980 3. Columns can be accessed in two ways in Pandas. The first is using the DataFrame like a dictionary with string keys,pandaDF[""temperature""] 0 17.69 1 17.38 2 16.56 3 17.69 4 17.50 ……….. 4. You can get multiple columns out at the same time by passing in a list of strings as shown below,pandaDF[[""timestamp"",""temperature""]] timestamp temperature 0 2016-Mar-14 14:28:00 17.530 5. The second way to access columns is using the dot syntax. This only works if your column name could also be a Python variable name (i.e., no spaces), and if it doesn’t collide with another DataFrame property or function name (e.g., count, sum).pandaDF.temperature Create datetime as the index By default Pandas DataFrame uses the sequence number as index, since we analyze the timeseries data its better If we use datetime instead of integers for our index, we will get some extra benefits from pandas when plotting later on. This section will focus on doing the same, 1. Enter the following code in the next cell to make the timestamp as the index.#import the datatime library from datetime import datetime # convert the time from string to panda's datetime pandaDF.timestamp = pandaDF.timestamp.apply(lambda d: datetime.strptime(d, ""%Y-%b-%d %H:%M:%S"")) pandaDF.index = pandaDF.timestamp # Drop the timestamp column as the index is replaced with timestamp now pandaDF = pandaDF.drop([""timestamp""], axis=1) pandaDF.head() # Also, sort the index with the timestamp pandaDF.sort_index(inplace=True) 2. Enter the following command in the next cell to retrieve a row corresponding to a particular time and click Run, one can retrieve the temperature reading based on a relative datetime by first finding a closest time and then querying for it as shown below,''' One can query the temperature based on the datetime, incase if you are not sure about the exact time, then use searchsorted() method to get to the nearest date ''' date = pandaDF.index.searchsorted(datetime(2016, 2, 18, 17, 44, 23)) pandaDF.ix[date] _id 4d5306615f5a416e97921b3b70b75ab7 _rev 1-29ae6dc3d931fdeea1d9d20e5d50c5b8 forecast 17.61406 name datacenter temperature 17.69 wzscore 1.695702 zscore 0.3832717 Name: 2016-02-18 17:46:00, dtype: object As shown above, the row corresponding to the closest time is retrieved . 
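A datetime index also makes windowed queries and time-based aggregation one-liners. The short sketch below is not part of the original recipe; it assumes the pandaDF built above (sorted DatetimeIndex, a temperature column) and uses label-based .loc slicing plus resample, the modern equivalents of the .ix lookups shown in this notebook.
# Minimal sketch (assumes pandaDF from the steps above, with its sorted DatetimeIndex)
from datetime import datetime

one_day = pandaDF.loc[datetime(2016, 3, 14):datetime(2016, 3, 15)]  # readings for one day
hourly_mean = one_day['temperature'].resample('H').mean()           # hourly average temperature
print(hourly_mean.head())
Because the index is sorted, slices like this are cheap, and the same pattern carries over to the maximum and average calculations later in the recipe.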
In this step we have successfully created a Panda dataframe and performed few basic operations. In the next section we will see how one can visualize the temperature data using the matplotlib visualization library. VISUALIZE TEMPERATURE READINGS When working with interactive notebooks, one can decide how to present results and information. So far, we have used normal print functions which are informative. In this section, we will show how one can visualize the temperature data using the Pandas DataFrames and matplotlib library. 1. Enter the following command in the next cell to generate the histogram for the temperature and click Run, This will tell how well the temperature readings are distributed,#tell Jupyter to render charts inline: %matplotlib inline import matplotlib.pyplot as plt pandaDF.temperature.hist()  2. Enter the following commands in the next cell to plot the overall temperature and click Run , Observe that the graph is drawn with the timestamp in x axis and temperature values in y axis. Also, observe 2 Red lines showing the upper and lower thresholds,# Draw overall temperature %matplotlib inline import matplotlib.pyplot as plt import numpy as np plotDF = pandaDF[['temperature']] import matplotlib.dates as dates fig, ax = plt.subplots() plotDF.plot(figsize=[20,10], ax=ax, grid=True) ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature"",fontsize=20) ax.set_title(""Overall Temperature"", fontsize=20) ax.set_ylim([12,22]) # Draw lines to showcase the upper and lower threshold ax.axhline(y=19,c=""red"",linewidth=2,zorder=0) ax.axhline(y=15,c=""red"",linewidth=2,zorder=0) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=5, maxticks=None, interval_multiples=False)) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() 3. If you want to plot only the last 400 values, then enter the following commands in the next cell and click Run . This will help one to understand the recent state of the system.# Draw Last 400 temperature values plotDF = pandaDF[['temperature']] fig, ax = plt.subplots() plotDF.tail(400).plot(figsize=[20,10], ax=ax, grid=True) ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature"",fontsize=20) ax.set_title(""Recent 400 Temperature Values"", fontsize=20) ax.set_ylim([12,22]) ax.axhline(y=19,c=""red"",linewidth=2,zorder=0) ax.axhline(y=15,c=""red"",linewidth=2,zorder=0) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=5, maxticks=None, interval_multiples=False)) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() 4. Similarly you can plot temperature values along with zscore & wzscore by entering the following commands into the next cell and click Run, In the following example, we plot the graph between 2 days. 
# Draw temperature chart with normal zscore & wzscore start = datetime(2016, 4, 19) end = datetime(2016, 4, 20) plotDF = pandaDF.ix[start:end] plotDF = plotDF[['temperature','zscore','wzscore']] if (len(plotDF) > 0): fig, ax = plt.subplots() plotDF.plot(figsize=[20,10], ax=ax, grid=True) # format the axis ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature and zscore"",fontsize=20) ax.set_title(""Temperatures between "" + str(start) + "" and "" + str(end) + "" with zscore"", fontsize=20) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=False)) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() else: print ""There are no rows matching the given condition, Try changing the dates"" 5. Enter the following command to overlay the zscore & wzscore along with the temperature, this will help one to understand the deviations better,# Draw temperature chart with scaled zscore & wzscore # define a method that scales zscore with the temperature def scaleZscore(row): return row['zscore'] + row['temperature'] # define a method that scales wzscore with the temperature def scaleWZscore(row): return row['wzscore'] + row['temperature'] # apply the functions pandaDF['scaledzscore'] = pandaDF.apply(scaleZscore, axis=1) pandaDF['scaledwzscore'] = pandaDF.apply(scaleWZscore, axis=1) start = datetime(2016, 2, 19) end = datetime(2016, 2, 20) plotDF = pandaDF.ix[start:end] if (len(plotDF) > 0): # create a dataframe with a required fields that we want to plot plotDF = plotDF[['temperature','scaledzscore','scaledwzscore']] fig, ax = plt.subplots() plotDF.plot(figsize=[23,12], ax=ax) ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature and zscore"",fontsize=20) ax.set_title(""Temperatures between "" + str(start) + "" and "" + str(end) + "" with scaled zscore"", fontsize=20) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=True)) ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0) ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0) ax.set_ylim([13,21]) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() else: print ""There are no rows matching the given condition, Try changing the dates"" 6. One can visualize the temperature readings between 2 different times specified. For example, the following code allows one to visualize the temperature over the last 2 days, (Note, you might observe a failure if there isn’t any data in the last 2 days)from datetime import * import pytz # retrieve the current temperature now = datetime.now(pytz.timezone('UTC')) ''' get the start time that will be behind 2 days from now, just modify ""days = 2"" to ""hours=2"" in case if you want to retrieve the temperature from last 2 hours. 
''' last_n_days = now - timedelta(days=2) plotDF = pandaDF.ix[last_n_days:now] if len(plotDF) > 0: plotDF = plotDF[['temperature','scaledzscore']] fig, ax = plt.subplots() plotDF.plot(figsize=[20,10], ax=ax) # choose the colours for each column with pd.plot_params.use('x_compat', True): plotDF.temperature.plot(color='b') plotDF.scaledzscore.plot(color='r'); ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature and scaledzscore"",fontsize=20) ax.set_title(""Temperature in last 2 days"", fontsize=20) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=True)) ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0) ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0) ax.set_ylim([13,21]) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() else: print ""There are no rows matching the given condition, Trying changing the dates"" In this step, we have successfully analyzed the temperature data and visualized the results using bar and line charts. OPERATIONS RELATED TO MAXIMUM TEMPERATURE In this step, we will see how to use the Pandas DataFrames to find the maximum temperature over the hour, day, year and etc.. 1. Enter the following command in the next cell to find out the overall maximum temperature and click Run,# find the maximum temperature maximum = pandaDF.temperature.max() maximum 19.78 2. Enter the following statements in the next cell to find out all the instances where the temperature has crossed 19 degree and click Run . Observe that it returns all the rows where the temperature is greater than 19 degree.threshold_crossed_days = pandaDF[pandaDF.temperature > 19] threshold_crossed_days 3. Enter the following command to return only the days and not the timestamp in which the temperature is crossed the threshold,threshold_crossed_days['timestamp'] = threshold_crossed_days.index days = threshold_crossed_days.timestamp.map(lambda t: t.date()).unique() print ""Number of times the threshold is crossed: "" + str(threshold_crossed_days.temperature.count()) print ""The days are --> "" + str(days)  Number of times the threshold is crossed: 100 The days are –> [datetime.date(2016, 2, 19) datetime.date(2016, 2, 21) datetime.date(2016, 2, 22) datetime.date(2016, 2, 24) …….] 4. Enter the following command to find the hourly maximum temperature for each years, the result will show 24 rows per year wherein each row will show the maximum temperature of the corresponding hour. This will be useful to find out the utilization of the equipment (assuming the temperature is directly propotional to the utilization of the equipment) in each hour, for example, how much the equipment is utilized in the first hour compared to 2nd hour and so on. Best examples could be the space utilization (Office Space, Parking Space and etc..) for each hour over the year.# Find out hourly maximum temperature for each year year_hour_max = pandaDF.groupby(lambda x: (x.year, x.hour)).max() fig, ax = plt.subplots() plotDF = year_hour_max[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show() 5. 
You can create a bar chart as well for better visualization by typing the following command in the next cell and click Run,# draw a bar chart for hourly maximum temperature fig, ax = plt.subplots() plotDF.temperature.plot(kind='bar',figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show() 6. But if you want to observe the maximum temperature for each hour (every day) and plot it, enter the following code snippet,# Find out hourly maximum temperature for each day each_hour_max = pandaDF.groupby(lambda x: (x.year, x.month, x.day, x.hour)).max() fig, ax = plt.subplots() plotDF = each_hour_max[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each Day') ax.set_xlabel(""Hour of each day"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show()  7. Enter the following command to find out the maximum temperature for each day over the years,# Maximum temperature of each Day df = pandaDF df = df.drop([""_id""], axis=1) df = df.drop([""_rev""], axis=1) df = df.drop([""scaledzscore""], axis=1) df = df.drop([""scaledwzscore""], axis=1) df = df.drop([""forecast""], axis=1) df = df.drop([""zscore""], axis=1) df = df.drop([""wzscore""], axis=1) df['Year'] = map(lambda x: x.year, df.index) df['Month'] = map(lambda x: x.month, df.index) df['Day'] = map(lambda x: x.day, df.index) plotDF = df.groupby(['Day','Month','Year']).max() fig, ax = plt.subplots() plotDF.plot(kind='bar', figsize=[20,10], ax=ax) ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0) ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0) ax.set_ylim([10,25]) ax.set_title(""Daily maximum temperature"", fontsize=20) ax.set_xlabel(""Day"",fontsize=20) ax.set_ylabel(""Temperature"",fontsize=20) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() plt.tight_layout() plt.show() In this step, we have seen how to use the Pandas DataFrames to explore and plot the maximum temperature data from the historical data. Similarly you can use the min() function to find the minimum temperatures. OPERATIONS RELATED TO AVERAGE TEMPERATURE In this step, we will see how to use the Pandas DataFrames to explore and plot the average temperature data from the historical data. 1. Enter the following command in the next cell to find out the average temperature and click Run ,#calculate temperature mean pandaDF.temperature.mean() 17.593230723955266 2. Enter the following command to find the average temperature for the last one hour ,from datetime import * import pytz # retrieve the current time now = datetime.now(pytz.timezone('UTC')) last_n_hours = now - timedelta(hours=1) pandaDF.ix[last_n_hours:now].temperature.mean()  17.589529391059482 3. Similarly, to find the average temperature of the last one day enter the following command,# Caculate average temperature for last day from datetime import * import pytz # retrieve the current time now = datetime.now(pytz.timezone('UTC')) last_n_days = now – timedelta(days=1) pandaDF.ix[last_n_days:now].temperature.mean() 17.583351550960117 4. 
Similarly use the following command to find out the average temperature for the last month ,# retrieve the current time now = datetime.now(pytz.timezone('UTC')) ''' get the start time that will be behind n days from now, just modify ""days = n"" to ""hours = n"" in case if you want to retrieve the temperature from last n hours ''' last_n_days = now - timedelta(days=30) pandaDF.ix[last_n_days:now].temperature.mean() 17.592428135954886 5. Enter the following command to find hourly average temperature for each years, the result will show 24 rows per year wherein each column will show the average temperature of the corresponding hour. This will be useful to find out the utilization of the equipment (assuming the temperature is directly propotional to the utilization of the equipment) in each hour, for example, how much the equipment is utilized in the first hour compared to 2nd hour and so on. Best examples could be the space utilization for each hour over the year.# Find out hourly Average temperature for each year year_hour_avg = pandaDF.groupby(lambda x: (x.year, x.hour)).mean() fig, ax = plt.subplots() plotDF = year_hour_avg[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Average temperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show()  6. You can create a bar chart as well for better visualization by typing the following command in the next cell and click Run,# draw a bar chart for hourly average temperature fig, ax = plt.subplots() plotDF.temperature.plot(kind='bar',figsize=(15,5), ax=ax, title='Hourly Averagetemperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show()  7. But if you want to find out the average temperature for each hour and plot it, enter the following code snippet, In the following example, we plot the hourly average for last 2 days,# retrieve the current temperature now = datetime.now(pytz.timezone('UTC')) ''' get the start time that will be behind 2 days from now, just modify ""days = 2"" to ""hours=2"" in case if you want to retrieve the temperature from last 2 hours ''' last_n_days = now - timedelta(days=2) plotDF = pandaDF.ix[last_n_days:now] # Find out hourly average temperature for each day plotDF = plotDF.groupby(lambda x: (x.year, x.month, x.day, x.hour)).mean() fig, ax = plt.subplots() plotDF = plotDF[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Average temperature of each Day') ax.set_xlabel(""Hour of each day"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show() 8. 
Enter the following command to find out the average temperature for each day over the years:
# Average temperature of each Day
df = pandaDF
df = df.drop([""_id""], axis=1)
df = df.drop([""_rev""], axis=1)
df = df.drop([""scaledzscore""], axis=1)
df = df.drop([""scaledwzscore""], axis=1)
df = df.drop([""forecast""], axis=1)
df = df.drop([""zscore""], axis=1)
df = df.drop([""wzscore""], axis=1)
df['Year'] = map(lambda x: x.year, df.index)
df['Month'] = map(lambda x: x.month, df.index)
df['Day'] = map(lambda x: x.day, df.index)
plotDF = df.groupby(['Day','Month','Year']).mean()
fig, ax = plt.subplots()
plotDF.plot(kind='bar', figsize=[20,10], ax=ax)
ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0)
ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0)
ax.set_ylim([10,25])
ax.set_title(""Daily Average temperature"", fontsize=20)
ax.set_xlabel(""Day"",fontsize=20)
ax.set_ylabel(""Temperature"",fontsize=20)
ax.xaxis.grid(True, which=""minor"")
ax.yaxis.grid()
plt.tight_layout()
plt.show()
In this step, we have seen how to use Pandas DataFrames to explore and plot the average temperature data from the historical data. CONCLUSION AND THE ROAD AHEAD This recipe showed how to analyze historical timeseries data with Spark SQL and a Jupyter Notebook to understand the temperature trend over the day, month, and year, as well as the maximum and average temperatures over the year. One can use the average/maximum temperature derived from the data analysis to set a rule accordingly in the IBM Real-Time Insights service to create alerts. Developers can take a look at the code made available in this recipe, and at the Notebook in the github repository, to understand what's happening under the hood. The Notebook in github has more operations than what is shown in this recipe. Developers can consider this recipe as a template for doing timeseries historical data analysis and can modify the Python code depending upon the use case. The next recipe will showcase more complex analytical components. Keep watching this space. TUTORIAL TAGS #python bluemix cloudant dataframe ibmiot iot iotf jupyter machine learning pandas spark sql timeseries watson","This recipe showcases how one can analyze the historical time series data, captured on the IBM Watson IoT platform, in a Jupyter Notebook using Spark SQL and Pandas DataFrames. Also, use the pre-installed matplotlib library to visualize results. ",Timeseries Data Analysis of IoT events by using Jupyter Notebook,Live,12 37,"Maureen McElaney, dev advocate at @IBM Watson Data Platform. founder of @GDIBurlington. executive fellow at @BTVIgnite.
content here is mine. Website: http://mcelaney.me/ Apr 24 -------------------------------------------------------------------------------- BRIDGING THE GAP BETWEEN PYTHON AND SCALA JUPYTER NOTEBOOKS USING THE PIXIEDUST PYTHON HELPER LIBRARY TO IMPORT SCALA PACKAGES There’s a reason you’ve been hearing a lot about data science notebooks lately: data scientists are in high demand , and the Python programming language is widely used . In particular, Jupyter Notebooks are a popular tool for creating and sharing code for quick analysis. Most Jupyter Notebooks you’ll see come in two main flavors: Python and Scala. While Python is great for collaborating with colleagues — clean syntax, tons of handy libraries, good documentation — sometimes you need the processing power of Scala. The Apache Spark data processing engine is built on Scala, so if you’re working with a big data set, a Scala Jupyter notebook is the way to go. The downside of Scala is that fewer people know it . HELLO, [SCALA] WORLD! FROM [PYTHON] PIXIEDUST David Taieb has led the charge for our team to build an open source application that we affectionately call PixieDust. The PixieDust Python helper library works as an add-on to your Jupyter notebook that lets you do all sorts of new things , like automatic chart rendering or progress monitors for cells running code. PixieDust can also help developers bridge contexts: call Scala code from a Python Notebook, or call Python code from a Scala Notebook. With this post, I’d like to demonstrate how to use PixieDust to import a Scala “Hello, world!” package into a Jupyter Python notebook. I’m going use the same code that was used in this article by Dustin V : Part 1 : How to add a custom library to a Jupyter Scala notebook in IBM Data Science Experience… I have been using IBM’s Data Science Experience platform for a few months now. Its a great platform to perform data… medium.comI’ll show you how to use the same Scala JAR that Dustin provides, but from within a Jupyter Python notebook instead. PREPARE TO NOTEBOOK! Here are the basic steps: 1. Set up an account in IBM Data Science Experience (DSX) 2. Create a project in DSX 3. Point to Dustin’s Scala JAR 4. Test Scala JAR from Python Notebook with PixieDust 1. SET UP AN ACCOUNT IN IBM DATA SCIENCE EXPERIENCE (DSX) Browse to http://datascience.ibm.com/ and sign up for a free trial. You’ll get a 30-day free trial that includes Jupyter and other tools. (You will need to provide a personal email address for the account). This step should take about 10–15 minutes to complete. 2. CREATE A PROJECT IN DSX When your DSX account is ready, it’s time to create a new project: Name your project (mine is called “pixiedust”), and use the defaults for the Spark and Object Storage instances. You’ll add a notebook in just a moment. 3. POINT TO DUSTIN’S SCALA JAR It’s optional, but you can refer to Dustin V ’s post for steps on how to compile your own JAR file. Otherwise, he makes the JAR available for test purposes, and I found that it works just fine. You’ll see this URL in our sample notebook: https://github.com/dustinvanstee/dv-hw-scala/raw/master/target/scala-2.10/dv-hw-scala-assembly-1.0.jar 4. TEST SCALA JAR FROM PYTHON NOTEBOOK WITH PIXIEDUST You can now use our sample notebook to test PixieDust’s Python-Scala bridge functionality. 
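If you just want a feel for what the bridge looks like before opening the notebook, here is a rough sketch, not taken from the sample notebook itself, of the two PixieDust pieces involved: the installPackage helper that pulls in a JAR, and the %%scala cell magic that runs Scala inside a Python notebook. The JAR URL is Dustin's from above; the Scala class and method in the commented cell are hypothetical placeholders for whatever his package actually exposes, and the double-underscore prefix follows PixieDust's documented convention for returning Scala values to Python.
import pixiedust

# Pull the Scala JAR into the notebook's Spark environment
# (PixieDust asks for a kernel restart after a first-time install).
pixiedust.installPackage('https://github.com/dustinvanstee/dv-hw-scala/raw/master/target/scala-2.10/dv-hw-scala-assembly-1.0.jar')

# In a new cell, the %%scala magic runs Scala against the same Spark context.
# The class and method names below are placeholders, not the real entry point.
#
# %%scala
# val __greeting = com.example.hw.HelloWorld.sayHello()
#
# After the cell runs, __greeting is available back in Python as a plain variable.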
To run the notebook in your own account, first download it via the universal download arrow icon. Download our sample notebook before uploading and running it in your own DSX project. Head back to the DSX project you created in step 2, and add a notebook to your project: choose the From File option. Name your notebook (mine is “HelloWorld”). Now, create your notebook. The notebook has all the configuration and sample code you'll need. Just run the cells using the play icon, and make sure to restart your kernel when prompted in the cell output to avoid errors. Hello Scala - IBM Data Science Experience apsportal.ibm.com: Using PixieDust in a Python Notebook to access a custom Scala Package. You will know that everything is working when you see PixieDust generate a chart at the end of your notebook. Now you know that Scala and Python can be BFFs with PixieDust and Jupyter Notebooks! A Python matplotlib chart generated by PixieDust on a Scala Spark DataFrame. If you enjoyed this article, please ♡ it to recommend it to other Medium readers. Thanks to Mike Broberg. * Data Science * Python * Pixiedust * Scala * Jupyter","There’s a reason you’ve been hearing a lot about data science notebooks lately: data scientists are in high demand, and the Python programming language is widely used. In particular, Jupyter…",Bridging the Gap Between Python and Scala Jupyter Notebooks,Live,13 41,"Raj Singh, Developer Advocate and Open Data Lead at IBM Watson Data Platform. Aug 15. GOT ZIP CODE DATA? PREP IT FOR ANALYTICS. USING FINE-GRAINED U.S. CENSUS DATA AND JUPYTER NOTEBOOKS TO BETTER UNDERSTAND YOUR CUSTOMERS Who are those people lurking behind the statistics in your data? Whether you are looking at retail shoppers, insurance policy holders, banking customers or political constituents, the more you can flesh out the lives of the people behind the numbers, the better you will do at deriving useful insights into how to serve them. This is why demographic market segmentation is such an interesting industry. BLOCK PARTY Market segmentation is the process of dividing a target population into groups, or segments, based on some common characteristics. The strategies for creating these groups range from the simple — age, sex, race, income — to the sophisticated — “Uptown Individuals” or “Cozy Country Living.” Products such as Tapestry Segmentation from Esri or PRIZM from Claritas/Nielsen live at the sophisticated end, and carry a price tag to match. If you are not ready to take the plunge, however, you can do a lot on your own with U.S. Census data, some basic analytics skills, and a Jupyter notebook. The U.S. Census is a treasure trove of free demographic data, as I've written about before. You can find detailed statistics on age, income, race, housing, and occupation from the national level down to the block group (a very small area consisting of about 2,000 people in most places). That's just the tip of the iceberg.
There are many more interesting statistics you can tease out of Census data with a little bit of analytics skill. “Block groups are statistical divisions of census tracts and generally contain between 600 and 3,000 people.” Source: U.S. Census Bureau. THE CORE OF THE PROBLEM Some cities are denser than others. But where are those dense cores so you can finely target them? One statistic I find really interesting is how urban a person is. Do they live in the dense city, the suburbs, or out in the rural countryside? Depending on your question, location can be a more useful fact to know than age or income or family size. You might think that it's pretty easy to figure out what places are city, suburban, and rural, but it turns out to be a bit of a challenge. For example, take the map of eastern Massachusetts below. The City of Boston is shaded in gray in the center of the picture. That's a pretty poor representation of urban, as many towns around Boston are just as urban as the city (Cambridge, Somerville, and others). The Census has a place type called “Urban Areas,” which for the Boston area is the red line you see in the picture. It stretches waaaaaay out from the city to even go into New Hampshire to the north, and almost to Cape Cod to the south. This may make some sense when you look at the country as a whole, comparing Massachusetts to Minnesota for example, but it does a poor job of capturing true urban-ness. The dashed gray line is an even less useful designation from the Census called “Metropolitan Statistical Areas.” Depending on your definition, “urban” can mean a lot of different kinds of places. For instance, Boston's urban core is mostly walkable; however, if you're in Phoenix, you'll need a car. Now look at the map below derived from the data I've prepared. Instead of using the most detailed level of Census data — block groups — I use zip codes because you'll always have a zip code for your customers. Data geek note: these are actually “zip code tabulation areas” (ZCTAs), not true zip codes. ZCTAs are a zip code-esque structure the Census created to make zip code data better for mapping and spatial analysis. It shows most of Boston, and some neighboring zips, in red — true urban areas, places where people live primarily in multi-family housing, condos, or apartments. Toward the south, you can also see little red spots in Providence, RI; New Bedford, MA; and Fall River, MA. The orange color depicts areas called “Early Suburban.” Here you'll find people living primarily in single-family homes, but lot sizes will usually be around a 1/4 to 1/2 acre. Then in light orange, you'll see areas that are closer to rural, with single-family homes on 1 acre lots or larger. Finally, in a light tan color, is everything else — truly rural areas consisting primarily of 1+ acre residential lots, farms, and forests. Picking the core urban areas out of a wider, more suburban metro area. METHODOLOGY: BEFORE AND AFTER THE CAR The methodology used to build this model comes from an academic article, “From Jurisdictional to Functional Analysis of Urban Cores & Suburbs” in New Geography. From that work, my notebook uses the following classifications for urban-ness (a minimal pandas sketch of this classification logic follows the list):
* Urban (pre-auto urban core): density > 2,900 per sq. km
* Auto suburban, early: median house built 1946 to 1979, density between 100 and 2,900 per sq. km
* Auto suburban, later: median house built after 1979, density between 100 and 2,900 per sq. km
* Auto exurban: all others
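To make those thresholds concrete, here is a minimal pandas sketch of the classification step. It is not code from urbanity.ipynb: the column names pop_density_sqkm and median_year_built are assumed stand-ins for whatever the notebook derives from the Census files, and the cutoffs are exactly the ones listed above.
import numpy as np
import pandas as pd

def classify_urbanity(df):
    # One row per ZCTA, with assumed columns pop_density_sqkm and median_year_built.
    dense = df['pop_density_sqkm'] > 2900
    suburban = (df['pop_density_sqkm'] > 100) & (df['pop_density_sqkm'] <= 2900)
    early = df['median_year_built'].between(1946, 1979)
    later = df['median_year_built'] > 1979
    conditions = [dense, suburban & early, suburban & later]
    labels = ['urban', 'auto_suburban_early', 'auto_suburban_later']
    # Anything that matches none of the conditions falls through to exurban.
    return pd.Series(np.select(conditions, labels, default='auto_exurban'), index=df.index)

# Usage on the joined DataFrame described in the next section:
# zcta_df['urbanity'] = classify_urbanity(zcta_df)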
From the requirements above, the key data needed to reproduce the model are population and the median age-of-home in an area. We can easily get these data from the U.S. Census American Community Survey. The instructions for doing this yourself, if you are so inclined, are in the Jupyter notebook referenced below. SHOW ME THE DATA If you are less interested in the details of the analysis, and just want the data to use in your own work, we've provided a public download of the CSV file in this GitHub repo. If you want to see the details of how it was built, read on. OH, THE URBANITY! I analyzed the data using Python in a Jupyter Notebook called urbanity.ipynb in the same GitHub repo. It uses the Pandas read_csv function to extract statistics on zip code areas, population counts, and median housing age from three larger data files. In the notebook, I then join those statistics into a single DataFrame and calculate population density per square kilometer. From there it's a simple matter of running some SQL-like queries on the DataFrame to classify the zip codes into the four categories of interest. That's it for the initial analysis. LOOKING AROUND THE U.S. The Jupyter notebook goes on to create an interactive map using Mapbox technology, which I'll describe in detail in a forthcoming post. For now, I want to focus on what this map can tell us. As with the Boston example, other views from around the country each tell different stories about the composition of urban-ness, which, when combined with your own data, can lead to deeper insights into customers or constituents. The dense Mid-Atlantic region from New York City to Baltimore. Contrastingly, urbanity in the South shows almost no dense urban areas. Combining both extremes, Los Angeles to the San Francisco Bay shows large swaths of rural areas. If you find the data useful, or want to know more about how to use it to build a custom analysis, please leave a comment here. Whether you're in a Pre-Auto Urban Core or an Auto Exurban municipality, thank you for reading! Please ♡ this article to recommend it to other Medium readers. Thanks to Mike Broberg. * Jupyter * Analytics * Data Science * Mapbox * Market Segmentation","Who are those people lurking behind the statistics in your data? Whether you are looking at retail shoppers, insurance policy holders, banking customers or political constituents, the more you can…",Got zip code data? Prep it for analytics. – IBM Watson Data Lab – Medium,Live,14 45,"SPARK.TC STREAMING EXTEND STRUCTURED STREAMING FOR SPARK ML EARLY METHODS TO INTEGRATE MACHINE LEARNING USING NAIVE BAYES AND CUSTOM SINKS.
To learn more about Structured Streaming and Machine Learning, check out Holden Karau’s and Seth Hendrickson’s session Spark Structured Streaming for machine learning at Strata + Hadoop World New York from 2:05pm to 2:45pm, Thursday September 29th. Spark’s new ALPHA Structured Streaming API has caused a lot of excitement because it brings the Data set/DataFrame/SQL APIs into a streaming context. In this initial version of Structured Streaming, the machine learning APIs have not yet been integrated. However, this doesn’t stop us from having fun exploring how to get machine learning to work with Structured Streaming. (Simply keep in mind this is exploratory, and things will change in future versions.) For our Spark Structured Streaming for machine learning talk on at Strata + Hadoop World New York 2016, we’ve started early proof-of-concept work to integrate structured streaming and machine learning available in the spark-structured-streaming-ml repo. If you are interested in following along with the progress toward Spark's ML pipelines supporting structured streaming, I encourage you to follow SPARK-16424 and give us your feedback on our early draft design document . One of the simplest streaming machine learning algorithms you can implement on top of structured streaming is Naive Bayes, since much of the computation can be simplified to grouping and aggregating. The challenge is how to collect the aggregate data in such a way that you can use it to make predictions. The approach taken in the current streaming Naive Bayes won’t directly work, as the ForeachSink available in Spark Structured Streaming executes the actions on the workers, so you can’t update a local data structure with the latest counts. Instead, Spark's Structured Streaming has an in-memory table output format you can use to store the aggregate counts. // Compute the counts using a Dataset transformation val counts = ds.flatMap{ case LabeledPoint(label, vec) => vec.toArray.zip(Stream from 1).map(value => LabeledToken(label, value)) }.groupBy($""label"", $""value"").agg(count($""value"").alias(""count"")) .as[LabeledTokenCounts] // Create a table name to store the output in val tblName = ""qbsnb"" + java.util.UUID.randomUUID.toString.filter(_ != '-').toString // Write out the aggregate result in complete form to the in memory table val query = counts.writeStream.outputMode(OutputMode.Complete()) .format(""memory"").queryName(tblName).start() val tbl = ds.sparkSession.table(tblName).as[LabeledTokenCounts] The initial approach taken with Naive Bayes is not easily generalizable to other algorithms, which cannot as easily be represented by aggregate operations on a Dataset . Looking back at how the early DStream-based Spark Streaming API implemented machine learning can provide some hints on one possible solution. Provided you can come up with an update mechanism on how to merge new data into your existing model, the DStream foreachRDD solution allows you to access the underlying micro-batch view of the data. Sadly, foreachRDD doesn't have a direct equivalent in Structured Streaming, but by using a custom sink, you can get similar behavior in Structured Streaming. The sink API is defined by StreamSinkProvider , which is used to create an instance of the Sink given a SQLContext and settings about the sink, and Sink trait, which is used to process the actual data on a batch basis. 
abstract class ForeachDatasetSinkProvider extends StreamSinkProvider {
  def func(df: DataFrame): Unit
  def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): ForeachDatasetSink = {
    new ForeachDatasetSink(func)
  }
}
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    func(data)
  }
}
As with writing DataFrames to custom formats, to use a third-party sink you can specify the full class name of the sink. Since you need to specify the full class name of the format, you need to ensure that any instance of the SinkProvider can update the model—and since you can't get access to the sink object that gets constructed—you need to make the model outside of the sink.
object SimpleStreamingNaiveBayes {
  val model = new StreamingNaiveBayes()
}
class StreamingNaiveBayesSinkProvider extends ForeachDatasetSinkProvider {
  override def func(df: DataFrame) {
    val spark = df.sparkSession
    SimpleStreamingNaiveBayes.model.update(df)
  }
}
You can use the custom sink shown above to integrate machine learning into Structured Streaming while you are waiting for Spark ML to be updated with Structured Streaming.
// Train using the model inside SimpleStreamingNaiveBayes object
// - if called on multiple streams all streams will update the same model :(
// or would except if not for the hard coded query name preventing multiple
// of the same running.
def train(ds: Dataset[_]) = {
  ds.writeStream.format(
    ""com.highperformancespark.examples.structuredstreaming."" +
    ""StreamingNaiveBayesSinkProvider"")
    .queryName(""trainingnaiveBayes"")
    .start()
}
If you are willing to throw caution to the wind, you can access some Spark internals to construct a sink that behaves more like the original foreachRDD. If you are interested in custom sink support, you can follow SPARK-16407 or this PR. The cool part is, regardless of whether you want to access the internal Spark APIs, you can now handle batch updates in the same way Spark's earlier streaming machine learning is implemented. While this certainly isn't ready for production usage, you can see that the Structured Streaming API offers a number of different ways it can be extended to support machine learning. You can learn more in High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. HOLDEN KARAU, 22 September 2016. TAGS: streaming, data pros.
* * * *",Early methods to integrate machine learning using Naive Bayes and custom sinks.,Apache Spark™ 2.0: Extend Structured Streaming for Spark ML,Live,15 48,"* Home * Research * Partnerships and Chairs * Staff * Books * Articles * Videos * Presentations * Contact Information * Subscribe to our Newsletter * 中文 * Marketing Analytics * Credit Risk Analytics * Fraud Analytics * Process Analytics * Human Resource Analytics * Prof. dr. Bart Baesens * Prof. dr. Seppe vanden Broucke * Aimée Backiel * Libo Li * Sandra Mitrović * Klaas Nelissen * María Óskarsdóttir * Michael Reusens * Eugen Stripling * Tine Van Calster * Basic Java Programming * Principles of Database Management * Business Information Systems * Mini Lecture Series * Other Videos HIGHER-ORDER LOGISTIC REGRESSION FOR LARGE DATASETS Posted on February 11, 2017Contributed by: Sandra Mitrović This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps . Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch! -------------------------------------------------------------------------------- The performance of supervised predictive models is characterized by the generalization error, that is, the error, obtained on datasets different from the one used to train the model. More precisely, generalization error equals to expectation of predictive error over all datasets D and ground truth y and can be represented as: E D,y [ L ( y ,f( x ))], where f( x ) is predicted outcome for an input x and L is the chosen loss function. Obviously, the goal is to minimize generalization error, which, for several most typical choices of loss function, can be decomposed as: a L * Variance( x ) + Bias( x ) + b L * Noise( x ) Where factors a L and b L depend on the choice of a loss function L [1]. Bias measures systematic error, which stems from the predictive method used. Variance , on the other hand, is not related to the chosen modeling method, but rather to the dataset used. It represents fluctuations around the most commonly (in case of classification)/average (in case of regression) predicted value in different test datasets. Ideally, both bias and variance should be low, since high bias leads to under-fitting, which is inability of the model to fit the data (low predictive performance on train data), while high variance leads to over-fitting meaning that the obtained model is too much adjusted to the train data that it fails to generalize on test data (also known as “memorization” (of train data)). This, however, is not possible and hence, in practice, we have to make a trade-off between variance and bias. Different types of models have different bias/variance profiles, e.g. Naïve Bayes classifier has low variance and high bias, while decision tree has low bias and high variance. Logistic Regression (LR) is a well-established method, which despite being fairly simple has been proven to have good performances [2]. On one side, this is beneficial since it facilitates interpretation of the model and obtained results. On the other side, having low variance (and high bias), makes it a limited method in terms of its expressive power. We can overcome this drawback by introducing more complex features, obtained as a Cartesian product of the original features. 
Logistic regression of order n (denoted LR_n) is defined as the logistic regression modeling the interactions of the n-th order (as defined in [3], although it can also be defined to consider lower-level interactions, i.e. interactions of order ≤ n). LR_n allows modeling of a much larger number of distributions compared to LR. Obviously, on smaller datasets this leads to over-fitting, and different types of regularization are known in the literature to penalize for the model complexity. But what happens in the case of really large datasets? It has been demonstrated that, as the training dataset grows, variance decreases and bias increases [4]. Hence, the high variance of LR_n would not be a problem, as long as the bias could be controlled. The bias/variance profile of higher-order LR has been extensively investigated in [3], where LR_n (for n = 1, 2, 3) was compared on 75 datasets from the UCI repository. As can be seen in Figure 1 (borrowed from [3]), with an increasing amount of data, higher-order LR performs better than lower-order LR. In other words, with large enough datasets, as the order n increases, the bias of LR_n decreases. This clearly motivates the usage of higher-order LR with large datasets. Once again, it is very important to emphasize the amount of data observed (i.e. the number of instances). For example, based on the zoomed part of the graph, if we made a strict cutoff at any number of instances i, where i < 1000, then, because the LR_1 learning curve decreases more steeply than both LR_2 and LR_3, we would conclude that LR_1 performs the best (out of these three). As can be seen from the upper part of the figure, the same conclusion could be drawn to the detriment of LR_3 after a sufficiently large number of instances. However, for extremely large datasets, it is obvious that LR_3 outperforms the other two. This phenomenon is due to the fact that lower-order logistic regressions have a higher learning rate at the beginning of the learning process.
Figure 1: Learning curves of logistic regressions of different order, plotted for increasing amounts of data (an illustration from [3]).
REFERENCES
* [1] Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).
* [2] Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627-635.
* [3] Zaidi, N. A., Webb, G. I., Carman, M. J., & Petitjean, F. (2016). ALRn: Accelerated Higher-Order Logistic Regression. In Proceedings of the European Conference on Machine Learning.
* [4] Brain, D., & Webb, G. I. (2002). The need for low bias algorithms in classification learning from large data sets. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 62-73). Springer Berlin Heidelberg.
","The performance of supervised predictive models is characterized by the generalization error, that is, the error obtained on datasets different from the one used to train the model.",Higher-order Logistic Regression for Large Datasets,Live,16 50,"COMPOSE FOR MYSQL NOW FOR YOU Published Oct 12, 2016 Compose is pleased to bring a new database onto our platform in the form of Compose for MySQL. We've always considered MySQL as a potential Compose database, but had to wait for the arrival of a high availability solution that worked well enough that we could deliver it to Compose users. That solution came in the form of MySQL InnoDB Cluster, a new version of MySQL which has a reliable, performant replication and high availability architecture that is an excellent fit for the Compose environment. That meant we could bring MySQL's feature set to Compose and offer its proven and popular capabilities to our users. MySQL InnoDB Cluster is built around MySQL 5.7.15, which allows us to offer many of the most recent innovations such as the MySQL shell, the X DevAPI and the JSON document store as part of our new MySQL deployments. Adopting this leading-edge version of MySQL for our Compose for MySQL beta brings all the latest benefits of MySQL to our users. It also means that MySQL users can enjoy the power of Compose to set up their database with just one click, enjoy regular, automated backups and sleep better knowing their database is highly available in whichever cloud platform they choose. It means a database you can administer from the web, through an easy-to-use web front end. Give your administrators and developers their own accounts, with easily created roles to control access to your new database too. These are just some of the benefits of Compose for MySQL. So, how do you get going with the beta of Compose for MySQL? Simply log in to your Compose account, select Create Deployment and select MySQL from the Beta list. Within minutes you'll have your own MySQL cluster up and running. It'll cost you $27 a month for your first GB of data and $18 a month for each additional GB. If you don't have a Compose account, sign up now for a 30 day free trial.
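For readers who want to try the beta from code as well as from the web console, here is a minimal, hypothetical sketch of connecting to a Compose for MySQL deployment from Python with PyMySQL and exercising the MySQL 5.7 JSON column type mentioned above. The host, port, and credentials are placeholders you would copy from your deployment's connection details, and PyMySQL is just one of several client libraries you could use.

# Hypothetical connection details copied from a Compose for MySQL deployment.
import json
import pymysql

conn = pymysql.connect(
    host='sl-example.dblayer.com',  # placeholder host
    port=10000,                     # placeholder port
    user='admin',
    password='secret',
    database='compose',
)

with conn.cursor() as cur:
    # MySQL 5.7 ships a native JSON column type, part of the JSON document store.
    cur.execute('CREATE TABLE IF NOT EXISTS docs (id INT AUTO_INCREMENT PRIMARY KEY, doc JSON)')
    cur.execute('INSERT INTO docs (doc) VALUES (%s)', (json.dumps({'hello': 'compose'}),))
    conn.commit()
    cur.execute('SELECT id, doc FROM docs')
    for row in cur.fetchall():
        print(row)

conn.close()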
","We've always considered MySQL as a potential Compose database, but had to wait for the arrival of a high availability solution that worked well enough that we could deliver it to Compose users. That solution came in the form of MySQL InnoDB Cluster.",Compose for MySQL now for you,Live,17 57,"Luke de Oliveira | Feb 11 -------------------------------------------------------------------------------- FUELING THE GOLD RUSH: THE GREATEST PUBLIC DATASETS FOR AI It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee. Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI. However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility. With that said, it can be hard to piece through which public datasets are useful to look at, which are viable for a proof of concept, and which datasets can be useful as a potential product or feature validation step before you collect your own proprietary data. It's important to remember that good performance on a dataset doesn't guarantee a machine learning system will perform well in real product scenarios. Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it's the data collection and labeling. Standard datasets can be used as validation or a good starting point for building a more tailored solution. This week, a few machine learning experts and I were talking about all this. To make your life easier, we've collected an (opinionated) list of some open datasets that you can't afford not to know about in the AI world. -------------------------------------------------------------------------------- Legend: 📜 Classic — these are some of the more famous, legacy, or storied datasets in AI. It's hard to find a researcher or engineer who hasn't heard of them. 🛠 Useful — these are datasets that are about as close to real-world as a curated, cleaned dataset can be. Also, these are often general enough to be useful in both the product and R&D worlds. 📚 Academic baseline — these are datasets that are commonly used in the academic side of Machine Learning and AI as benchmarks or baselines.
For better or worse, people use these datasets to validate algorithms. 🗿 Old - these datasets, irrespective of utility, have been around for a while. COMPUTER VISION * 📚 📜 🗿 MNIST : most commonly used sanity check. Dataset of 25x25, centered, B&W handwritten digits. It is an easy task — just because something works on MNIST, doesn’t mean it works. * 📜 🗿 CIFAR 10 & CIFAR 100 : 32x32 color images. Not commonly used anymore, though once again, can be an interesting sanity check. * 🛠 📚 📜 ImageNet : the de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000 category WordNet hierarchy from ImageNet. * LSUN : Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition. * 📚 PASCAL VOC : Generic image Segmentation / classification — not terribly useful for building real-world image annotation, but great for baselines. * 📚 SVHN : House numbers from Google Street View. Think of this as recurrent MNIST in the wild. * MS COCO : Generic image understanding / captioning, with an associated competition. * 🛠 Visual Genome : Very detailed visual knowledge base with deep captioning of ~100K images. * 🛠 📚 📜 🗿 Labeled Faces in the Wild : Cropped facial regions (using Viola-Jones ) that have been labeled with a name identifier. A subset of the people present have two images in the dataset — it’s quite common for people to train facial matching systems here. NATURAL LANGUAGE * 🛠 📚 Text Classification Datasets (Google Drive Link) from Zhang et al., 2015 : An extensive set of eight datasets for text classification. These are the most commonly reported baselines for new text classification baselines. Sample size of 120K to 3.6M, ranging from binary to 14 class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG. * 🛠 📚 WikiText : large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind . * 🛠 Question Pairs : first dataset release from Quora containing duplicate / semantic similarity labels. * 🛠 📚 SQuAD : The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span , or segment of text. * CMU Q/A Dataset : Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles. * 🛠 Maluuba Datasets : Sophisticated, human-generated datasets for stateful natural language understanding research. * 🛠 📚 Billion Words : large, general purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe . * 🛠 📚 Common Crawl : Petabyte-scale crawl of the web — most frequently used for learning word embeddings. Available for free from Amazon S3 . Can also be useful as a network dataset for it’s crawl of the WWW. * 📚 📜 bAbi : synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR) . * 📚 The Children’s Book Test ( download link ): Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering, reading comprehension, and factoid look-up. * 📚 📜 🗿 Stanford Sentiment Treebank : standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree. 
* 📜 🗿 20 Newsgroups : one of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm. * 📜 🗿 Reuters : older, purely classification based dataset with text from the newswire. Commonly used in tutorials. * 📜 🗿 IMDB : an older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature in lieu of larger datasets. * 📜 🗿 UCI’s Spambase : Older, classic spam email dataset from the famous UCI Machine Learning Repository . Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering. SPEECH Most speech recognition datasets are proprietary — the data holds a lot of value for the company that curates. Most datasets available in the field are quite old. * 📚 🗿 2000 HUB5 English : English-only speech data used most recently in the Deep Speech paper from Baidu. * 📚 LibriSpeech : Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech. * 🛠 📚 VoxForge : Clean speech dataset of accented english, useful for instances in which you expect to need robustness to different accents or intonations. * 📚 📜 🗿 TIMIT : English-only speech recognition dataset. * 🛠 CHIME : Noisy speech recognition challenge dataset. Dataset contains real, simulated and clean voice recordings. Real being actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated is generated by combining multiple environments over speech utterances and clean being non-noisy recordings. * TED-LIUM : Audio transcription of TED talks. 1495 TED talks audio recordings along with full text transcriptions of those recordings. RECOMMENDATION AND RANKING SYSTEMS * 📜 🗿 Netflix Challenge : first major Kaggle style data challenge. Only available unofficially, as privacy issues arose . * 🛠 📚 📜 MovieLens : various sizes of movie review data — commonly used for collaborative filtering baselines. * Million Song Dataset : large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems. * 🛠 Last.fm : music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems. NETWORKS AND GRAPHS * 📚 Amazon Co-Purchasing and Amazon Reviews : crawled data from the “ users who bought this also bought… ” section of Amazon, as well as amazon review data for related products. Good for experimenting with recommendation systems in networks. * Friendster Social Network Dataset : Before their pivot as a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users. GEOSPATIAL DATA * 🛠 📜 OpenStreetMap : Vector data for the entire planet under a free license . It includes (an older version of) the US Census Bureau’s TIGER data. * 🛠 Landsat8 : Satellite shots of the entire Earth surface, updated every several weeks. * 🛠 NEXRAD : Doppler radar scans of atmospheric conditions in the US. ❗️People often think solving a problem on one dataset is equivalent to having a well thought out product. Use these datasets as validation or proofs of concept , but don’t forget to test or prototype how the product will function and obtain new, more realistic data to improve its operation. 
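As a concrete example of the 'validation or proof of concept' use suggested above, here is a minimal sketch that pulls one of the classic datasets listed earlier (20 Newsgroups, via scikit-learn's built-in loader) and fits a quick text-classification baseline. It is purely illustrative; the model choice and parameters are assumptions, and a real product would still need its own data.

# Quick proof-of-concept baseline on a public dataset (20 Newsgroups).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# TF-IDF features plus logistic regression: a common, cheap text baseline.
baseline = make_pipeline(
    TfidfVectorizer(max_features=50000),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train.data, train.target)

print('held-out accuracy:', baseline.score(test.data, test.target))

A number like this only tells you the approach is plausible on curated data; it says nothing about how the system behaves on the messier data a real product will see, which is exactly the caveat above.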
Successful data-driven companies usually derive strength from their ability to collect new, proprietary data that improves their performance in a defensible way. -------------------------------------------------------------------------------- PLEASE CONTRIBUTE! If you think we’ve missed a dataset or two (which we definitely have!) or have a conflicting opinion about a dataset discussed here, please let me know with a comment, or you can shoot me an email at lukedeo@ldo.io ! P.S. — This post is part of a open, collaborative effort to build an online reference, the Open Guide to Practical AI , which we’ll release in draft form in a few weeks. See this popular previous guide for an example. If you’d like to get updates on or help with with this effort, drop me a comment or email me at lukedeo@ldo.io . Special thanks to Joshua Levy , Srinath Sridhar , and Max Grigorev . Thanks to Joshua Levy . Machine Learning Artificial Intelligence Deep Learning Data Science Big Data 759 18 Blocked Unblock Follow FollowingLUKE DE OLIVEIRA Deep learning, Infrastructure, and Open Source. Founder @ Vai, Visiting scientist @ Berkeley Labs, Stanford/Yale Alum. FollowSTARTUP GRIND The life, work, and tactics of entrepreneurs around the world — by founders, for founders. Welcoming submissions on technology trends, product design, growth strategies, and venture investing. * Share * 759 * * Never miss a story from Startup Grind , when you sign up for Medium. Learn more Never miss a story from Startup Grind Get updates Get updates","It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the…",The Greatest Public Datasets for AI – Startup Grind,Live,18 60,"METRICS MAVEN: MODE D'EMPLOI - FINDING THE MODE IN POSTGRESQL Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 6, 2016In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll have a look at mode to round out our series on mean, median, and mode. Mode is the simplest to understand of the three metrics we've been looking at (mean, median, and mode) so we'll keep this article short and sweet and get straight to it. If you want to start with a review of mean or median before looking at mode, then have a look at Calculating a Mean or A Look at Median for a refresher. For our examples in this article, we'll continue to use the orders data from our dog products catalog that we've used in the previous articles: order_id | date | item_count | order_value ------------------------------------------------ 50000 | 2016-09-02 | 3 | 35.97 50001 | 2016-09-02 | 2 | 7.98 50002 | 2016-09-02 | 1 | 5.99 50003 | 2016-09-02 | 1 | 4.99 50004 | 2016-09-02 | 7 | 78.93 50005 | 2016-09-02 | 0 | (NULL) 50006 | 2016-09-02 | 1 | 5.99 50007 | 2016-09-02 | 2 | 19.98 50008 | 2016-09-02 | 1 | 5.99 50009 | 2016-09-02 | 2 | 12.98 50010 | 2016-09-02 | 1 | 20.99 MODE The mode of a series is the most frequently occurring value. In some series this may indicate popularity. In others, it is an indication of commonality, more conspicuous than the average or the median. For our use case, together with mean and median , the mode can help us really zero in on why we're seeing the results that we do from our hypothetical pet supply business. 
Unlike median, for which we covered 4 different query options in our previous article, PostgreSQL offers a built-in function, starting in version 9.4, to find the mode in a series: MODE() . Let's dive right into some examples. We'll start by finding the mode for item_count with this query: SELECT MODE() WITHIN GROUP (ORDER BY item_count) AS item_count_mode FROM orders; As you can see, the syntax for MODE() looks a little awkward. You use the WITHIN GROUP (ORDER BY ...) clause to indicate the field you want to get the mode of. We encountered this clause when finding the median in option 4 of our previous article. This clause is used with the ordered set aggregates introduced in PostgreSQL 9.4, such as PERCENTILE_CONT and RANK . Once you start to use these aggregate functions, you'll easily get the hang of it. Now back to what we were doing... Our result from the query above is 1. Orders from our dog products catalog contain only 1 item most frequently. That's disappointing for the business. Secretly, we'd hoped customers would buy whole product lines of items for their pooches! ZEROES AND NULLS You may be wondering right about now how MODE() handles zeroes and NULLs since one of our orders has a ""0"" item count and a NULL order value. From our previous articles, we know that this is an important aspect to consider for obtaining the best metrics for the use case. MODE() and the other ordered set aggregates ignore NULL values by default. That's good news because we determined previously that we should be ignoring orders that have a 0 item_count or a NULL order_value . Those would clearly be invalid orders. MODE() does not, however, ignore zeroes. In our case, it does not matter much since we have only one zero value in our orders, but if we didn't know that, we would actually want to write the query including a WHERE condition for the item_count to not be zero, like so: SELECT MODE() WITHIN GROUP (ORDER BY item_count) AS item_count_mode FROM orders WHERE item_count <> 0; WHAT MODE CAN TELL US ABOUT OUR BUSINESS Now that we've got the handling of zeroes and NULLs squared away, let's look at the mode for order_value to get more insight into orders: SELECT MODE() WITHIN GROUP (ORDER BY order_value) AS order_value_mode FROM orders; The result we get back is $5.99. Hmmmm.... these are pretty strong indicators for why our business isn't performing as well as we want it to be. Customers are most frequently only purchasing 1 item at a time with a value of $5.99. If we look back at the values we got from mean and median for each of these fields, the story becomes clearer with each metric: * Mean item count = 2.10 * Median item count = 1.5 * Mode item count = 1 * Mean order value = $19.98 * Median order value = $10.48 * Mode order value = $5.99 If we were relying on just the mean (or even the median) to get a sense of our business performance, we would have inadvertently believed we were doing much better than we actually are. Now, in full recognition of the reality that our orders are not where we want them to be, we can take action. We might offer a discount for customers who purchase multiple items in one order or we might promote higher-priced items more strongly than lower-priced ones. Armed with these metrics, we can decide how to increase orders and improve our business. WRAPPING UP This concludes our look at mean, median, and mode and why each of them is an important metric to get a handle on. As we've seen, they each provide a slightly different perspective on the data.
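If you want to pull all three of these metrics from application code rather than a psql session, here is a minimal sketch using Python and the psycopg2 driver; the driver choice and connection string are assumptions, and the query simply combines AVG() , PERCENTILE_CONT(0.5) and MODE() as described in this series.

# Pull mean, median, and mode for order_value in one round trip.
# Assumes a local 'metrics' database containing the orders table from this series.
import psycopg2

QUERY = '''
    SELECT AVG(order_value)                                         AS mean_value,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_value) AS median_value,
           MODE()               WITHIN GROUP (ORDER BY order_value) AS mode_value
    FROM orders
    WHERE item_count <> 0;
'''

conn = psycopg2.connect('dbname=metrics user=postgres')  # assumed local connection details
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    mean_value, median_value, mode_value = cur.fetchone()
    print('mean:', mean_value, 'median:', median_value, 'mode:', mode_value)
conn.close()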
By using all of them together, we can do a much better job of understanding how our business is doing (and then determining the actions we should take) than by using just one of them alone. This also concludes 2016 for the Metrics Maven series! Join us next year as we go even deeper into metrics - how to calculate and apply them to get the most from your data. Until then, wishing you all happy holidays! Image by: Peggy_Marco Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2016 Compose","In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll have a look at mode to round out our series on mean, median, and mode.",Finding the Mode in PostgreSQL,Live,19 62,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * IBM Data Science Experience Blocked Unblock Follow Following Apr 25 -------------------------------------------------------------------------------- WORKING INTERACTIVELY WITH RSTUDIO AND NOTEBOOKS IN DSX It is often useful to use RStudio for one piece of your analysis and notebooks (whether in R, or in another language) for other parts of your analysis. This article will step you through the process of interacting between the two environments by saving data to the underlying Spark service’s distributed GPFS file structure. The following is top level view of the cluster stack for Data Science Experience (DSX): Notice that the RStudio kernel is not a typical “edge node” and instead accesses DSX via the sparklyr snappy connect pipeline. For instructions on how to connect an RStudio instance to a Spark instance follow the examples provided in the /home/rstudio/ibm-sparkaas-demo folder from your DSX home: As the code shows, you will need to know the names of your spark service kernels: > kernels <- list_spark_kernels() > kernels [1] ""test1"" ""Apache Spark-ty"" ""sparklyr"" ""Apache Spark-jw"" For my RStudio session, I have four spark kernels available (“test1”, “Apache Spark-ty”, “sparklyr”, “Apache Spark-jw”). To get R to interact with a Spark service choose one of the services listed: > # connect to Spark kernel > sc <- spark_connect(config = ""sparklyr"") For more information on how to use the sparklyr connection to your spark instance see the impressive sparklyr documentation at spark.rstudio.com . In order to interact between notebooks and RStudio you will need a notebook running on the same Spark session as you connected to with sparklyr: Notice that above, I connected RStudio to the SparkAAS named “sparklyr”, and here I am creating my R notebook on the Spark Service named the same. This way they will be able to interact on that service. Notice that we also could interact with the Spark objects with Python or Scala notebooks. To see these two interfaces working together, let’s save some spark data to the distributed file system. 
First, we will need to find the address to the Spark service GPFS home directory. Here we are seeing the address to the tenant root directory plus the default /notebook/work : The R code used in the Notebook above is: getwd() Alternatively you could use the “ls” command on the unix command line. In R you could use the system command: system(""pwd"", intern=TRUE) In Python you can use the “!” command to access the system prompt: !pwd In any case, you just need to find your tenant name for the associated Spark service. Back in RStudio, we can view our tenant name by looking at the spark context: > sc$config$tenant.id[[1]][1] [1] ""s106-a1450be504b787-ea7328759346"" This tenant name should be the exact same as the tenant name in your notebook (above my tenant name is “s106-a1450be504b787-ea7328759346” for this particular Spark service). In order to allow notebook and RStudio elements to interact, we will want to move out of the /notebook/work folder in our notebook, and create new folder (or access an already existing folder) in our Spark service’s root directory. In our notebook, we will move up the to the tenant root directory and create a folder called spark_work1 : The code used in the R Notebook above is: #get the working directory getwd() #set the working directory to the root home for GPFS tenant_name = ""s106-a1450be504b787-ea7328759346"" #FILL IN YOUR TENANT NAME HERE setwd(paste0(""/gpfs/global_fs01/sym_shared/YPProdSpark/user/"", tenant_name)) getwd() #make a directory with systemt command system(""mkdir spark_work1"", intern=TRUE) #move to the directory and verify that it is empty setwd(""./spark_work1"") getwd() system(""ls"", intern=TRUE) Next we will save a sample file from RStudio to this tenant’s GPFS: Here is the R code used above: library(dplyr) #connect to the correct spark instance sc <- spark_connect(config = ""sparklyr"") #get the tenant name from your Notebook tenant_name = ""s106-a1450be504b787-ea7328759346"" #create a temp file in spark iris_tbl = copy_to(sc, iris, ""iris"") #save the temp file to arquet spark_write_parquet( iris_tbl, paste0(""/gpfs/global_fs01/sym_shared/YPProdSpark/user/"", tenant_name, ""/spark_work1/iris_tbl_parquet"")) And now we can take a look at the file, and load it in a notebook: Using the following R code from a notebook on the same spark instance: setwd(""/gpfs/global_fs01/sym_shared/YPProdSpark/user/s106-a1450be504b787-ea7328759346/spark_work1"") system(""ls"", intern= T) dat = read.parquet(sqlContext, ""/gpfs/global_fs01/sym_shared/YPProdSpark/user/s106-a1450be504b787-ea7328759346/spark_work1/iris_tbl_parquet/"" ) take(dat,5) One thing to take note of is that RStudio and R notebooks in DSX interface with Spark through two different APIs (sparklyr and SparkR, respectively). This means that your R code in the two different UIs will not be plug and play, and more importantly, that you cannot save and interact models between the two interfaces. -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on April 25, 2017 by Jim Crozier . * Spark * Rstudio One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingIBM DATA SCIENCE EXPERIENCE FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. 
Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","It is often useful to use RStudio for one piece of your analysis and notebooks (whether in R, or in another language) for other parts of your analysis. This article will step you through the process…",Working interactively with RStudio and notebooks in DSX,Live,20 63,"Raj Singh Blocked Unblock Follow Following Developer Advocate and Open Data Lead at IBM Watson Data Platform Jun 14 -------------------------------------------------------------------------------- MAPPING FOR DATA SCIENCE WITH PIXIEDUST AND MAPBOX ADD ANOTHER LAYER TO YOUR JUPYTER NOTEBOOKS WITH BUILT-IN MAP RENDERING You’re doing your data a disservice if you don’t use maps. Nearly all data has a spatial component — customer locations, crime, election results, traffic incidents, points-of-purchase, infrastructure locations—and if you’re not familiar with some basic mapping tools, you’re not doing good data science. Seeing your data on a map has many powerful benefits. At the exploratory stage of data science, it’s a great way to get a feel for the geographic distribution of your data. Home sales over a million dollars, winter 2016. Darker means higher price. Data courtesy of Redfin.com .MAKING SCATTERPLOTS LESS SCATTERSHOT A standard scatterplot is a good starting point. It can work OK for spatial data — just use longitude and latitude as your X and Y values — but plotting those points on a real map shows your data’s relationship to real-world features. For example, a scatterplot might reveal that the data is clustered into four major groupings, but a map could show not only those four groupings, but also that they are all near subway stations. Overlaying your data on a map can surface unseen patterns. A plain scatterplot on the home sales data set lacks context without a map.At the presentation stage of data science, using maps is a no-brainer. Combining data with maps is a natural storytelling device (when the story you’re telling has a geographic aspect, of course). So you’ve never done any mapping before, and it seems hard? Not to worry. PixieDust makes it easy , with a little help from Mapbox APIs (and to a lesser extent the Google Maps API), you can get up and running with some beautiful map-based visualizations in no time! GETTING STARTED If you’re unfamiliar with PixieDust, check out this introductory article and get your Jupyter Notebook-based data science environment up and running. PixieDust comes with mapping goodness baked in. As you read through the next section, you can follow along in the Jupyter notebook called pixiedust_mapbox_geocharts hosted on the IBM Data Science Experience . GEOCHARTS A Google GeoChart, which you can easily render in PixieDust given the correct field names.If your data contains a place name column such as country, province or state names, you can make what Google calls a GeoChart — a map that shows those regions color-coded based on the value of a numeric column in your data. To create a GeoChart in PixieDust, you must first have a Spark or Pandas DataFrame with a place name column. Invoke PixieDust on that DataFrame with the display() command: display( mydataframe ) . Click on the chart menu (to the right of the table button) and select the Map item (it’s the one with the globe icon). The options dialog should pop up. If it doesn’t, click on the Options button and drag the field that has place names into Keys . Then for the Values field, choose any numeric field you want to visualize. 
Within the Display Mode menu, choose: * Region to color the entire area of your named places, e.g., countries, provinces, or states. * Markers to place a circle in the center of the region which is scaled according to the data selected for the Value field. * Text to label regions with labels like Russia or Asia . This is good stuff, but if you’re doing heavyweight data science, in most cases your data will be disaggregated down to the point (latitude/longitude) level. This is where we chose to use our mapping partner Mapbox’s API instead of one of Google’s mapping APIs. Selecting the map option from PixieDust’s chart menu, and using the mapbox renderer.POINT MAPPING WITH MAPBOX The Mapbox option lets you create a map of geographic point data. Your DataFrame needs at least the following three fields in order to work with this renderer: * a latitude field named latitude , lat , or y * a longitude field named longitude , lon , long , or x * a numeric field for visualization To use the Mapbox renderer, you need a free API key from Mapbox. You can get one on their website at https://www.mapbox.com/signup/ . When you get your key, enter it in the Options dialog box. In the Options dialog, drag both your latitude and longitude fields into Keys . Then choose any numeric fields for Values . Only the first one you choose is used to color the map thematically, but any other fields specified in Values appear in a tooltip when you hover over a data point on the map. You can also choose the style of the underlying base map, which gives context to your data. The image below uses Mapbox’s “light” style, which works great with a data overlay, as the lightly colored streets and place names don’t fight for attention with your data. Following the instructions above using PixieDust’s test dataset #6, you can reproduce this tasteful map.If you want to see what a place really looks like, zoom in and switch to satellite view: I can’t wait to see what cool things you do with the mapping features in PixieDust. Let me know how you use it here in the comments below. And please ♡ this article to recommend it to other Medium readers. Thanks to Mike Broberg . * Mapbox * Geospatial * Google Maps * Jupyter * Pixiedust Blocked Unblock Follow FollowingRAJ SINGH Developer Advocate and Open Data Lead at IBM Watson Data Platform FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","You’re doing your data a disservice if you don’t use maps. Nearly all data has a spatial component — customer locations, crime, election results, traffic incidents, points-of-purchase, infrastructure…",Mapping for Data Science with PixieDust and Mapbox – IBM Watson Data Lab – Medium,Live,21 64,"IMPORTING JSON DOCUMENTS WITH NOSQLIMPORT Glynn Bird / September 19, 2016Two years ago, I started to write couchimport , a command-line line utility to allow me to import comma-separated and tab-separated files into a Apache CouchDB™ or Cloudant NoSQL database. cat mydata.tsv | couchimport --db mydatabase I built the tool for my own purposes but decided to share it publicly and open-sourced couchimport in case anyone else would find it useful. 
I learned a lot by writing this project: * publishing the project to npm (the Node.js package manager) allows other users to easily install the code — npm install -g couchimport * if your library/utility would save someone an hour of effort, then it's worth open-sourcing it * using the Node.js Stream API allows your application to deal with file streams and HTTP streams interchangeably * a decent README is essential if you expect folks to use your software. In many cases the README.md file is the documentation for your project. The purpose of couchimport is to write data to CouchDB in chunks using data from a text file. It also allows a transform function to be added to the workflow: this function gets called with each new document and can modify it to cast data types, remove unwanted fields or to reorganise the structure of the document. The project turned out to be a very useful way for folks to get started with CouchDB because pre-existing data is likely to be in a spreadsheet or a relational database which can easily be exported to CSV/TSV. I also found I needed to use couchimport's functionality programmatically, and so I exposed some of its functions to the world so that couchimport can be npm install -ed into anyone's Node.js project. In fact, couchimport is the importer used in our Simple Search Service project. INTRODUCING NOSQLIMPORT I recently refactored the couchimport code to make it work with other JSON document stores, so today I'm publishing nosqlimport. This can be installed as a command-line utility: npm install -g nosqlimport Or as a library to be used in your own project: npm install --save nosqlimport On its own, nosqlimport only writes its data to the terminal, but it has three other optional npm modules that can be added for Apache CouchDB, MongoDB and ElasticSearch support: npm install -g nosqlimport-couchdb npm install -g nosqlimport-mongodb npm install -g nosqlimport-elasticsearch The type of database that is written to is defined by the --nosql or -n command line switch at run-time, e.g.: cat movies.tsv | nosqlimport -n couchdb IMPORTING DATA INTO COUCHDB Firstly, define your CouchDB or Cloudant URL as an environment variable: export NOSQL_URL=http://localhost:5984 Or: export NOSQL_URL=https://myusername:mypassword@myaccount.cloudant.com The CouchDB or Cloudant ""database"" to write data to can also be defined as an environment variable: export NOSQL_DATABASE=mydatabase Then import a text file: cat movies.tsv | nosqlimport -n couchdb If you'd prefer to supply all the details as command-line switches, then that's possible too: cat movies.tsv | nosqlimport -n couchdb -u https://myusername:mypassword@myaccount.cloudant.com -db mydatabase IMPORTING DATA INTO MONGODB Firstly, define your MongoDB URL as an environment variable: export NOSQL_URL=mongodb://localhost:27017/mydatabase The MongoDB ""collection"" to write data to can also be defined as an environment variable: export NOSQL_DATABASE=mycollection Then import a text file: cat movies.tsv | nosqlimport -n mongodb If you'd prefer to supply all the details as command-line switches, then that's possible too: cat movies.tsv | nosqlimport -n mongodb -u mongodb://localhost:27017/mydatabase -db mycollection IMPORTING DATA INTO ELASTICSEARCH Firstly, define your ElasticSearch URL as an environment variable: export NOSQL_URL=http://localhost:9200/myindex The ElasticSearch ""type"" to write data to can also be defined as an environment variable: export NOSQL_DATABASE=mytype Then import a text file: cat movies.tsv | nosqlimport -n
elasticsearch If you’d prefer to supply all the details as command-line switches, then that’s possible too: cat movies.tsv | nosqlimport -n elasticsearch -u http://localhost:9200/myindex -db mytype SPECIFYING THE DELIMITER By default, nosqlimport expects text files with a tab character delimiting the columns in the text file, but this can be specified at run time by supplying a --delimiter or -d parameter: cat movies.csv | nosqlimport -d ',' -n couchdb TRANSFORM FUNCTIONS Transform functions are entirely optional but are a very powerful way of modifying the JSON object before it is written to the database. You may need to: * cast data types to force strings to be numbers, or booleans prior to saving * remove some documents that don’t need saving in the first place * rearrange the JSON object e.g. generate a GeoJSON object from a text file of latitudes and longitudes A transform function is saved to a text file before calling nosqlimport and contains a single JavaScript function exported via module.exports . The transform function is called for each row in the incoming text file (except the first line which contains the column headings), and the document it synchronously returns is added to the write buffer. For example, if our source data looked like this: name latitude longitude description live Middlesbrough 54.576841 -1.234976 A large industrial town on the south bank of the River Tees true Boston 42.358056 -71.063611 The largest city in Massachusetts. true Atlantis 0 0 A fictional island falseThe documents being generated and passed to the transform function would look like this: { ""name"": ""Middlesbrough"", ""latitude"": ""54.576841"", ""longitude"": ""-1.234976"", ""description"": ""A large industrial town on the south bank of the River Tees"", ""live"": ""true"" } Notice how: * the object’s keys were inferred from the incoming file’s first line * the values are all strings — because a CSV file doesn’t contain any sense of a column’s data type. In this example, we cast the latitude and longitude values to numbers and force the live value to be a boolean: module.exports = function(doc) { doc.latitude = parseFloat(doc.latitude); doc.longitude = parseFloat(doc.longitude); doc.live = (doc.live === 'true'); return doc; }; To prevent certain documents from being saved, then simply return {} instead of a populated object: module.exports = function(doc) { if (doc.live === 'true') { return doc; } else { // nothing is written to the database return {} } }; Or you can elect to craft a new JSON document in your own format based on the data being imported, in this case GeoJSON: module.exports = function(doc) { if (doc.live === 'true') { var newdoc = { type: 'Feature', geometry: { type: 'Point', coordinates: [ parseFloat(doc.latitude), parseFloat(doc.longitude) ] }, properties: { name: doc.name } }; return newdoc; } else { return {}; } }; A transform function is used by supplying the path to the file containing the code using the -t parameter: cat places.tsv | nosqlimport -n mongodb -t './geojson.js' USING NOSQLIMPORT IN YOUR OWN APPLICATION If you are building a Node.js application and need to be able to import files of content, streams or HTTP streams into a NoSQL database, then you can use nosqlimport in your own project as a dependency. 
Add it to your project with: npm install --save nosqlimport Add the database-specific module: npm install --save nosqlimport-couchdb npm install --save nosqlimport-mongodb npm install --save nosqlimport-elasticsearch And call the code: var nosqlimport = require('nosqlimport'); // connection options var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb'}; // import the data nosqlimport.importFile('./places.tsv', null, opts, function(err, data) { console.log(err, data); }); Or, supply a JavaScript function to transform the data: var nosqlimport = require('nosqlimport'); // cast lat/long to numbers and live to boolean var transformer = function(doc) { doc.latitude = parseFloat(doc.latitude); doc.longitude = parseFloat(doc.longitude); doc.live = (doc.live === 'true'); return doc; }; // connection options var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb', transform: transformer}; // import the data nosqlimport.importFile('./places.tsv', null, opts, function(err, data) { console.log(err, data); }); LINKS nosqlimport and its plugins are open-source projects, so please raise issues or contribute PRs if you can! * https://www.npmjs.com/package/nosqlimport * https://www.npmjs.com/package/nosqlimport-couchdb * https://www.npmjs.com/package/nosqlimport-mongodb * https://www.npmjs.com/package/nosqlimport-elasticsearch","Introducing nosqlimport, an npm module to help you import comma-separated and tab-separated files into your JSON document store of choice.",Move CSVs into different JSON doc stores,Live,22 66,"This video shows you how to build and query a Cloudant Geospatial index using the new Maps in the Cloudant dashboard! Watch the other videos in this series titled ""Introducing Cloudant Geospatial"" and ""Cloudant Geospatial in Action"". Find more videos in the Cloudant Learning Center at http://www.cloudant.com/learning-center.",This video shows you how to build and query a Cloudant Geospatial index using the new Maps in the Cloudant dashboard!,Tutorial: How to build and query a Cloudant geospatial index,Live,23 68,"THE CONVERSATIONAL INTERFACE IS THE NEW PARADIGM Published Jun 30, 2016 In 1962 Thomas Kuhn published The Structure of Scientific Revolutions.
In it he posited that science moves forward with brief, dramatic episodes of revolution in the paradigms of thought followed by longer terms of assimilating and exploring these changes. A stepwise function if you will from revolution to revolution. One could say that the brief history of software is governed by a similar abstraction. From the era of the desktop app to the era of the web page to the era of the mobile app to the latest paradigm shift which seems to be happening now: the conversation. As developers it behooves us to keep up, even if it just appeals to the ""look it's new and shiny"" which some of us have, with these dramatic changes. Certainly, the hype cycle in the short term will get to the point that the conversation bots or assistants or whatever the eventual designated name will be will overrun what is actually possible. Eventually, though this new paradigm like all of those before it will take a long period of time to work its way forward and move into many aspects of computing. What follows is an example which is not even a toy app but we will carry it no further. The goal is to expose you to some of the differences which are currently apparent in this next revolution. It is still early and it is unclear who will win (Siri or Alexa or Facebook Messenger or some unrealesed thing from Google or ...) and what the ultimate ecosystem will look like. It does seem clear though that whoever does win they won't be able to do it all. No one company can write all of the desktop apps or all of the web pages or even all of the mobile apps. Conversational apps will be the same. We will end up with some provider(s) who will deliver the interface to the users either via a message line like Slack or WhatsApp or via voice like Siri and Alexa (both of these ultimately get turned into text lines too). These providers will most likely sit at the center of an ecosystem which will handle NLP (Natural Language Processing), semantic analysis, and other core tasks such as location and calendar integration. So, what will this leave? All of the niche domains provided by all of the many businesses and organizations in the world! It's a huge opportunity. Things like the following: 1. Ask your local grocery store bot if they have an item currently in stock. e.g. @CurbMarket Do you have any local strawberries today? 2. Tell a clothing merchant to notify you next time they have a big sale. e.g. @OakHallClothier Tell me when you have your next sale 3. Use a service to estimate when auto maintenance is due. e.g. @autobot i have a 2011 Toyota Highlander with 48000 miles. Tell me when my next oil change is due. BOTKIT There are many tools for bots today with new ones arriving, some fading and others ""on the horizon"". Currently, there are ""bits and pieces"" for particulars like dialogs (IBM Dialog) and NLP (IBM AlchemyAPI) all the way to large sdk's for voice and digital assistants (Alexa, Siri, and Google). This non comprehensive list points to a few facts about this current space of chatbots. It's early and there is a large scope of investment occurring. While all of these warrant investigating if you are interested in this space, the easiest entry currently is a project called Botkit. It's an open source Javascript library built by the folks at howdy.ai with some assistance from the folks at Slack . It runs as a Node server which can connect via a socket to Slack's Realtime API or it can even handle webhooks from Slack, Facebook, and Twilio. 
Botkit provides a simple framework to handle the basics of creating a chat application. Starting with Slack's Realtime API Slack in some ways is the simplest and arguably most useful of the current platforms. Many teams use Slack with some basic integrations on a daily basis. Many of these bots appear as users inside of Slack and have an online presence in a channel at the same level as a user. It is very easy to connect a bot once you have a token from Slack: var Botkit = require('botkit'); if(!process.env.token) { console.log(""Must set slack token in env.""); process.exit(1); } var controller = Botkit.slackbot({ debug: false }); controller.spawn({ token: process.env.token }).startRTM(function(err) { if(err) { throw new Error(err); } }); The controller above is the core driver that creates the direct connection to Slack via a socket. Then once the bot is connected it can listen for many types of events such as a direct_message or mention or even more basic things like rtm_open and user_channel_join . Often though we just want the bot to hear certain things and react to them: controller.hears(['hello','hi'], ['mention'], function(bot, msg) { bot.reply(msg, ""yello""); }); The above does just that. It registers to hear hello or hi when the bot is mentioned and then it fires the callback, which in this case just replies with a yello . In essence, we just performed the hello world of building and integrating a bot with Slack. A Conversation While hello world is nice, a modestly complex interaction such as a step-by-step conversation really isn't that much more difficult: controller.hears(['what', 'you'], ['mention'], function(bot,msg) { bot.startConversation(msg, function(err, convo) { convo.say('I help you track vehicle maintenance.'); convo.say('You tell me about your vehicle and how much you drive.'); convo.say('then I\'ll keep track of things and notify you when it\'s time for maintenance.' ); convo.ask('Would you like to know more?', [ { pattern: bot.utterances.yes, callback: function(res, convo) { convo.say(""just tell me to 'add' so I can ask you a couple of questions""); convo.next(); } }, { pattern: bot.utterances.no, callback: function(res, convo) { convo.say(""awww""); convo.next(); } }, { default: true, callback: function(res, convo) { convo.repeat(); convo.next(); } } ]); }) }); Once again you register a top-level handler with controller.hears . It listens for what and you with the bot's name mentioned. When that is heard the callback will fire. In this instance it is the bot.startConversation that is most interesting because it starts a stateful flow with that particular user. Typically, this is the kind of construct which can be used to gather information for whatever it is that your app provides to your user. Analogous in some ways to an HTML form, yet this is more like a dynamic workflow. The above example does little more than give some overview as to what this particular bot might actually do. It's like a help message for the user. First, it gives a brief overview with the convo.say calls, then it asks a question. The ask can handle yes and no. If it doesn't get either it does the default and just asks again and again until it does get the yes or no so that it can continue. Truly, not very smart but still a start and a base from which many smarts can be built up. A Multi Step Conversation A FOUNDATION TO BUILD UPON This example of creating a bot which has a presence that can react to textual messages is the foundation of this next revolution.
While the examples above are simplistic they do provide some structure and a view into the basic text lines of voice and chat applications. These are the starting points for much more sophisticated applications. Botkit itself has support for plugging in middleware which can pre- and post-process messages. It would be normal to extend an application with functionality that does deep language analysis or some kind of machine learning in the recognize and trigger portions of the above. Throw in some user context of location and schedules and even some limited knowledge that a digital assistant might have about an individual and the possibilities become plentiful indeed. SOME LINKS 1. Botkit 2. Hubot, an alternative from GitHub 3. Slack API 4. Twilio messaging 5. Facebook Messenger 6. Alexa 7. Code Example on Github",Botkit provides a simple framework to handle the basics of creating a chat application. What follows is an example which is not even a toy app but we will carry it no further. ,The Conversational Interface is the New Paradigm,Live,24 69,"CREATING THE DATA SCIENCE EXPERIENCE IBM Analytics Published on Jun 7, 2016 Want to learn more about how we created the Data Science Experience? We've interviewed hundreds of data scientists and analyzed how they think, how they learn, how they build off the work of others, and how they get feedback to improve their input. Data scientists need a one stop shop environment that enables them to learn, create and collaborate, and that's where the Data Science Experience comes in. We think you're going to love it. Learn more about the Data Science Experience at http://ibm.co/data-science Subscribe to the IBM Analytics Channel: https://www.youtube.com/subscription_... The world is becoming smarter every day, join the conversation on the IBM Big Data & Analytics Hub: http://www.ibmbigdatahub.com https://www.youtube.com/user/ibmbigdata https://www.facebook.com/IBManalytics https://www.twitter.com/IBMbigdata https://www.linkedin.com/company/ibm-...
https://www.slideshare.net/IBMBDA","Want to learn more about how we created the Data Science Experience? We've interviewed hundreds of data scientists and analyzed how they think, how they lear...",Creating the Data Science Experience,Live,25 75,"GOOGLE RESEARCH BLOG The latest news from Research at Google USING MACHINE LEARNING TO PREDICT PARKING DIFFICULTY Friday, February 03, 2017 Posted by James Cook, Yechen Li, Software Engineers and Ravi Kumar, Research Scientist "" When Solomon said there was a time and a place for everything he had not encountered the problem of parking his automobile.
"" - Bob Edwards , Broadcast Journalist Much of driving is spent either stuck in traffic or looking for parking . With products like Google Maps and Waze , it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes. Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly. Providing this feature required addressing some significant challenges: * Parking availability is highly variable, based on factors like the time, day of week, weather, special events, holidays, and so on. Compounding the problem, there is almost no real time information about free parking spots. * Even in areas with internet-connected parking meters providing information on availability, this data doesn’t account for those who park illegally, park with a permit, or depart early from still-paid meters. * Roads form a mostly-planar graph, but parking structures may be more complex, with traffic flows across many levels, possibly with different layouts. * Both the supply and the demand for parking are in constant flux, so even the best system is at risk of being outdated as soon as it’s built. To face these challenges, we used a unique combination of crowdsourcing and machine learning (ML) to build a system that can provide you with parking difficulty information for your destination, and even help you decide what mode of travel to take — in a pre-launch experiment, we saw a significant increase in clicks on the transit travel mode button, indicating that users with additional knowledge of parking difficulty were more likely to consider public transit rather than driving. Three technical pieces were required to build the algorithms behind the parking difficulty feature: good ground truth data from crowdsourcing, an appropriate ML model and a robust set of features to train the model on. Ground Truth Data Gathering high-quality ground truth data is often a key challenge in building any ML solution. We began by asking individuals at a diverse set of locations and times if they found the parking difficult. But we learned that answers to subjective questions like this produces inconsistent results - for a given location and time, one person may answer that it was “ easy ” to find parking while another found it “ difficult. ” Switching to objective questions like “ How long did it it take to find parking? ” led to an increase in answer confidence, enabling us to crowdsource a high-quality set of ground truth data with over 100K responses. Model Features With this data available, we began to determine features we could train a model on. Fortunately, we were able to turn to the wisdom of the crowd , and utilize anonymous aggregated information from users who opt to share their location data, which already is a vital source of information for estimates of live traffic or popular times and visit durations . We quickly discovered that even with this data, some unique challenges remain. For example, our system shouldn’t be fooled into thinking parking is plentiful if someone is parking in a gated or private lot. Users arriving by taxi might look like a sign of abundant parking at the front door, and similarly, public-transit users might seem to park at bus stops. These false positives, and many others, all have the potential to mislead an ML system. So we needed more robust aggregate features. 
Perhaps not surprisingly, the inspiration for one of these features came from our own backyard in downtown Mountain View. If Google navigation observes many users circling downtown Mountain View during lunchtime along trajectories like this one, it strongly suggests that parking might be difficult: Our team thought about how to recognize this “fingerprint” of difficult parking as a feature to train on. In this case, we aggregate the difference between when a user should have arrived at a destination if they simply drove to the front door, versus when they actually arrived, taking into account circling, parking, and walking. If many users show a large gap between these two times, we expect this to be a useful signal that parking is difficult. From there, we continued to develop more features that took into account, for any particular destination, dispersion of parking locations, time-of-day and date dependence of parking (e.g. what if users park close to a destination in the early morning, but further away at busier hours?), historical parking data and more. In the end, we decided on roughly twenty different features along these lines for our model. Then it was time to tune the model performance. Model Selection & Training We decided to use a standard logistic regression ML model for this feature, for a few different reasons. First, the behavior of logistic regression is well understood, and it tends to be resilient to noise in the training data; this is a useful property when the data comes from crowdsourcing a complicated response variable like difficulty of parking. Second, it’s natural to interpret the output of these models as the probability that parking will be difficult, which we can then map into descriptive terms like “ Limited parking ” or “ Easy .” Third, it’s easy to understand the influence of each specific feature, which makes it easier to verify that the model is behaving reasonably. For example, when we started the training process, many of us thought that the “fingerprint” feature described above would be the “silver bullet” that would crack the problem for us. We were surprised to note that this wasn’t the case at all — in fact, it was features based on the dispersion of parking locations that turned out to be one of the most powerful predictors of parking difficulty. Results With our model in hand, we were able to generate an estimate for difficulty of parking at any place and time. The figure below gives a few examples of the output of our system, which is then used to provide parking difficulty estimates for a given destination. Parking on Monday mornings, for instance, is difficult throughout the city, especially in the busiest financial and retail areas. On Saturday night, things are busy again, but now predominantly in the areas with restaurants and attractions. Output of our parking difficulty model in the Financial District and Union Square areas of San Francisco. Red denotes a higher confidence that parking is difficult. Top row: a typical Monday at ~8am (left) and ~9pm (right). Bottom row: the same times but on a typical Saturday. We’re excited about the opportunities to continue to improve the model quality based on user feedback. If we are able to better understand parking difficulty, we will be able to develop new and smarter forms of parking assistance — we’re very excited about future applications of ML to help make transportation more enjoyable! 
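To make the pipeline described above concrete, here is a minimal sketch in Python with scikit-learn. It is emphatically not Google's production code: the feature matrix is random placeholder data, the 'fingerprint' helper and the probability cutoffs are assumptions for illustration, and only the 'Limited parking' and 'Easy' labels come from the post itself.

# Minimal sketch (not Google's pipeline): a logistic-regression parking-difficulty
# classifier over aggregate features, with its output probability mapped to
# descriptive labels. Data, feature names, and thresholds are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fingerprint_gap(expected_arrival_s, actual_arrival_s):
    # Hypothetical 'fingerprint' feature: extra seconds spent circling,
    # parking, and walking beyond a direct drive to the front door.
    return actual_arrival_s - expected_arrival_s

# X: one row per (destination, time window) with ~20 aggregate features, e.g.
# mean fingerprint gap, dispersion of parking locations, hour of day.
# y: 1 if the crowdsourced ground truth says parking took a long time, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                     # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def describe(prob_difficult):
    # 'Limited parking' and 'Easy' appear in the post; the middle bucket and
    # the cutoff values here are invented.
    if prob_difficult > 0.7:
        return 'Limited parking'
    if prob_difficult > 0.4:
        return 'Medium'
    return 'Easy'

probs = model.predict_proba(X_test)[:, 1]
print(describe(probs[0]), round(model.score(X_test, y_test), 3))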
","Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.",Using Machine Learning to predict parking difficulty,Live,26 77,"GETTING THE BEST PERFORMANCE WITH PYSPARK (Apache Spark channel). Published on Jun 16, 2016.
",This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you.,Getting The Best Performance With PySpark,Live,27 81,"
","In this paper, we propose gcForest, a decision tree ensemble approach with performance highly competitive to deep neural networks. ",Deep Forest: Towards An Alternative to Deep Neural Networks,Live,28 82,"IBM Data Science Experience, Jan 27
EXPERIENCE IOT WITH COURSERA
I'm very happy and proud to announce that IBM is the first non-academic supplier to offer a data science course on Coursera. We've worked very hard to make this course a great learning experience for anyone interested in data science and IoT, and IBM Data Science Experience is central to the course. It has been great to see all of our hard work pay off. In addition to launching our first Coursera course, Exploring and Visualizing IoT Data, on January 9, 2017, we also kicked off a data science degree program. Since my team and I are working for the IBM Watson IoT division, and IoT is one of the most prominent disruptors in that space, the choice was obvious: create a course on exploring and visualizing IoT data. The course is applicable to any time series problem, including stock exchange data or social media streams, and even non-time-series data. Those interested in learning more about the hardware and cloud data integration part of this topic might want to have a look at the course A developer's guide to the Internet of Things (IoT). I really would have loved to immediately start with artificial intelligence methods for IoT time-series forecasting and anomaly detection, but this would have been the wrong starting point of the journey. To help guide you through that journey, we decided to create a data science degree (in Coursera terms, a specialization), and the courses mentioned above will set the stage and make you familiar with technologies like message brokers, NoSQL databases, Object Storage, Apache SparkSQL, Python and Matplotlib. Using that technology stack, we introduce statistical measures to gain insight on IoT data and learn how to visualize it. Having laid the foundation with the 1st course, we are currently creating a 2nd course on IoT time-series analysis using Apache Spark 2.0 Structured Streaming on the highly optimized Tungsten and Catalyst engines. We will teach you how to detect anomalies and predict future events using advanced statistical methods. Then finally, the last course will talk about artificial intelligence methods using deep learning frameworks — autoencoders and recurrent LSTM networks for anomaly detection and forecasting. So stay tuned! And take the course to start your journey :)
Course links:
https://www.coursera.org/learn/developer-iot
https://www.coursera.org/learn/exploring-visualizing-iot-data
Originally published at datascience.ibm.com on January 27, 2017 by Romeo Kienzler.
* Object Storage * NoSQL * Python * IoT * Education
","I’m very happy and proud to announce that IBM is the first non-academic supplier to offer a data science course on Coursera. We’ve worked very hard to make this course a great learning experience for…",Experience IoT with Coursera,Live,29 84,"HOW OPEN API ECONOMY ACCELERATES THE GROWTH OF BIG DATA AND ANALYTICS
Tags: API, Big Data Analytics, Open Data
An open API is available on the internet for free. We review the growth of the API economy and how organizations have been realizing the potential of open APIs in transforming their business. By Kaushik Pal, TechAlpine.
The already huge world of big data and analytics has got a boost in the form of open Application Programming Interfaces (APIs). The use of open APIs has been generating huge volumes of big data. Since open APIs are now accessed by the general public, mainly via apps and software programs, this has resulted in exponential growth of data. Open APIs are also contributing to the creation of analytics because a group of APIs now have cognitive abilities which enable them to deliver analytics to systems. The growth of open APIs and other APIs has given birth to the term “API economy”. Prominent business houses such as Google and Yahoo have been offering public APIs for different purposes such as weather updates and traffic management.
What is an open API?
An open API is made available on the Internet and is available for use free of cost. For example, a startup software company specialized in the insurance domain may make its underwriting calculation software available as an open API. Interested third-party developers may access the calculation software as per the terms and conditions of the API availability. The third-party developers may use the calculator in any manner unless they are bound by specific terms and conditions of API usage. Usually, open APIs are not bound by any terms and conditions. An open API provides benefits to both its owner and its users. For the owner, whenever the open API is used, its products and services are getting publicity while it retains ownership of the code. For the user, open APIs relieve third-party developers of the effort required to build an entire software program from scratch. The software the third-party developers are building is a mashup between the source software and new code.
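As a concrete illustration of the consumer side, the sketch below calls a hypothetical underwriting open API of the kind just described. The URL, parameters, and response fields are invented; a real provider would document its own, and the requests library is simply a common Python choice for HTTP calls.

# Minimal sketch of a third-party developer consuming a (hypothetical)
# underwriting open API. The endpoint and fields are invented for illustration.
import requests

API_URL = 'https://api.example-insurer.com/v1/underwriting/quote'  # hypothetical

def get_premium_quote(age, coverage_amount):
    # Call the open API and return the quoted annual premium from its JSON reply.
    resp = requests.get(API_URL,
                        params={'age': age, 'coverage': coverage_amount},
                        timeout=10)
    resp.raise_for_status()   # surface HTTP errors instead of failing silently
    return resp.json()['annual_premium']

if __name__ == '__main__':
    print(get_premium_quote(age=35, coverage_amount=250_000))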
Here are some sites that list many public APIs:
* Any-API, Documentation and Test Consoles for Over 225 Public APIs
* ProgrammableWeb API Directory, search over 15,000 APIs
* Wikipedia list of open APIs
* NASA open APIs
* Data.gov APIs
What is an API economy?
So huge and ubiquitous has been the emergence of open APIs that many experts have been using the term API economy to refer to the transactions taking place with the help of APIs. More and more organizations have been realizing the potential of open APIs in transforming their business and have been rolling out open APIs.
Impact of API economy on big data and analytics
So far, the impact of the open API economy on big data and analytics has been felt in the following four areas:
Growth in the data volume
The volume of data has grown even more with the growth of open APIs. Let us understand how open APIs have contributed to the growth of big data with the example of the online education domain. Online education is highly popular now; students use apps and websites to learn. The educational content is stored in different storage systems, and it is a tedious and difficult task to connect so many storage systems to the apps and also maintain them. In such a case, open APIs can really help. Open APIs can help apps and websites interact with different data storage systems. Now, when a student uses an app to access, say, interactive lessons on Java, an open API takes the request to the database, which sends the required data through the API after proper authorization, if applicable. Open APIs make it easy to connect to multiple data sources through apps. To access a data source, all that is needed is to call an API which delivers the requested information. More and more people are using open APIs because of the convenience they provide. Over time, the data volume has grown because more data is being generated: for example, student details, course details, student performance, and analysis and patterns.
Cognitive APIs
Cognitive APIs are a relatively new development in the world of APIs, and they are especially applicable to analytics. A cognitive API accepts a request in a certain format from a system and delivers it to another system. The recipient system provides analytics as a response, which is delivered to the requesting system. Cognitive APIs are capable of processing complex, unstructured data and delivering analytics. Many organizations use such APIs to create their own products and services.
Faster access to big data
APIs can provide big data applications faster access to data storage. This results in faster retrieval, processing and analytics. Such APIs can sit as a layer between distributed computing applications and storage.
APIs now available to the layman
There was a time when APIs were the exclusive territory of developers. Developers still know APIs in and out, but lay users have also been using APIs, albeit indirectly. People have been using apps which connect to the APIs. The APIs take requests and deliver responses from the server, which the user views. This factor has significantly accounted for the huge growth of big data.
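For the provider side, a minimal sketch of what 'making content available as an open API' can look like appears below, using the online-education example from earlier in this section. Flask is only an illustrative Python choice; the route, data, and fields are invented and are not from the article.

# Minimal sketch of exposing stored course content through a small open API so
# apps can fetch it with a single call. All names here are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the many storage systems an education app might otherwise
# have to integrate with directly.
LESSONS = {
    'java': [{'id': 1, 'title': 'Interactive Java basics'}],
    'python': [{'id': 2, 'title': 'Getting started with Python'}],
}

@app.route('/v1/lessons/<topic>')
def lessons(topic):
    # Return lesson metadata for a topic as JSON.
    return jsonify(LESSONS.get(topic.lower(), []))

if __name__ == '__main__':
    app.run(port=8080)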
Important Statistics
The statistics below establish that the API economy has been getting stronger and influencing big data and analytics.
* There is a difference between the growth prediction and the actual growth of public APIs, as per the ProgrammableWeb directory of APIs. This is shown by the image below. Source: nordicapis.com/tracking-the-growth-of-the-api-economy/
* However, the above estimate may be deceptive because there are other API directories too. Also, the impact of APIs is best gauged when they are consumed by third-party APIs. Such instances are not recorded often, but that does not diminish the importance of the APIs.
* The image below shows that the number of API calls has increased significantly over the years. Source: blog.mailchimp.com/10m-api-calls-per-day-more/
* As per Netflix, the number of requests the Netflix API has received over the years has increased exponentially. From less than 1 billion requests in January 2010, the number of requests grew to more than 20 billion requests in January 2011. Source: http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
* A significant development has been APIs becoming more inclusive. There was a time when only developers could understand APIs. Now, APIs can be used even by non-development people. Laymen are able to access APIs, albeit without their knowledge, through apps.
* Smartphones use countless mobile services that are built on APIs.
Conclusion
It seems that open APIs are synonymous with convenience, time savings and efficiency. There are good reasons that businesses consider open APIs an important business development tool. With due importance given to the other influences open APIs have had on big data and analytics, the involvement of the general public seems to be the most important driver of big data and analytics growth. Considering the present times, the API economy seems to be on course for explosive growth over the next few years, and it will redefine many businesses.
Related:
* Data Science and Cognitive Computing with HPE Haven OnDemand: The Simple Path to Reason and Insight
* HPE Haven OnDemand and Microsoft Azure Machine Learning: Power Tools for Developers and Data Scientists
* Machine Learning at your fingertips – 60+ free APIs, from HPE Haven OnDemand
",An open API is available on the internet for free. We review the growth of API economy and how organizations have been realizing the potential of open APIs in transforming their business. ,How open API economy accelerates the growth of big data and analytics,Live,30 87,"DATA SCIENCE EXPERIENCE: SIGN UP FOR FREE TRIAL (developerWorks TV). Published on Oct 3, 2017. Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
",This video shows you how to sign up for a free trial of IBM Data Science Experience (DSX).,Sign up for a free trial in DSX,Live,31 90,"A KAGGLER'S GUIDE TO MODEL STACKING IN PRACTICE
Ben Gorman | 12.27.2016
INTRODUCTION
Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Often times the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discredit each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different. Here I provide a simple example and guide on how stacking is most often implemented in practice. Feel free to follow this article using the related code and datasets here in the Machine Learning Problem Bible. This tutorial was originally posted here on Ben's blog, GormAnalysis.
MOTIVATION
Suppose four people throw a combined 187 darts at a board. For 150 of those we get to see who threw each dart and where it landed. For the rest, we only get to see where the dart landed. Our task is to guess who threw each of the unlabelled darts based on their landing spot.
K-NEAREST NEIGHBORS (BASE MODEL 1)
Let's make a sad attempt at solving this classification problem using a K-Nearest Neighbors model. In order to select the best value for K, we'll use 5-fold Cross-Validation combined with Grid Search where K = (1, 2, …, 30). In pseudo code:
1. Partition the training data into five equal-size folds. Call these test folds.
2. For K = 1, 2, …, 30
3. For each test fold
   1. Combine the other four folds to be used as a training fold
   2. Fit a K-Nearest Neighbors model on the training fold (using the current value of K)
   3. Make predictions on the test fold and measure the resulting accuracy rate of the predictions
   Calculate the average accuracy rate from the five test fold predictions
4. Keep the K value with the best average CV accuracy rate
With our fictitious data we find K=1 to have the best CV performance (67% accuracy). Using K=1, we now train a model on the entire training dataset and make predictions on the test dataset. Ultimately this will give us about 70% classification accuracy.
SUPPORT VECTOR MACHINE (BASE MODEL 2)
Now let's make another sad attempt at solving the problem using a Support Vector Machine. Additionally, we'll add a feature DistFromCenter that measures the distance each point lies from the center of the board, to help make the data linearly separable. With R's LiblineaR package we get two hyper parameters to tune:
TYPE
1. L2-regularized L2-loss support vector classification (dual)
2. L2-regularized L2-loss support vector classification (primal)
3. L2-regularized L1-loss support vector classification (dual)
4. support vector classification by Crammer and Singer
5. L1-regularized L2-loss support vector classification
COST
Inverse of the regularization constant
The grid of parameter combinations we'll test is the cartesian product of the 5 listed SVM types with cost values of (.01, .1, 1, 10, 100, 1000, 2000). That is:
type  cost
1     0.01
1     0.1
1     1
…     …
5     100
5     1000
5     2000
Using the same CV + Grid Search approach we used for our K-Nearest Neighbors model, here we find the best hyper-parameters to be type = 4 with cost = 1000. Again, we use these parameters to train a model on the full training dataset and make predictions on the test dataset. This'll give us about 61% CV classification accuracy and 78% classification accuracy on the test dataset.
STACKING (META ENSEMBLING)
Let's take a look at the regions of the board each model would classify as Bob, Sue, Mark, or Kate. Unsurprisingly, the SVM does a good job at classifying Bob's throws and Sue's throws but does poorly at separating Kate's throws and Mark's throws. The opposite appears to be true for the K-Nearest Neighbors model. HINT: Stacking these models will probably be fruitful. There are a few schools of thought on how to actually implement stacking. Here's my personal favorite applied to our example problem:
1. Partition the training data into five test folds
train
ID   FoldID  XCoord  YCoord  DistFromCenter  Competitor
1    5       0.7     0.05    0.71            Sue
2    2       -0.4    -0.64   0.76            Bob
3    4       -0.14   0.82    0.83            Sue
…    …       …       …       …               …
183  2       -0.21   -0.61   0.64            Kate
186  1       -0.86   -0.17   0.87            Kate
187  2       -0.73   0.08    0.73            Sue
2. Create a dataset called train_meta with the same row Ids and fold Ids as the training dataset, with empty columns M1 and M2.
Similarly create a dataset called test_meta with the same row Ids as the test dataset and empty columns M1 and M2.
train_meta
ID   FoldID  XCoord  YCoord  DistFromCenter  M1  M2  Competitor
1    5       0.7     0.05    0.71            NA  NA  Sue
2    2       -0.4    -0.64   0.76            NA  NA  Bob
3    4       -0.14   0.82    0.83            NA  NA  Sue
…    …       …       …       …               …   …   …
183  2       -0.21   -0.61   0.64            NA  NA  Kate
186  1       -0.86   -0.17   0.87            NA  NA  Kate
187  2       -0.73   0.08    0.73            NA  NA  Sue
test_meta
ID   XCoord  YCoord  DistFromCenter  M1  M2  Competitor
6    0.06    0.36    0.36            NA  NA  Mark
12   -0.77   -0.26   0.81            NA  NA  Sue
22   0.18    -0.54   0.57            NA  NA  Mark
…    …       …       …               …   …   …
178  0.01    0.83    0.83            NA  NA  Sue
184  0.58    0.2     0.62            NA  NA  Sue
185  0.11    -0.45   0.46            NA  NA  Mark
3. For each test fold {Fold1, Fold2, … Fold5}
3.1 Combine the other four folds to be used as a training fold
train fold1
ID   FoldID  XCoord  YCoord  DistFromCenter  Competitor
1    5       0.7     0.05    0.71            Sue
2    2       -0.4    -0.64   0.76            Bob
3    4       -0.14   0.82    0.83            Sue
…    …       …       …       …               …
181  5       -0.33   -0.57   0.66            Kate
183  2       -0.21   -0.61   0.64            Kate
187  2       -0.73   0.08    0.73            Sue
3.2 For each base model
M1: K-Nearest Neighbors (k = 1)
M2: Support Vector Machine (type = 4, cost = 1000)
3.2.1 Fit the base model to the training fold and make predictions on the test fold. Store these predictions in train_meta to be used as features for the stacking model
train_meta with M1 and M2 filled in for fold1
ID   FoldID  XCoord  YCoord  DistFromCenter  M1   M2   Competitor
1    5       0.7     0.05    0.71            NA   NA   Sue
2    2       -0.4    -0.64   0.76            NA   NA   Bob
3    4       -0.14   0.82    0.83            NA   NA   Sue
…    …       …       …       …               …    …    …
183  2       -0.21   -0.61   0.64            NA   NA   Kate
186  1       -0.86   -0.17   0.87            Bob  Bob  Kate
187  2       -0.73   0.08    0.73            NA   NA   Sue
4. Fit each base model to the full training dataset and make predictions on the test dataset. Store these predictions inside test_meta
test_meta
ID   XCoord  YCoord  DistFromCenter  M1    M2    Competitor
6    0.06    0.36    0.36            Mark  Mark  Mark
12   -0.77   -0.26   0.81            Kate  Sue   Sue
22   0.18    -0.54   0.57            Mark  Sue   Mark
…    …       …       …               …     …     …
178  0.01    0.83    0.83            Sue   Sue   Sue
184  0.58    0.2     0.62            Sue   Mark  Sue
185  0.11    -0.45   0.46            Mark  Mark  Mark
5. Fit a new model, S (i.e. the stacking model), to train_meta, using M1 and M2 as features. Optionally, include other features from the original training dataset or engineered features.
S: Logistic Regression (from the LiblineaR package, type = 6, cost = 100). Fit to train_meta
6. Use the stacked model S to make final predictions on test_meta
test_meta with stacked model predictions
ID   XCoord  YCoord  DistFromCenter  M1    M2    Pred  Competitor
6    0.06    0.36    0.36            Mark  Mark  Mark  Mark
12   -0.77   -0.26   0.81            Kate  Sue   Sue   Sue
22   0.18    -0.54   0.57            Mark  Sue   Mark  Mark
…    …       …       …               …     …     …     …
178  0.01    0.83    0.83            Sue   Sue   Sue   Sue
184  0.58    0.2     0.62            Sue   Mark  Sue   Sue
185  0.11    -0.45   0.46            Mark  Mark  Mark  Mark
The main point to take home is that we're using the predictions of the base models as features (i.e. meta features) for the stacked model. So, the stacked model is able to discern where each model performs well and where each model performs poorly. It's also important to note that the meta features in row i of train_meta are not dependent on the target value in row i because they were produced using information that excluded target_i in the base models' fitting procedure.
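Rendered as code, the procedure above looks roughly like the scikit-learn sketch below. This is a sketch under stated assumptions, not the article's original R workflow: KNeighborsClassifier and LinearSVC stand in for the R models used in the post, the data is random placeholder data, and cross_val_predict produces the out-of-fold meta features in one call.

# Minimal scikit-learn sketch of steps 1-6 above; data and labels are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 3))            # XCoord, YCoord, DistFromCenter
y_train = rng.integers(0, 4, size=150)         # Bob, Sue, Mark, Kate coded as 0..3
X_test = rng.normal(size=(37, 3))

base_models = {
    'M1': KNeighborsClassifier(n_neighbors=1),
    'M2': LinearSVC(C=1000, dual=False, max_iter=10000),
}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Steps 2-3: out-of-fold predictions become the meta features in train_meta.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=folds) for m in base_models.values()
])
# Step 4: refit each base model on all training data to fill test_meta.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict(X_test) for m in base_models.values()
])

# Steps 5-6: the stacker uses the meta features plus DistFromCenter.
stack_train = np.column_stack([train_meta, X_train[:, 2]])
stack_test = np.column_stack([test_meta, X_test[:, 2]])
stacker = LogisticRegression(max_iter=1000).fit(stack_train, y_train)
print(stacker.predict(stack_test)[:10])

Note that the base models are refit on the full training set only for the test-set meta features, which matches the first of the two approaches discussed here.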
Alternatively, we could make predictions on the test dataset using each base model immediately after it gets fit to each test fold. In our case this would generate test-set predictions for five K-Nearest Neighbors models and five SVM models. Then we would average the predictions per model to generate our M1 and M2 meta features. One benefit to this is that it's less time consuming than the first approach (since we don't have to retrain each model on the full training dataset). It also helps that our train meta features and test meta features should follow a similar distribution. However, the test metas M1 and M2 are likely more accurate in the first approach, since each base model was trained on the full training dataset (as opposed to 80% of the training dataset, five times, in the 2nd approach).
STACKED MODEL HYPER PARAMETER TUNING
So, how do you tune the hyper parameters of the stacked model? Regarding the base models, we can tune their hyper parameters using Cross-Validation + Grid Search just like we did earlier. It doesn't really matter what folds we use, but it's usually convenient to use the same folds that we use for stacking. Tuning the hyper parameters of the stacked model is where things get interesting. In practice most people (including myself) simply use Cross-Validation + Grid Search using the same exact CV folds used to generate the meta features. There's a subtle flaw to this approach – can you spot it? Indeed, there's a small bit of data leakage in our stacking CV procedure. Consider the 1st round of Cross-Validation for the stacked model. We fit a model S to {fold2, fold3, fold4, fold5}, make predictions on fold1 and evaluate performance. But the meta features in {fold2, fold3, fold4, fold5} are dependent on the target values in fold1. So, the target values we're trying to predict are themselves embedded into the features we're using to fit our model. This is leakage, and in theory S could deduce information about the target values from the meta features in a way that would cause it to overfit the training data and not generalize well to out-of-bag samples. However, you have to work hard to conjure up an example where this leakage is significant enough to cause the stacked model to overfit. In practice, everyone ignores this theoretical hole (and frankly I think most people are unaware it even exists!).
STACKING MODEL SELECTION AND FEATURES
How do you know what model to choose as the stacker and what features to include with the meta features? In my opinion, this is more of an art than a science. Your best bet is to try different things and familiarize yourself with what works and what doesn't. Another question is, what (if any) other features should you include for the stacking model in addition to the meta features? Again this is somewhat of an art. Looking at our example, it's pretty evident that DistFromCenter plays a part in determining which model will perform well. The KNN appears to do better at classifying darts thrown near the center, and the SVM model does better at classifying darts thrown away from the center. Let's take a shot at stacking our models using Logistic Regression. We'll use the base model predictions as meta features and DistFromCenter as an additional feature. Sure enough the stacked model performs better than both of the base models – 75% CV accuracy and 86% test accuracy. Now let's take a look at its classification regions overlaying the training data, just like we did with the base models. The takeaway here is that the Logistic Regression Stacked Model captures the best aspects of each base model, which is why it performs better than either base model in isolation.
STACKING IN PRACTICE
To wrap this up, let's talk about how, when, and why you might use stacking in the real world. Personally, I mostly use stacking in machine learning competitions on Kaggle.
In general, stacking produces small gains with a lot of added complexity – not worth it for most businesses. But stacking is almost always fruitful, so it's almost always used in top Kaggle solutions. In fact, stacking is really effective on Kaggle when you have a team of people trying to collaborate on a model. A single set of folds is agreed upon and then every team member builds their own model(s) using those folds. Then each model can be combined using a single stacking script. This is great because it prevents team members from stepping on each other's toes, awkwardly trying to stitch their ideas into the same code base. One last bit. Suppose we have a dataset with (user, product) pairs and we want to predict the probability that a user will purchase a given product if he/she is presented an ad with that product. An effective feature might be something like: using the training data, what percent of the products advertised to a user did they actually purchase in the past? So, for the sample (user1, productA) in the training data, we want to tack on a feature like UserPurchasePercentage, but we have to be careful not to introduce leakage into the data. We do this as follows:
1. Split the training data into folds
2. For each test fold
   1. Identify the unique set of users in the test fold
   2. Use the remaining folds to calculate UserPurchasePercentage (percent of advertised products each user purchased)
   3. Map UserPurchasePercentage back to the training data via (fold id, user id)
Now we can use UserPurchasePercentage as a feature for our gradient boosting model (or whatever model we want). Effectively what we've just done is built a predictive model that predicts user_i will purchase product_x with probability based on the percent of advertised products they purchased in the past, and used those predictions as a meta feature for our real model. This is a subtle but valid and effective form of stacking – one which I often do implement in practice and on Kaggle.
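A minimal pandas sketch of that fold-based feature appears below. The column names, toy rows, and round-robin fold assignment are invented for illustration; the only point is that each row's UserPurchasePercentage is computed from the other folds, never its own.

# Leakage-free UserPurchasePercentage: compute each user's purchase rate from
# the other folds and map it back onto the held-out fold's rows.
import pandas as pd
import numpy as np

train = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2, 3, 3, 3],
    'product_id': [10, 11, 12, 10, 13, 11, 12, 14],
    'purchased':  [1, 0, 1, 0, 1, 0, 0, 1],
})
k = 4
train['fold_id'] = np.arange(len(train)) % k     # toy round-robin fold assignment

train['user_purchase_pct'] = np.nan
for fold in range(k):
    in_fold = train['fold_id'] == fold
    # Purchase rate per user, using only the *other* folds...
    rates = train.loc[~in_fold].groupby('user_id')['purchased'].mean()
    # ...then mapped back onto the held-out fold (NaN if the user is unseen there).
    train.loc[in_fold, 'user_purchase_pct'] = (
        train.loc[in_fold, 'user_id'].map(rates).values
    )

print(train)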
BIO
I'm Ben Gorman – math nerd and data science enthusiast based in the New Orleans area. I spent roughly five years as the Senior Data Analyst for Strategic Comp before starting GormAnalysis. I love talking about data science, so never hesitate to shoot me an email if you have questions: bgorman@gormanalysis.com. As of September 2016, I'm a Kaggle Master ranked in the top 1% of competitors world-wide.
",Stacking is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Often times the stacked model will outperform each of the individual mo…,A Kaggler's Guide to Model Stacking in Practice,Live,32 91,"Working Vis","Analytics and visualization often go hand-in-hand. One of the great things about notebooks such as IPython/Jupyter is that they provide a single interface to numerous data analysis technologies that often can be used together. So, using Brunel within notebooks is a very natural fit. For example, I can use a wide variety of python libraries to cleanse, shape and analyze data–and then use Brunel to visualize those results.",Using Brunel in IPython/Jupyter Notebooks,Live,33 96,"Steve Moore (parent, playwright, artistic director of Physical Plant Theater, and IBMer), Jul 10
NEW MENTAL MODELS FOR MACHINE LEARNING: PART 1
Machine learning has already extended into so many aspects of daily life that it can be handy for us to simply memorize a set of go-to examples of its impact on certain industries. For instance, we might think of fraud detection as the canonical example of machine learning in the financial sector. Or we might think of Watson's cognitive approach to oncology as the canonical example of machine learning in healthcare. Or, yet again, we might point to recommendation engines at Netflix and Amazon as canonical examples of machine learning in retail. Certainly, those are tremendous demonstrations of the power of the technology — and in aggregate, they give a sense of machine learning's pervasive presence in our lives. But the convenience of go-to examples might come at a cost. In particular, citing the same handy examples might keep us from noticing the wide diversity of machine learning use cases within individual sectors.
This post is the first in a series aimed at shaking up our intuitions about the things that machine learning is making possible in specific sectors — to look beyond the same set of use cases that always come to mind. Let's start with Government…
1. BRINGING ML TO ENVIRONMENTAL PROTECTION
As much as any commercial sector, Government is under constant pressure to do more with less, to serve more constituents more effectively and more intelligently. That includes agencies tasked with environmental protection like the DCMR Milieudienst Rijnmond, which battles pollution, waste, and other environmental threats for the region surrounding Rotterdam in the Netherlands. By combining various IBM Analytics software, a strong partnership with the Dutch security firm DataExpert, and a suite of remote sensors, the team can use machine learning to help identify and evaluate environmental hazards in real time — and sort the hazards by severity and urgency. By detecting and assessing environmental threats algorithmically, the system can identify key risks and lack of compliance. Automating and improving that aspect of their work can give the DCMR more time and energy for other action that could boost public safety.
2. ML AND JOB SECURITY FOR BELGIANS
In the same corner of Europe, an employment and vocational agency called VDAB is striving to give workers in Belgium's Flanders region the information and resources they need to find and keep work. Thankfully, unemployment in Belgium is falling — from 8.2% to 6.8% in the last year — but even at 6.8%, there's clearly more work to do. One of the agency's key goals is reducing the duration of unemployment for young workers while finding ways to direct limited resources where they're truly needed. The machine learning solution: an ML model crafted by IBM Global Business Services that crunches past data to predict the duration of unemployment for each job seeker. By focusing attention on the young Belgians most at risk, the agency can do more to interrupt the patterns of joblessness and kick off self-reinforcing steps toward job security — a long-term boon to the economy at large.
3. ML IN THE FIGHT TO FEED THE YOUNG
Halfway around the world, we find the Instituto Colombiano de Bienestar Familiar, a children and family welfare organization working nationwide in Colombia for the prevention and protection of early childhood, childhood, adolescence and the welfare of families. On a tight budget, the organization still manages to reach more than 8 million Colombians with its programs and services. Among those 8 million, 38,730 in 2016 were malnourished children who received 29,552 emergency food rations and more than five million dietary supplements. That work didn't happen by accident. Behind the scenes, the analytics firm Infórmese used IBM SPSS Modeler to provide predictive analytics and micro-targeting capabilities that optimize the distribution of aid to Colombia's poorest and most remote areas.
GOOD GOVERNANCE
Governments and their agencies across the world are using machine learning at the national and local level to do more than process tax returns or make the buses run on time. Let's put these three new examples in our tool belts as we continue to advocate for machine learning — and as we look for new ways to bring its capabilities to bear.
* Machine Learning
","Machine learning has already extended into so many aspects of daily life that it can be handy for us to simply memorize a set of go-to examples of its impact on certain industries. For instance, we…",Top 10 Machine Learning Use Cases: Part 1,Live,34 99,"Nick Kasten (Computer Science / Math Student @ Texas State University), Aug 30
GAZE INTO MY REDDIT CRYSTAL BALL: USING WATSON MACHINE LEARNING TO PREDICT A POST'S POTENTIAL
Editor's note: This article is part of an occasional series by the 2017 summer interns on the Watson Data Platform developer advocacy team, depicting projects they developed using Bluemix data services, Watson APIs, the IBM Data Science Experience, and more.
Reddit is a social news-aggregation and discussion forum that receives millions of new posts every day. Some of these posts are links or images, but some contain only text, and usually serve to request/provide information or spark some kind of discussion. Users on the site can “upvote” or “downvote” these posts, nudging the post's score by one in either a positive or negative direction. The end result of this system is a ranked list of posts for users to scroll through, divided into “subreddits” (subjects), with the posts having the highest scores situated at the top.
A look at the Reddit interface from the MachineLearning subreddit.
What if there were a way, using Watson Machine Learning and Watson Cognitive Services, to predict the score of a post before putting it on Reddit? Spoiler alert: there is! In this article, I'll describe an app I built to help with my Reddit game, and what I learned about machine learning in the process. I'll also share the code so you can try it yourself.
INTRODUCING THE REDDIT CRYSTAL BALL
The Reddit Crystal Ball is an app that predicts how high a score your post will receive on Reddit when posted at the current time. If there's a time later in the day at which the app thinks you could get a better score, you'll be notified of that as well.
This post would likely earn a higher score later in the day.
The app uses Watson's machine learning service to make its prediction, which is based on a few different factors:
* Subreddit
* Current Time of Day
* Average Word Size
* Watson Social Tone Analysis
I used these features to build a machine learning model with Spark ML, which I then deployed on Bluemix using the Watson ML service. This creates a “scoring endpoint,” which allows us to interact with and query our model through a REST API that can be accessed from any platform, using any programming language.
EVALUATING ALGORITHMS
To make predictions, the machine learning model uses an algorithm called K-Means Clustering to group similar posts into clusters. The clusters of posts are then analyzed to determine the average score for posts placed in each cluster, and the clusters are then separated into 4 groups: Low, Medium, High, and Great.
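Since the post names Spark ML and K-Means clustering but does not reproduce the notebook code here, the following is only a minimal PySpark sketch of that clustering step; the column names, toy rows, and k value are invented.

# Minimal Spark ML sketch of the clustering step described above (not the
# author's actual notebook code; all column names and values are placeholders).
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName('reddit-crystal-ball').getOrCreate()

# posts: one row per Reddit post with numeric features and the observed score.
posts = spark.createDataFrame(
    [(0.0, 13.5, 4.2, 0.7, 0.1, 152),
     (1.0, 2.0, 5.1, 0.2, 0.6, 3),
     (0.0, 20.0, 3.8, 0.5, 0.3, 48)],
    ['subreddit_idx', 'hour_of_day', 'avg_word_size', 'tone_joy', 'tone_anger', 'score'])

assembler = VectorAssembler(
    inputCols=['subreddit_idx', 'hour_of_day', 'avg_word_size', 'tone_joy', 'tone_anger'],
    outputCol='features')
model = KMeans(k=3, seed=1, featuresCol='features').fit(assembler.transform(posts))

# Average observed score per cluster; these averages would then be bucketed
# into Low / Medium / High / Great.
clustered = model.transform(assembler.transform(posts))
clustered.groupBy('prediction').agg(F.avg('score').alias('avg_score')).show()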
I initially attempted using decision-tree- and probabilistic-based algorithms like Random Forest and Naive Bayes to predict a specific score, but quickly learned that predicting an exact score was not going to work well given the constraints of this data set. Because I wanted to document the process of gathering data, processing it, and creating a machine learning model, I chose to build this project in a Jupyter Notebook . A notebook is an environment that allows documentation and executable code to live together, side-by-side, so it was perfect for this project. In a Jupyter Notebook, documentation and code live side-by-side inside cells.IMPLEMENTATION IN A DATA SCIENCE NOTEBOOK After stepping through my notebook , you’ll not only understand how the data was processed and used to train a model, but you’ll also be able to interact with that model and use it to make predictions on you own posts. These interactive elements are called PixieApps . A PixieApp is an app created with Python, using the PixieDust helper library , that runs in the notebook itself. Using the templating language Jinja 2 , it becomes relatively easy to create a nice UI that helps the data come alive. WHAT I [MACHINE] LEARNED After playing with the data and interacting with the model in the PixieApp, some interesting trends emerged. While all the features influenced the prediction, the most important were the choice of subreddit and the time a post was made. This makes sense, since different sections of the site are likely to be most active at different times, and it follows that posts would score higher during these periods of activity. At the same time, a post containing a link — which can drastically increase the average word size of a post — or posts that skew far in a certain direction in the tone analysis can be predicted to score higher or lower than solely based on subreddit and time alone. At the start of this project, machine learning was a completely foreign concept to me. Even the process of gathering, cleaning, and analyzing data was something I had little experience with. The great thing about notebooks on the IBM Data Science Experience is that you get the Pandas Python Data Analysis Library and the Spark engine out-of-the-box to get you started with small and large data science projects alike. Working with these tools, I was able to analyze Reddit post data set and experiment with different features in-depth. Now, I feel I have a much better grasp on what machine learning does, how it works, and the tools needed to work with large data sets. So check out the notebook and PixieApp I created, and let me know what you think here in the comments. You’ll be able to see the entire process of building and deploying the model, and you’ll have the opportunity to make predictions on your own posts. You might even find that it helps you create the perfect, high-scoring Reddit post of your dreams. To create your own crystal ball, load the notebook , complete the setup steps, and follow the instructions in the notebook cells. May your comments be plentiful, and your future filled with upvotes! Thanks to Patrick Titzler , Teri Chadbourne, CMP , and Mark Watson . * Machine Learning * Reddit * Ibm Watson * Jupyter Notebook * Cognitive Computing Blocked Unblock Follow FollowingNICK KASTEN Computer Science / Math Student @ Texas State University FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. 
","In this article, I'll describe an app I built to help with my Reddit game, and what I learned about machine learning in the process. I'll also share the code so you can try it yourself.",Gaze Into My Reddit Crystal Ball – IBM Watson Data Lab – Medium,Live,35 100,"DATA VISUALIZATION PLAYBOOK: DETERMINING THE RIGHT LEVEL OF DETAIL August 31, 2015 by Jennifer Shin Topics: Analytics, Big Data Research, Big Data Technology, Big Data Use Cases, Data Scientists Tags: data visualization, analytics, data science One of the most important steps for creating data visualizations is selecting which aspects, features or dimensions of the data to present—in other words, letting the data dictate the visualization. Unlike school assignments, data scientists and professionals rarely receive a project that provides the same clear guidance they received as children. There is no longer a teacher who assigns a bar chart; instead, data scientists are expected to find insights that will enlighten managers and colleagues. Utilizing data science can be beneficial to anyone interested in an effective visualization. This article shows how data science can be used to create effective data visualizations by focusing on one key question every data scientist needs to ask: What level of detail should I show in my visualization? To demonstrate the importance of this question, consider the following scenario. A researcher is conducting an experiment and records the date, time and a measurement at 6 a.m., 2 p.m. and 8 p.m. every day for a month. How can this data set be visualized? STEP 1: CREATING VISUAL REPRESENTATIONS The most direct way to present the data is to plot each data point. In Figure 1, each measurement recorded over the course of the study is plotted against the date and time using a bar chart. Figure 1: Bar chart of each measurement recorded over the course of the study. Bar charts can seem simple and easy to use, but selecting the wrong data can impact the effectiveness of any visualization. With close to 100 data points in Figure 1, including every data point makes it difficult to gain significant insight without further analysis. If plotting each data point doesn't provide meaningful insight, consider using summary statistics to gather information and as a starting point for finding useful patterns in the data set. In certain cases, visualizing summary statistics may be sufficient for presenting information. For example, a chart showing the average temperature for each month can be an effective presentation of the seasonal weather changes for a geographic region. STEP 2: DIGGING INTO THE DATA In the previous step, Figure 1 fell short of presenting usable insights. To get better insights, you can use summary statistics to analyze the data points directly or evaluate the visualization.
Either approach allows data scientists to explore potential patterns in the data set, as shown below. For data scientists who prefer to work directly with the data set, daily or weekly averages can present an effective overview by splitting up the data set into different levels. Figure 2a shows the daily average for the first seven days and the difference between the daily averages and the weekly average. The table shows that the difference between the daily average and the weekly average stands out on the sixth day, when the daily average is significantly higher than the weekly average. With the discovery of unusual behavior in the first week, it's easy to check whether the pattern is consistent during the other weeks of the study. Figure 2a: Measurements for the first seven days, including the daily average and the difference between the daily average and the weekly average. For data scientists who prefer to work with visualization, the bar chart in Figure 1 can serve as a valuable source for insights. Figure 2b shows the measurements for the first seven days of the study with the average for this period represented by the horizontal red line. Similar to the previous step, the values for the sixth day are significantly different from the values for the other days in the study. Figure 2b: Measurements recorded during the first 7 days. The red line represents the average measurement recorded. STEP 3: REVISING THE CHART Since both the data set and the original visualization revealed that the data peaked on the sixth day, Figure 1 can be revised to determine if this pattern is consistent throughout the study. Specifically, in Figure 3, the three measurements recorded each day are represented as one averaged daily value, which shows that the measurement values peak each week on the same weekday. Figure 3: Bar chart of the average measurement recorded daily. APPLY WITH CAUTION While averages can be useful for data mining, using this approach too liberally can inadvertently result in hiding valuable information. By replacing daily averages with weekly averages, Figure 4a no longer shows the peaks that occur on days 6, 13, 20 and 27—and the measurements are so close that the chart suggests there is very little variability in the data. Figure 4a: Bar chart of the average measurement recorded weekly. Conceptually, calculating the average of a set of numbers is similar to redistributing the amounts evenly across these values until each one is equal. For instance, finding the average of 8 and 12 can be thought of as taking 2 from 12 and adding it to 8 so that the two values both equal 10, which is the average of the two numbers. Hence, if a set of numbers includes extreme values, averaging these terms can result in the loss of vital information. Remember that using a “one-size-fits-all” approach can increase the chances of hiding or missing important insights. Creating alternative visualizations of the measurements by time, as in Figure 4b, will minimize this risk and open up the possibility of finding new patterns. Figure 4b: Line chart of the measurements by time. Discover how the IBM advanced analytics portfolio can help you find patterns and derive insights by visually exploring data.
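Before leaving the topic of averages, here is a small pandas sketch that makes the trade-off concrete. It is an illustration only, using made-up measurements rather than the study's actual data, but it mimics three readings per day with a spike every sixth day and shows how the weekly mean hides what the daily mean still reveals:

# Illustrative only: synthetic measurements at 6 a.m., 2 p.m. and 8 p.m. for 28 days.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = np.repeat(np.arange(1, 29), 3)                  # day index, three readings per day
values = rng.normal(loc=50, scale=2, size=days.size)
values[days % 7 == 6] += 15                            # spikes on days 6, 13, 20 and 27

df = pd.DataFrame({'day': days, 'value': values})
daily = df.groupby('day')['value'].mean()              # daily averages keep the spikes
weekly = daily.groupby((daily.index - 1) // 7).mean()  # weekly averages smooth them away

print(daily.loc[[5, 6, 7]].round(1))   # day 6 clearly stands out
print(weekly.round(1))                 # the four weekly values look nearly identical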
",Here's a quick and handy guide to creating data visualizations that are appropriately detailed to ensure maximum effectiveness.,Data visualization playbook: The right level of detail,Live,36 102,"Glynn Bird, Developer Advocate @ IBM Watson Data Platform CREATE A CUSTOM DOMAIN FOR CLOUDANT USING CLOUDFLARE WHAT'S IN A NAME? PROXY TO GET SPEED AND PROTECTION TOO. When signing up for an IBM Cloudant account through cloudant.com, you pick a username, which becomes the sub-domain of cloudant.com, e.g. janedoe.cloudant.com. If you create a Cloudant service inside Bluemix, then you are assigned a randomly-generated sub-domain like dd4f-de8e79e7--9652-4d92-fd347be5b308-bluemix.cloudant.com. If you want to assign a custom domain to your Cloudant account, you could perform the DNS magic yourself, but it would leave you with the responsibility of creating an HTTPS certificate for your domain. A much simpler alternative is to sign up for a Cloudflare account and let them do the heavy lifting! Cloudflare is a proxy service that sits between your users and your website, handling caching, immunity to denial-of-service attacks, analytics, content optimisation, and lots more. In this case, we are going to place Cloudflare in front of a Cloudant account. This article assumes you have your own custom domain name already (like janedoe.com) and have already signed up for a Cloudant account (like janedoe.cloudant.com). We want to create a new sub-domain, db.janedoe.com, which will work with HTTPS and whose traffic will be sent to Cloudant. SIGN UP FOR CLOUDFLARE Visit www.cloudflare.com and create an account. Enter your custom domain name and let Cloudflare perform its initial scan. ADD A CNAME RECORD Once the Cloudflare scan of your existing domain is complete, we can tell Cloudflare that we wish to proxy db.janedoe.com to janedoe.cloudant.com. To do this, we create a CNAME record by completing the form: here, we choose the CNAME type from the pull-down list and enter the new sub-domain ( db ) and our target ( janedoe.cloudant.com ).
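As a quick sanity check (an aside, not part of the original walkthrough), the new sub-domain should now resolve from Python; because Cloudflare proxies the traffic, the address returned is typically a Cloudflare edge IP rather than Cloudant's own:

# Check that the example sub-domain created above now resolves.
# With Cloudflare proxying enabled, expect a Cloudflare edge address, not Cloudant's.
import socket

print(socket.gethostbyname('db.janedoe.com'))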
TELL CLOUDANT ABOUT YOUR DOMAIN NAME Cloudant also needs to know about this new naming strategy. In the Cloudant dashboard, select Account > Virtual Hosts and complete the form. Enter your new domain name ( db.janedoe.com ) and click the Add Domain button. TESTING After a few minutes, you should be able to visit http://db.janedoe.com and https://db.janedoe.com (HTTPS may take up to 24 hours to take effect). That's it! Note: If you bind your proxied Cloudant service to a Bluemix app, this mapping will not take effect because the VCAP_SERVICES entry for Cloudant will not reflect the new domain name. BENEFITS OF USING CLOUDFLARE AND CLOUDANT Cloudflare offers several benefits for Cloudant users: * HTTP2. Cloudflare supports HTTP2/SPDY out of the box, so requests from HTTP2-compatible sources (like Google Chrome) benefit from the smaller binary protocol, the single multiplexed connection, and the compressed headers that HTTP2 affords. * Free HTTPS. Your custom domain can be covered by a free HTTPS certificate without any fuss. * DDoS protection. If you are paying for a quota of Cloudant requests, then the last thing you want is for a bad actor to maliciously call your Cloudant account directly at your expense. * Compression. Traffic between the browser/user-agent and Cloudflare can be compressed, reducing the amount of bandwidth required to transmit or receive requests. * Caching. If you upgrade to a paid plan, you can customise Cloudflare to cache certain requests to improve performance or to take some load off your Cloudant service. * Analytics. You can see statistics on which URLs are being hit. CONCLUSION That's how easy it is to set up a custom domain for your Cloudant service. If you use Cloudant on Bluemix, the process is the same. (To reach your Cloudant dashboard from Bluemix, just open the service and click Launch.) Then follow the steps outlined in this post. Enjoy your new custom domain, along with all the benefits of Cloudflare. Tags: Cloudant, Cloudflare, Tutorial, Proxy, DNS","When you customise your Cloudant domain with Cloudflare, you get better performance, DDoS protection, and caching too. Here's how to set it up.",Create a Custom Domain for Cloudant Using Cloudflare – IBM Watson Data Lab,Live,37 105,"The primary index is the fastest way to retrieve data from your database. To demo the Cloudant API, you'll need to replicate a small sample database into your account. The database is named animaldb, and it contains information from Wikipedia about ten different animals. The primary index is fast because it comes with every Cloudant database, which means you don't have to write any code before you can use it. The primary index, often referred to as _all_docs, returns an id, a key and a value for every document in the database. The id and key are the same (Cloudant makes an index keyed by doc id), while the value is the _rev of the document. _all_docs also reports on the total number of documents and any offset used to query the index.
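As a concrete picture of what that primary index returns, here is a minimal Python sketch, not part of the original interactive tutorial, that issues the generic _all_docs request against the animaldb sample database; the account name and credentials are placeholders for your own:

# Minimal sketch: read the primary index of the animaldb sample database.
# 'janedoe' and 'secret' are placeholders for a real account name and password or API key.
import requests

account = 'janedoe'
url = 'https://{0}.cloudant.com/animaldb/_all_docs'.format(account)

resp = requests.get(url, auth=(account, 'secret'))
resp.raise_for_status()
body = resp.json()

print('total_rows:', body['total_rows'], 'offset:', body['offset'])
for row in body['rows'][:3]:
    # id and key are the document _id; value carries the current _rev.
    print(row['id'], row['value']['rev'])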
All indexes are sorted by their key. The full sort-order specification is documented in the CouchDB Wiki. The generic _all_docs request above returns all the documents in the database. That's fine for this example database, but in a realistic scenario you'll probably want a more manageable result set. That's where API options come in. Add the limit parameter to keep your result set to a certain size. If you want to offset your result set (for example to paginate through some rows) you can also pass in a skip parameter. In this call, we limit the result set to 2 rows and skip the first 3 rows. Use slicing to pull out row ranges from the index by using start and end keys in your query. Here we are looking for animals with names that begin with letters greater than the startkey up to and including the endkey. If you don't want to include documents that match the end key, add the inclusive_end parameter with a value of false. View slicing with startkey and endkey can be combined with skip, limit and inclusive_end to further constrain your result set. Cloudant's primary index automatically turns a document's _id into its key. If you want a document matching a single key, find it with the key parameter. Here, we're looking for a document indexed with the key of ""llama"". You can also hit the document directly, without additional parameters, at its unique URL. The result is similar to the single key request we made above, but different in that all fields are included in the result. Use include_docs=true when you want all of the contents of the document you're requesting (not just the id). This API call uses include_docs=true along with limit and skip. You can also query for a specific set of keys by POSTing a JSON array of keys to the view. As we've seen, the _all_docs index can be a very useful view into your database, especially if your application has a natural unique identifier that you can use for your documents. As your data grows, you'll want to explore secondary indexes, which allow you to build additional indexes over your database, defined by efficient MapReduce views.","A guide to using Cloudant's _all_docs endpoint to retrieve documents by id, or within a range of keys using this interactive tutorial.",For Developers: Querying the Cloudant Primary Index,Live,38 106,"REPRODUCIBLE FINANCE WITH R: PULLING AND DISPLAYING ETF DATA by Jonathan Regenstein It's the holiday season, and that can mean only one thing: time to build a leaflet map as an interface to country Exchange Traded Fund ( ETF ) data! In previous posts, we examined how to import stock data and then calculate and display the Sharpe Ratio of a portfolio. Today, we're going to skip the calculations and focus on a nice interface for pulling and displaying data. Specifically, our end product will enable users to graph country ETF prices by clicking on those countries in an interactive map, instead of having to use the ETF ticker symbol. Admittedly, part of the motivation here is that I don't like having to remember ticker symbols for country ETFs, but hopefully others will find it useful too.
Our app will be simple in that it displays price histories, but it can serve as the foundation for more complicated work, as we will discuss when the app is completed in the next post. At the outset, it is crucial to note that this Notebook will serve a different purpose than our previous Notebook. As before, we will use this Notebook to test data import, wrangling, and our visualizations before taking the next step of building an interactive Shiny app. However, we are going to save objects from this Notebook into a .Rdat file, and then use that file in our app. In that way, this Notebook is more fundamentally connected to our app than our previous Notebook. In the next “finance Friday = fun day” post, we will go through how to build that app (though frankly the hard work occurs in this Notebook), but for today here is how we’ll proceed. First, we will get our ETF tickers, countries and year-to-date performance data into a nice, neat data frame. Note that the data frame will not hold the price history data itself. Rather, it will hold simply the ticker symbols, country names and YTD percentages. Next, we pass those ticker symbols to the getSymbols() function and download the price histories for the county ETFs. Advance warning: there are 42 country ETFs in this example, and downloading 42 xts objects takes time and RAM. I recommend using the server version of the IDE if you want to run this code, or truncate and grab three or four price histories, or skip this step. As we’ll see, it is not strictly necessary to pass all of those tickers to getSymbols() right now because the data will be downloaded on the fly when a user clicks on a country in our Shiny app. However, even though it requires a lot memory, I prefer to download all 42 price histories in order to confirm that the tickers are correct and accessible via getSymbols() . Better to find the typos now than to have users discover an error in the app. Once we have confirmed that our ticker symbols are valid, it’s time for step 3: build our map using a shapefile of the world’s countries. This step requires a lot of RAM, but leaflet makes the process quite simple from a coding perspective. If you’re new to map building, this will serve as a gentle introduction to creating a usable interactive map. Fourth, and very importantly, we will add our ETF tickers and year-to-date performance data to our shapefile, making them accessible via clicks on the map. At this step, we will be thankful that when we created a data frame in step 1, we used the same country names as appear on the map: that forethought will allow us to do an easy merge() of the data. We’ll then build the map to make sure it looks how we want it to look in the final app. Once we have a shapefile with our ETF tickers added, we’ll save it to a .RDat file that we can load into our Shiny app. Let’s get to it! Building an interface to country ETFs will require those ETF ticker symbols. We also need the country names to go alongside them. Why country names instead of, say, the full ETF title? We need a way to synchronize with our map file and country names is a good way. There’s no way to know this ahead of time without thinking through the structure of the app and probably making liberal use of a whiteboard. That valuable country ETF data is available here . Have a peek at that link and notice that the year-to-date performance is also readily available. I hadn’t planned on including YTD performance in any way, but we’ll grab it and put it to good use. 
That data is not available in the html, so simple rvest moves aren’t going to help us. There’s a download button, but I found it easier to copy/paste to a spreadsheet and then import to the IDE. I will spare us the gsub() pain of extracting country names from the fund titles (though direct message me if you want that code) and paste the tickers, country names and year-to-date performance below. The data frame looks pretty good, though quite simple, and it’s fair to wonder why I bothered to highlight this step with it’s own code chunk. In fact, getting the clean ticker and country names was quite time-consuming, and that will often be the case: the most prosaic data import and tidying tasks can take a long time! Here is another fine occasion to bring up reproducibility and work flow. Once you or your colleague has spent the time to get a clean data frame with ticker and country names, we definitely want to make sure that no one else, including your future self, has to duplicate the effort for a future project. I put this step in it’s own code chunk so that the path back to the clean data would be as clear as possible. For that reason, I also have a personal preference for the ‘DataGrab’ file naming convention – i.e., in the IDE, I named this file ‘Global-ETF-Map-DataGrab’. Whenever I use a Notebook for the purpose of importing, tidying, building and then saving objects in a .Rdat file that will be loaded by a Shiny app, I include ‘DataGrab’ in the name of the file. If future me or a team member needs to locate the file behind one of our flexdashboards, they will know that it has ‘DataGrab’ in the title. Back to the code at hand! Now that we have the tickers in a data frame column, we can use getSymbols() to import the price history of each ETF. We aren’t going to use the results of this import in the app. Rather, we are going to perform this import to test that we have the correct symbols, and that they play nicely with getSymbols() , because that is the function we will use in our Shiny app. Alright, it looks like we’ve been successful at importing the closing price history of the country ETFs. Nothing too complicated here and again, our purpose was to test that the ticker symbols are correct. We are not going to be saving these prices for future use. Now it’s time to build a map of the Earth! First, we will need a shapefile that contains the spatial polygons for the countries of the world. The next code chunk will grab a shapefile from naturalearthdata.com . That shapefile has the longitude and latitude coordinates for the world’s countries and some data about them. We’ll then use the readOGR() function from the rgdal package to load the shapefile into our global environment. Take a peek at the data frame portion of the shapefile, and scroll to the right to see some interesting things like GDP estimates and economic development stages. It’s pretty nice that the shapefile contains some economic data for us. The other portion of the shapefile is the spatial data: longitude and latitude coordinates. If you’re not a cartographer, don’t worry about those for now. If you’re not familiar with spatial data frames, that’s okay because neither am I. The leaflet package makes building a nice interactive map with these shapefiles relatively painless. Before building a map, let’s make use of the data that was included in our data frame. The ‘gpd_md_est’ column (which you can see in the data frame above) contains GDP estimates for each country. 
We’ll add some color to our map with shades of blue that are darker for higher GDPs and lighter for lower GDPs. We want something to happen when a user clicks a country. How about a popup with country name and stage of economic development? Again, that data is included in the shapefile we downloaded. Now we can use leaflet to build a world map that is shaded by GDP and displays a popup. Note the ‘layerId = ~name’ snippet below – it creates a layer of country names. We will change that later in an important way. The map looks good, but it sure would be nice if we could add the ETF ticker symbols and year-to-date data to the world spatial data frame object – and we can! Our ‘name’ column in the ETF data frame uses the same country naming convention as the ‘name’ column of the map, and those columns are both called ‘name’. Thus, we can use the merge() function from the sp package to add the ETF data frame to the spatial data frame. This is similar to a join using dplyr. The correspondence of country names wasn’t just luck – I had the benefit of having worked with this shapefile in the past, and made sure the country names matched up, and now you have the benefit of having worked with this shapefile. For any future project that incorporates a map like this, give some forethought to how data might need to be merged with the shapefile. The shapefile and the new data need a way to be matched. Country names usually work well. After the merging, the ticker symbols and year-to-date number columns will be added for each country that has a match in the ‘name’ column. For those with no match, the ‘ticker’ and ‘ytd’ columns will be filled with NA. Now that the ytd data is added, let’s shade the different countries according to the year-to-date performance of the country EFT, instead of by GDP as we did before. A nice side benefit of this new shading scheme: if a country has no ETF, it will remain an unattractive grey. The new shading is nice, but let’s also have the popup display the exact year-to-date performance percentage for any detail-oriented users. Now we’ll build another map that uses the year-to-date color scheme and popup, but we will make one more massively important change: we will change layerId = ~name to layerId = ~ticker to create a map layer of tickers. Why is this massively important? When we eventually create a Shiny app, we want to pass ticker symbols to getSymbols() based on a user click. The ‘layerId’ is how we’ll do that: when a user clicks on a country, we capture the ‘layerId’, which is a ticker name that we can pass to getSymbols() . But that is getting ahead of ourselves. For now, here is the new map: Fantastic: we have a map that is shaded by the YTD performance of country ETFs, and displays that YTD percentage in the popup. Notice the difference between this map and the previous map which was shaded by GDP: a user can quickly see which countries have ETFs and click to see more. The ‘world_etf’ shapefile is going to play a crucial role in our Shiny app, and the last step is to save it for use in our flexdashboard. Note that we are not going to save the ETF price data. It’s not needed in the interactive Shiny app because that data will be imported dynamically when a user clicks. That allows our dashboard to be constantly updated in real time. Remember that we loaded up the ETF data in this Notebook so that we could ensure that the ticker symbols play nicely with getSymbols() . 
Next time, we'll wrap this up into a Shiny app by way of flexdashboard, and that app will allow users to click on a country and graph the ETF history. The first thing we'll do in that file is load the .RDat file that we just created. There are two pieces of good news: first, we've already done the hard work of creating a map object, and the app coding is the fun part. Second, the work here does not need to be repeated for any future projects. If you or your team ever need to build a map of the world shaded by GDP estimates or ETF YTD performance, here it is. If you ever need the clean tickers, year-to-date performance or the time series data on these 42 country ETFs, here it is. See you soon! Jonathan Regenstein, 2016-12-14","Our app will be simple in that it displays price histories, but it can serve as the foundation for more complicated work, as we will discuss when the app is completed in the next post.
At the outset, it is crucial to note that this Notebook will serve a different purpose than our previous Notebook.",Pulling and Displaying ETF Data,Live,39 107,"Stats and Bots Follow Sign in / Sign up * Home * Subscribe * * 🤖 TRY STATSBOT FREE - Empower every department with data * Vadim Smolyakov Blocked Unblock Follow Following passionate about data science and machine learning https://github.com/vsmolyakov Aug 22 -------------------------------------------------------------------------------- ENSEMBLE LEARNING TO IMPROVE MACHINE LEARNING RESULTS HOW ENSEMBLE METHODS WORK: BAGGING, BOOSTING AND STACKING Ensemble learning helps improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model. That is why ensemble methods placed first in many prestigious machine learning competitions, such as the Netflix Competition, KDD 2009, and Kaggle. The Statsbot team wanted to give you the advantage of this approach and asked a data scientist, Vadim Smolyakov, to dive into three basic ensemble learning techniques. -------------------------------------------------------------------------------- Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), bias (boosting), or improve predictions (stacking). Ensemble methods can be divided into two groups: * sequential ensemble methods where the base learners are generated sequentially (e.g. AdaBoost). The basic motivation of sequential methods is to exploit the dependence between the base learners. The overall performance can be boosted by weighing previously mislabeled examples with higher weight. * parallel ensemble methods where the base learners are generated in parallel (e.g. Random Forest). The basic motivation of parallel methods is to exploit independence between the base learners since the error can be reduced dramatically by averaging. Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e. learners of the same type, leading to homogeneous ensembles . There are also some methods that use heterogeneous learners, i.e. learners of different types, leading to heterogeneous ensembles . In order for ensemble methods to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible. BAGGING Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement) and compute the ensemble: Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression . We can study bagging in the context of classification on the Iris dataset. We can choose two base estimators: a decision tree and a k-NN classifier. Figure 1 shows the learned decision boundary of the base estimators as well as their bagging ensembles applied to the Iris dataset. Accuracy: 0.63 (+/- 0.02) [Decision Tree] Accuracy: 0.70 (+/- 0.02) [K-NN] Accuracy: 0.64 (+/- 0.01) [Bagging Tree] Accuracy: 0.59 (+/- 0.07) [Bagging K-NN] The decision tree shows the axes’ parallel boundaries, while the k=1 nearest neighbors fit closely to the data points. 
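For readers who want to reproduce the idea, here is a minimal scikit-learn sketch. It approximates the experiment described around this figure (the exact settings live in the notebook linked later in the article), using 10 base estimators with 0.8 subsampling of rows and features:

# Minimal sketch: bagging a decision tree and a k-NN classifier on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

base_learners = [('Decision Tree', DecisionTreeClassifier()),
                 ('K-NN', KNeighborsClassifier(n_neighbors=1))]

for label, base in base_learners:
    bagged = BaggingClassifier(base, n_estimators=10,
                               max_samples=0.8, max_features=0.8)
    scores = cross_val_score(bagged, X, y, cv=5)
    print('Bagging %s accuracy: %.2f (+/- %.2f)' % (label, scores.mean(), scores.std()))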
The bagging ensembles were trained using 10 base estimators with 0.8 subsampling of training data and 0.8 subsampling of features. The decision tree bagging ensemble achieved higher accuracy in comparison to the k-NN bagging ensemble. K-NN are less sensitive to perturbation on training samples and therefore they are called stable learners. Combining stable learners is less advantageous since the ensemble will not help improve generalization performance.The figure also shows how the test accuracy improves with the size of the ensemble. Based on cross-validation results, we can see the accuracy increases until approximately 10 base estimators and then plateaus afterwards. Thus, adding base estimators beyond 10 only increases computational complexity without accuracy gains for the Iris dataset. We can also see the learning curves for the bagging tree ensemble. Notice an average error of 0.3 on the training data and a U-shaped error curve for the testing data. The smallest gap between training and test errors occurs at around 80% of the training set size. A commonly used class of ensemble algorithms are forests of randomized trees.In random forests , each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, instead of using all the features, a random subset of features is selected, further randomizing the tree. As a result, the bias of the forest increases slightly, but due to the averaging of less correlated trees, its variance decreases, resulting in an overall better model. In an extremely randomized trees algorithm randomness goes one step further: the splitting thresholds are randomized. Instead of looking for the most discriminative threshold, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows reduction of the variance of the model a bit more, at the expense of a slightly greater increase in bias. BOOSTING Boosting refers to a family of algorithms that are able to convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners− models that are only slightly better than random guessing, such as small decision trees− to weighted versions of the data. More weight is given to examples that were misclassified by earlier rounds. The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data. The algorithm below describes the most widely used form of boosting algorithm called AdaBoost , which stands for adaptive boosting. We see that the first base classifier y1(x) is trained using weighting coefficients that are all equal. In subsequent boosting rounds, the weighting coefficients are increased for data points that are misclassified and decreased for data points that are correctly classified. The quantity epsilon represents a weighted error rate of each of the base classifiers. Therefore, the weighting coefficients alpha give greater weight to the more accurate classifiers. The AdaBoost algorithm is illustrated in the figure above. 
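As a rough sketch of that procedure, and an assumption about how the figure's experiment might be reproduced rather than the exact code behind it, scikit-learn's AdaBoostClassifier with depth-1 decision stumps looks like this:

# Minimal sketch: AdaBoost with decision stumps (depth-1 trees) as the weak learners.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)        # a weak learner (decision stump)
ada = AdaBoostClassifier(stump, n_estimators=50,   # 50 boosting rounds
                         learning_rate=1.0)
scores = cross_val_score(ada, X, y, cv=5)
print('AdaBoost accuracy: %.2f (+/- %.2f)' % (scores.mean(), scores.std()))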
Each base learner consists of a decision tree with depth 1, thus classifying the data based on a feature threshold that partitions the space into two regions separated by a linear decision surface that is parallel to one of the axes. The figure also shows how the test accuracy improves with the size of the ensemble and the learning curves for training and testing data. Gradient Tree Boosting is a generalization of boosting to arbitrary differentiable loss functions. It can be used for both regression and classification problems. Gradient Boosting builds the model in a sequential way. At each stage the decision tree hm(x) is chosen to minimize a loss function L given the current model Fm-1(x): The algorithms for regression and classification differ in the type of loss function used. STACKING Stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. The base level models are trained based on a complete training set, then the meta-model is trained on the outputs of the base level model as features. The base level often consists of different learning algorithms and therefore stacking ensembles are often heterogeneous. The algorithm below summarizes stacking. The following accuracy is visualized in the top right plot of the figure above: Accuracy: 0.91 (+/- 0.01) [KNN] Accuracy: 0.91 (+/- 0.06) [Random Forest] Accuracy: 0.92 (+/- 0.03) [Naive Bayes] Accuracy: 0.95 (+/- 0.03) [Stacking Classifier] The stacking ensemble is illustrated in the figure above. It consists of k-NN, Random Forest, and Naive Bayes base classifiers whose predictions are combined by Logistic Regression as a meta-classifier. We can see the blending of decision boundaries achieved by the stacking classifier. The figure also shows that stacking achieves higher accuracy than individual classifiers and based on learning curves, it shows no signs of overfitting. Stacking is a commonly used technique for winning the Kaggle data science competition. For example, the first place for the Otto Group Product Classification challenge was won by a stacking ensemble of over 30 models whose output was used as features for three meta-classifiers: XGBoost, Neural Network, and Adaboost. See the following link for details. CODE In order to view the code used to generate all figures, have a look at the following ipython notebook . CONCLUSION In addition to the methods studied in this article, it is common to use ensembles in deep learning by training diverse and accurate classifiers. Diversity can be achieved by varying architectures, hyper-parameter settings, and training techniques. Ensemble methods have been very successful in setting record performance on challenging datasets and are among the top winners of Kaggle data science competitions. RECOMMENDED READING * Zhi-Hua Zhou, “Ensemble Methods: Foundations and Algorithms”, CRC Press, 2012 * L. Kuncheva, “Combining Pattern Classifiers: Methods and Algorithms”, Wiley, 2004 * Kaggle Ensembling Guide * Scikit Learn Ensemble Guide * S. 
Raschka, MLxtend library * Kaggle Winning Ensemble",Ensemble learning helps improve machine learning results by combining several models. Ensemble methods allow the production of better predictive performance compared to a single model. ,Ensemble Learning to Improve Machine Learning Results,Live,40 109,"TL;DR: It's easy to customise the Mongo shell's prompt. If you use MongoDB shell with one of MongoHQ's Elastic Deployments, you will have noticed that your replica set name is not as catchy as it could be. You will have noticed it because that replica set name appears in your Mongo shell prompt by default. So your prompt looks something like this: set-5345738b13a3efb950000d32:PRIMARY> That's a bit noisy and probably not telling you much unless you have excellent skills memorising and comparing hex strings. It is also telling you whether you are connected to the primary or secondary. That's at least 37 characters, and you are nearly halfway across the screen before you start typing. What would probably be more useful is a shorter customised prompt that tells you what you want to know. Here we'll show you how in a couple of steps. The prompt in Mongo shell is derived from a variable called prompt; if it's not defined, then the shell shows us its default. The best place to set that is in your .mongorc.js file, but before we do that let's see what we can do with it in the shell. The first thing you might want to do is just strip down the prompt. To do that you can set prompt to a string: set-5345738b13a3efb950000d32:PRIMARY> var prompt="">"" If you aren't keen on the minimalism, just delete prompt to return to the default prompt. You can always set the variable to something more informative. Let's put the database name into the prompt. Of course, we have to remember that the value of the prompt, when set like that, is unchanging. So if we put the time in the prompt like so: exemplum> var prompt=ISODate().toLocaleTimeString()+"">"" we get a prompt of 14:53:38> but it would always be 14:53:38. To have a dynamic prompt, we need to set prompt to a function, so that evaluating it will make the function return a newly calculated value. 14:53:38> var prompt=function(){ return ISODate().toLocaleTimeString()+"">""; } 14:56:35> 14:56:38> Now the time will update when it displays the prompt. We're using the time here as it's the most easily accessible changing value for all users. It could be any other statistic you want to display, but do remember that the statistic will be calculated every time the prompt is displayed. We're already getting to the point where we want to make this more permanent, so exit the shell and open an editor on your .mongorc.js file. You'll find it in your $HOME directory on Unix systems.
We can set the prompt variable in there and create our own, ever smarter, version by adding this to .mongorc.js:

var prompt=function() {
  var dbname=db.getName();
  var master=db.isMaster().ismaster;
  var dblabel=master?dbname:""(""+dbname+"")"";
  var time=ISODate().toLocaleTimeString();
  return dblabel+""/""+time+"">"";
}

Here, the dbname is shown with parentheses around it if we are talking to a secondary node, and no parentheses if the primary, and the local time is added to our compact prompt. That gives us a neat prompt: exemplum/15:20:30> And that replica set name we hid at the start? If you need that, just run db.isMaster().setName in the shell. Now you can go and customise your prompt to what you need and get it displayed in the compact (or verbose) form you prefer. And when you've come up with your perfect custom prompt, why not share it with us by mailing it to dj@mongohq.com and we'll publish the best in a future article.","It's easy to customize the Mongo shell's prompt, especially if you use MongoDB shell with one of MongoHQ's Elastic Deployments.",Customizing MongoDB's Shell with Compact Prompts,Live,41 110,"GETTING STARTED WITH COMPOSE'S SCYLLADB Published Sep 22, 2016. Getting started with ScyllaDB is easy since it is a drop-in replacement for Apache's Cassandra database. For all intents and purposes, Scylla looks just like Cassandra to your code. So much so that Scylla even uses Cassandra's drivers. The main difference is in implementation: Scylla is written in C++ while Cassandra is written in Java. Compose's ScyllaDB is the latest version: Scylla 1.3. This version corresponds to Cassandra 2.1.8 with a detailed compatibility matrix here . One of the benefits of mimicking Cassandra is that the tool chain, drivers, and built-in query language, cql, are already mature since they have evolved through multiple iterations and a great deal of use. The number of drivers on Planet Cassandra, all of which are compatible, is far beyond a typical 1.x project. cql, a SQL-like language, has grown into being the de facto way to interact with Scylla/Cassandra, and it even has its own shell, cqlsh, similar to many SQL shells for RDBMSs. What follows is a brief run-through of some of the highlights of connecting to ScyllaDB on Compose. After creating a deployment, we will look at getting connected with cqlsh, then we will review connecting on the JVM, Python, and NodeJS runtimes to go over the basics of getting started. CONNECT WITH CQLSH Assuming you already have a Compose account (if not, you can get a 30 day free trial here ), creating a deployment of ScyllaDB is little more than hitting ""Create Deployment"" and choosing ""ScyllaDB"". After a couple of minutes, a three node cluster will have been created for you: The ""Overview"" page has all of the information needed to connect to your new Scylla cluster. The easiest way to verify your deployment and your tools is to connect directly with cqlsh. Depending on your platform there are multiple ways to get this tool onto your local device, whether that be your laptop, a cloud VM, or even your own dedicated hardware. The easiest is to just install the latest Cassandra release (the latest versions still support version 2.1.8, which is what Scylla is) and use the built-in cqlsh. On a Mac with homebrew, it is nothing more than brew install cassandra . For others there are myriad ways, from package managers to straight downloads. Use whatever suits your platform best.
From the ""Overview"" page it is easy to copy the cmd (any one of them will work): and then just paste it into your shell to execute it: If you type HELP you can see that the shell has a lot of capability. What's even nicer is that all of those commands have TAB completion too. Let's try it. Type CREATE KEYSPACE my_new_keyspace you should see the choices for the replication class. Go ahead and choose SimpleStrategy since the cluster won't be spanning multiple data centers. Hit again and enter in 3 for the replication_factor. Then close the brace with } and finish the statement with ; . You just created your first KEYSPACE and defaulted it to replicating your data to all three nodes in your cluster. Now that you have a keyspace let's use it: USE my_new_keyspace; Your shell will show that your command prompt is using your keyspace by default: Every table has to have a keyspace and when we create one in the shell here it will default to my_new_keyspace . While Scylla/Cassandra has evolved into having a schema language that looks very similar to SQL. It's not really the case. Unlike an RDBMS, a row here is much more like a key value lookup. It just so happens that the value has a flexible schema which we are about to define: CREATE TABLE my_new_table ( my_table_id uuid, last_name text, first_name text, PRIMARY KEY(my_table_id) ); Type that CREATE TABLE command in your cqlsh to give us a place to populate with the following examples. CONNECT FROM THE JVM One of the most advanced drivers for Cassandra is the Java driver. This makes sense considering Cassandra is written in Java. What follows is a Groovy script. For those who utilize just about any JVM language translating from Groovy to your language of choice should be relatively straightforward: @Grab('com.datastax.cassandra:cassandra-driver-core:3.1.0') @Grab('org.slf4j:slf4j-log4j12') import com.datastax.driver.core.BoundStatement import com.datastax.driver.core.Cluster import com.datastax.driver.core.Host import com.datastax.driver.core.PreparedStatement import com.datastax.driver.core.Row import com.datastax.driver.core.Session import static java.util.UUID.randomUUID Cluster cluster = Cluster.builder() .addContactPointsWithPorts( new InetSocketAddress(""aws-us-east-1-portal9.dblayer.com"", 15399 ), new InetSocketAddress(""aws-us-east-1-portal9.dblayer.com"", 15401 ), new InetSocketAddress(""aws-us-east-1-portal6.dblayer.com"", 15400 ) ) .withCredentials(""scylla"", ""XOEDTTBPZGYAZIQD"") .build() Session session = cluster.connect(""my_new_keyspace"") PreparedStatement myPreparedInsert = session.prepare( """"""INSERT INTO my_new_table(my_table_id, last_name, first_name) VALUES (?,?,?)"""""") BoundStatement myInsert = myPreparedInsert .bind(randomUUID(), ""Hutton"", ""Hays"") session.execute(myInsert) session.close() cluster.close() To get started we pull in the latest Cassandra driver: @Grab('com.datastax.cassandra:cassandra-driver-core:3.1.0') After all of the imports we use a Cluster.builder() to build up the configuration. Just one of the ContactPoint s is used to connect. From that connection the other nodes in the cluster are discovered. If that ContactPoint is unreachable on connect then another is used which is why we add all three. PreparedStatement s may be familiar since they are analogous to other DBs' features of the same name. The statement is parsed and held at the server ready to be used over and over again. The following calls to bind and execute populate and send the data over to the server for actual execution. 
While there are simpler methods for one off execution, it is good to highlight such a useful feature. To prove that the script works go back to your cqlsh and query the table: CONNECT FROM PYTHON Support for languages other than Java is very solid too. Python is a great example. cqlsh is even written in Python. So make no mistake the support here is more than up to date: pip install cassandra-driver The above pulls in the driver with a python package manager pip . The following performs very similarly to the Java code of preparing a statement and executing an insert: from cassandra.cluster import Cluster from cassandra.auth import PlainTextAuthProvider import uuid auth_provider = PlainTextAuthProvider( username='scylla', password='XOEDTTBPZGYAZIQD') cluster = Cluster( contact_points = [""aws-us-east-1-portal9.dblayer.com""], port = 15401, auth_provider = auth_provider) session = cluster.connect('my_new_keyspace') my_prepared_insert = session.prepare("""""" INSERT INTO my_new_table(my_table_id, first_name, last_name) VALUES (?, ?, ?)"""""") session.execute(my_prepared_insert, [uuid.uuid4(), 'Snake', 'Hutton']) To verify again we'll run the same SELECT : CONNECT FROM NODEJS Last but not least: Javascript. npm install cassandra-driver npm install uuid We use the ubiquitous node package manager (npm) to install the driver and the needed uuid library. The very similar code to the above examples follows: var cassandra = require('cassandra-driver') var authProvider = new cassandra.auth.PlainTextAuthProvider('scylla', 'XOEDTTBPZGYAZIQD') var uuid = require('uuid') client = new cassandra.Client({ contactPoints: [ ""aws-us-east-1-portal9.dblayer.com:15399"", ""aws-us-east-1-portal9.dblayer.com:15401"", ""aws-us-east-1-portal6.dblayer.com:15400"" ], keyspace: 'my_new_keyspace', authProvider: authProvider}); client.execute(""INSERT INTO my_new_table(my_table_id, first_name, last_name) VALUES(?,?,?)"", [uuid.v4(), ""V8"", ""Hutton""], { prepare: true }, function(err, result) { if(err) { console.error(err); } console.log(""success"") }); Once again we connect, prepare, and execute an insert statement. And finally we verify: MoreThere is so much more to ScyllaDB. Modelling data from queries first. User defined data types. Tunable consistency. Building databases without joins. Timestamps. Architecting an app with eventual consistency. CAP theorem. PACELC theorem. Dynamo and BigTable. On and on... The flexible availability guarantees of ScyllaDB/Cassandra really are a great tool and plumbing the depths of how to make them work well can take some time. We at Compose though are excited about ScyllaDB and look forward to seeing what you can do with such a great new database. Image by Margarida CSilva Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Hays Hutton is a spirit runner. Love this article? Head over to Hays Hutton’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB Enterprise Add-ons * Deployments AWS DigitalOcean SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. 
© 2016 Compose",Getting started with ScyllaDB is easy since it is a drop in replacement for Apache's Cassandra database.,Getting Started with Compose's ScyllaDB,Live,42 115,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * DEEP LEARNING WITH TENSORFLOW The majority of data in the world is unlabeled and unstructured. Shallow neural networks cannot easily capture relevant structure in, for instance, images, sound, and textual data. Deep networks are capable of discovering hidden structures within this type of data. In this TensorFlow course you'll use Google's library to apply deep learning to different data types in order to solve real world problems. Login to EnrollTELL YOUR FRIENDS * * * * * Course code: ML0120EN * Audience: Anyone interested in Machine Learning, Deep Leaning and TensorFlow * Course level: Advanced * Time to complete: 10 Hours * Learning path: Deep Learning This Deep Learning with TensorFlow course focuses on TensorFlow. If you are new to the subject of deep learning, consider taking our Deep Learning 101 course first. Traditional neural networks rely on shallow nets, composed of one input, one hidden layer and one output layer. Deep-learning networks are distinguished from these ordinary neural networks having more hidden layers, or so-called more depth. These kind of nets are capable of discovering hidden structures within unlabeled and unstructured data (i.e. images, sound, and text), which consitutes the vast majority of data in the world. TensorFlow is one of the best libraries to implement deep learning. TensorFlow is a software library for numerical computation of mathematical expressional, using data flow graphs. Nodes in the graph represent mathematical operations, while the edges represent the multidimensional data arrays (tensors) that flow between them. It was created by Google and tailored for Machine Learning. In fact, it is being widely used to develop solutions with Deep Learning. In this TensorFlow course, you will be able to learn the basic concepts of TensorFlow, the main functions, operations and the execution pipeline. Starting with a simple “Hello Word” example, throughout the course you will be able to see how TensorFlow can be used in curve fitting, regression, classification and minimization of error functions. This concept is then explored in the Deep Learning world. You will learn how to apply TensorFlow for backpropagation to tune the weights and biases while the Neural Networks are being trained. Finally, the course covers different types of Deep Architectures, such as Convolutional Networks, Recurrent Networks and Autoencoders. Course Syllabus Module 1 – Introduction to TensorFlow * HelloWorld with TensorFlow * Linear Regression * Nonlinear Regression * Logistic Regression * Activation Functions Module 2 – Convolutional Neural Networks (CNN) * CNN History * Understanding CNNs * CNN Application Module 3 – Recurrent Neural Networks (RNN) * Intro to RNN Model * Long Short-Term memory (LSTM) * Recursive Neural Tensor Network Theory * Recurrent Neural Network Model Module 4 - Unsupervised Learning * Applications of Unsupervised Learning * Restricted Boltzmann Machine * Collaborative Filtering with RBM Module 5 - Autoencoders * Introduction to Autoencoders and Applications * Autoencoders * Deep Belief Network GENERAL INFORMATION * This TensorFlow course is free. * This course if with Python language. * It is self-paced. 
* It can be taken at any time. * It can be audited as many times as you wish. RECOMMENDED SKILLS PRIOR TO TAKING THIS COURSE * Neural Network REQUIREMENTS * Python programming COURSE STAFF Saeed Aghabozorgi , PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets. BIG DATA UNIVERSITY COURSE DEVELOPMENT TEAM Thanks to BDU course developement team, BDU interns and all individuals contributed to the development of this course: Kiran Mantri, Shashibushan Yenkanchi, Jag Rangrej, Naresh Vempala, Walter Gomes, Anita Vincent, Gabriel Sousa, Francisco Magioli, Victor Costa, Erich Sato, Luis Otavio and Rafael Belo. * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * *",This free Deep Learning with TensorFlow course provides a solid introduction to the use of TensorFlow to analyze unstructured data.,Deep Learning With Tensorflow Course by Big Data University,Live,43 116,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectUNCOVER INSIGHTS ABOUT YOUR PRODUCTS HIDDEN IN STACK OVERFLOWmarkwatson / April 25, 2016As developer advocates one of our jobs is to help developers who areexperiencing issues with our products. Most developers turn to Stack Overflow to ask questions when they run into trouble (over 11.5 million questions askedto date!). We constantly monitor Stack Overflow for questions related to ourproducts, or our personal expertise, to provide as much assistance to developersas possible. We answer a lot of questions, and it’s important that we track andanalyze those questions.How do we conduct our Stack Overflow analysis? In this post we are going to showyou how to extend the Stack Overflow connector to provide real value and solvereal problems. We’ll show you how we use it to monitor the products we support,improve our responsiveness, and most importantly help our fellow developers.With 11,000,000+ questions, getting relevant Stack Overflow insights on aproduct is a challenge. We’ll show you how we do it, with our open source SimpleData Pipe app.THE STACK OVERFLOW CONNECTORIn this tutorial we showed you how to build a Simple Data Pipe Connector for Stack Overflow. Theend result was a connector that allowed users to select one of the top 30 mostactive tags on Stack Overflow and retrieve the 30 most active questions for thattag. While the tutorial served its purpose as a gentle introduction to Data PipeConnector development, we really didn’t create a connector that was all thatuseful.In this post we will show you how to extend the Stack Overflow connector to movedata that will enable us to: * Find questions that we need to answer. * Find out which of our products are most popular on Stack Overflow. * Run statistics to determine response rate, acceptance rate, etc.THE SIMPLE DATA PIPE SDKReflecting on the Stack Overflow connector we built in the previous tutorial,it’s easy to see where it was lacking: * We needed to be able to select less popular and more relevant tags, such as cloudant or apache-spark . * We needed to be able to pull more than 30 questions. * We needed to pull in the questions and the answers to those questions.It was obvious we needed to extend our connector. 
The Simple Data Pipe SDKallows us to extend almost every part of the connector, including: * Adding custom properties to the connector configuration. * Customizing the user interface for managing and running the connector. * Massaging or enhancing the data moved from the connector into Cloudant.ADDING CUSTOM PROPERTIESEvery pipe created in the Simple Data Pipe has a correspending document storedin the pipe_db database in Cloudant. This document contains information about the type of pipe(i.e., stackoverflow) and the configuration specific to that pipe. Here is asample document stored for a Stack Overflow pipe:{ ""_id"": ""fd1ffa968a467f73ce93d2a4720fdec4"", ""_rev"": ""28-666e0c198da190afafa275230e935a05"", ""connectorId"": ""stackoverflow"", ""name"": ""stackoverflow-html-tag"", ""type"": ""pipe"", ""version"": 1, ""clientId"": ""6812"", ""clientSecret"": ""ShxD2WxxxxxxSHxxJExX5x(("", ""oAuth"": { ""accessToken"": ""(R38xxxxC8WxxxPMN*Sp8Q))"" }, ""tables"": [ { ""name"": ""javascript"", ""label"": ""javascript"" }, { ""name"": ""java"", ""label"": ""java"" }, // ... { ""name"": ""html"", ""label"": ""html"" } ], ""selectedTableName"": ""html"", ""selectedTableId"": ""html""}The Simple Data Pipe SDK allows connector developers to add and access customproperties on this document. Developers can use those properties in code to makedecisions on how to retrieve data from the desired data source.To get the data that we need from Stack Overflow we are going to add three newproperties: * customTags : A comma-separated list of tags for questions that should be downloaded from Stack Overflow. * questionCount : The number of questions to download for each tag. * downloadAnswers : A boolean value specifying whether or not to download the answers for all tags.EXTENDING THE USER INTERFACEIn order to provide users the ability to specify custom values for our three newproperties ( customTags , questionCount , and downloadAnswers ) we need to make some changes to the user interface.We are going to customize the Filter page by adding a text field for users toenter the list of custom tags. This will populate our customTags property. We’ll add a pulldown with a list of paging options to populate our questionCount property. Finally, we’ll add a checkbox that will allow a user to specifywhether or not to retrieve the answers. This will set our downloadAnswers property.We start by copying the pipeDetails.tables.html page from the simple-data-pipeproject into the simple-data-pipe-stackoverflow project ( simple-data-pipe/app.templates simple-data-pipe-connector-stackoverflow/lib/templates ). We then add the following HTML:
[Filter page form markup stripped during extraction: a text field for the custom tags, a pulldown of paging options for the question count, and a ""Download Answers"" checkbox]
Our new Filter page looks like this: When a user saves their filter options we can see the three new properties added to the pipe config document in the database: { ""_id"": ""8237fa1bd2ea945cee7f89f71c1fa112"", ""_rev"": ""98-694fb538c0a10d2245658cb90c5e6c1c"", ""connectorId"": ""stackoverflow"", ... ""customTags"": ""apache-spark,cloudant,dashdb"", ""questionCount"": ""500"", ""downloadAnswers"": true } Now that we have these three properties available to us, we need to use them in our connector code. The extent of the changes is too great for this post, but we can see that these properties can be easily accessed from the pipe object passed into many of the connector functions, for example: this.fetchRecords = function(dataSet, pushRecordFn, done, pipeRunStep, pipeRunStats, pipeRunLog, pipe, pipeRunner) { var tags = pipe.customTags; var pageSize = pipe.questionCount; var downloadAnswers = pipe.downloadAnswers; //... } THE STACK OVERFLOW QUESTION DATA STRUCTURE After we update our code to use these properties and run our pipe, we can see the questions moved to Cloudant. Here is a sample question: { ""_id"": ""0506c2a366b431fbbdf939f4aae574a3"", ""_rev"": ""1-27c2ab3a997fa0cd73fd5f3cfe0168f4"", ""tags"": [ ""java"", ""nosql"", ""cloudant"" ], ""owner"": { ""user_id"": 3052176, ... }, ""is_answered"": false, ""answer_count"": 1, ... ""question_id"": 29216049, ""title"": ""Updating Cloudant database using Java"", ""body"": ""Was wondering if it possible to write code in Java that will update the entries in my Cloudant database?"", ""answers"": [ { ""owner"": { ""user_id"": 4284412, ... }, ""is_accepted"": false, ""question_id"": 29216049, ""body"": ""Yes, Its possible to write JAVA code to update entries / documents in Cloudant database. You need to use the java-cloudant driver. Please have a look at the following project on github."" ... } ], ... } As you can see, we are now retrieving and associating answers with questions. We've also highlighted a few other important fields: * tags : The tags associated with the question. * is_answered : A boolean specifying whether or not an answer has been accepted by the user who asked the question. * answer_count : The number of answers to the question. In the next section we'll use these fields to create custom queries to find the data that we need to gain greater insight into our Stack Overflow developer community. QUERYING AND ANALYZING THE STACK OVERFLOW DATA We are going to start by creating a new design document in Cloudant that will allow us to aggregate and search our Stack Overflow data. Specifically, we will create views and search indexes to: * Get the number of questions for a tag that have or have not been answered. * Get the number of questions for a tag that have or have not been accepted (by the owner of the question). * Get a list of questions for a tag that have no answers. We rolled up these views and indexes into a single design document: { ""_id"": ""_design/questions"", ""views"": { ""by_tag"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_accepted"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (doc.is_answered && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_not_accepted"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (! doc.is_answered && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_answered"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (doc.answer_count && doc.answer_count > 0 && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_not_answered"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (! doc.is_answered && doc.answer_count == 0 && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" } }, ""language"": ""javascript"", ""indexes"": { ""by_tag"": { ""analyzer"": ""standard"", ""index"": ""function (doc) {\n if (doc.tags && doc.tags.length > 0) {\n for (var i=0; i<doc.tags.length; i++) {\n index(\""tag\"", doc.tags[i]);\n index(\""answered\"", doc.answer_count > 0);\n }\n }\n}"" } } } We'll use the following views to query statistics: * questions/by_tag : This will return the total number of questions for a tag. * questions/by_tag_answered : This will return the total number of answered questions for a tag. * questions/by_tag_not_answered : This will return the total number of questions that have not been answered for a tag. * questions/by_tag_accepted : This will return the total number of accepted questions for a tag. * questions/by_tag_not_accepted : This will return the total number of questions that have not been accepted for a tag. The first thing we are going to look at is the total number of questions for tags apache-spark , cloudant , and dashdb . We'll do this by querying the questions/by_tag view. For the cloudant tag this query would look something like this: curl -X GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_view/by_tag?group=true&key=%22cloudant%22 Example response: {""rows"":[ {""key"":""cloudant"",""value"":476}]} There have been 476 questions labeled with the tag cloudant . If we run the same query for apache-spark and dashdb we can see which product is the most popular on Stack Overflow:
Tag            # Questions
apache-spark   12,521
cloudant       476
dashdb         58
Let's see how well these products are being supported by querying the questions/by_tag_answered view. curl -X GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_view/by_tag_answered?group=true&key=%22cloudant%22 Example response: {""rows"":[ {""key"":""cloudant"",""value"":427}]} 427 of the 476 questions labeled with the tag cloudant have been answered. We can also use the questions/by_tag_accepted view to find how many questions have been accepted. Here are the results for all of our three tags:
Tag            # Questions   # Answered   % Answered   # Accepted   % Accepted
apache-spark   12,521        9,617        76.8         7,376        58.9
cloudant       476           427          89.7         339          71.2
dashdb         58            53           91.4         39           67.2
As you can see, around 90% of questions tagged with cloudant or dashdb have been answered, while over 23% of questions tagged with apache-spark have gone unanswered. So, let's see if we can find a few of these questions and start answering them. In the design document we created the following search indexes: * questions/by_tag : This will return all of the questions that have a tag that matches our query. * questions/by_tag_answer_status : This will return all of the questions that have a tag that matches our query and match our answered parameter. We can query the questions/by_tag_answer_status index passing in the tag and the answered: parameter set to false , as follows: curl -X GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_search/by_tag_answer_status?q=tag:%22apache-spark%22+AND+answered:false&include_docs=true&limit=2 In this example we have limited our search results to two. The result is two questions without answers : { ""total_rows"":2904, ... ""rows"":[ { ""id"":""f74f323a1c531ef4c5ef6faf3fe2e074"", ""order"":[ 3.3885726928710938, 6 ], ""fields"":{}, ""doc"":{ ""_id"":""f74f323a1c531ef4c5ef6faf3fe2e074"", ""_rev"":""1-5c6a960c4457a7382cbb0729c0844137"", ""tags"":[ ""apache-spark"" ], ""owner"":{ ""reputation"":24, ""user_id"":1935652, ...
}, ""is_answered"":false, ""view_count"":3, ""answer_count"":0, ""score"":1, ""last_activity_date"":1460620372, ""creation_date"":1460620372, ""question_id"":36616897, ""link"":""http://stackoverflow.com/questions/36616897/task-data-locality-no-pref-when-is-it-used"", ""title"":""Task data locality NO_PREF. When is it used?"", ""body"":""According to Spark doc, there are 5 levels of data locality..."", ... } }, { ""id"":""d173ca7647eac111020df96c264137bc"", ""order"":[ 3.241180419921875, 26 ], ""fields"":{}, ""doc"":{ ... ""tags"":[ ""apache-spark"" ], ""owner"":{ ""reputation"":143, ""user_id"":5245972, ... }, ""is_answered"":false, ""view_count"":12, ""answer_count"":0, ""score"":0, ""last_activity_date"":1460378894, ""creation_date"":1460378894, ""question_id"":36549142, ""link"":""http://stackoverflow.com/questions/36549142/can-i-use-checkpoint-for-spark-in-this-way"", ""title"":""Can I use checkpoint for Spark in this way?"", ""body"":""The spark doc said about checkpoint..."", ... } } ]}From here we can copy the link for a question, go to the Stack Overflow site,and try to help out another developer in need of assistance.CONCLUSION AND NEXT STEPSUsing the Simple Data Pipe SDK to extend our Stack Overflow connector, we havebeen able to gain real insights into how we support developers. We did this byextending the user interface of our basic Stack Overflow connector to give usthe ability to choose more relevant data to download. We added new properties toour connector config that we were able to access immediately in code and withoutdatabase schema changes. Finally, we created views and search indexes inCloudant to retrieve important statistics and unanswered questions quickly andefficiently.We’ve barely scratched the surface with what we can do with this data. Here aresome potential next steps: * Create a dashboard for viewing and sharing these statistics. * Create an interface for searching previous answers or unanswered questions. * Integrate user information to find the users in our group who are answering the most questions, have the highest % of accepted questions, etc.You can access the Stack Overflow connector on github at https://github.com/ibm-cds-labs/simple-data-pipe-connector-stackoverflow .For more information about the Simple Data Pipe and Simple Data Pipe connectors start here .SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
",Learn how to use IBM Bluemix and the Simple Data Pipe example app to conduct Stack Overflow analysis on how well users support certain tech products.,Uncover Product Insights Hidden in Stack Overflow,Live,44 120,"Build a custom library for Apache® Spark™ and deploy it to a Jupyter Notebook. New to developing applications with Apache® Spark™? This is the tutorial for you. It provides the end-to-end steps needed to build a simple custom library for Apache® Spark™ (written in scala ) and shows how to deploy it on IBM Analytics for Apache Spark for Bluemix , giving you the foundation you need to build real-life production applications. In this tutorial, you'll learn how to: 1. Create a new Scala project using sbt and package it as a deployable jar. 2. Deploy the jar into a Jupyter Notebook on Bluemix. 3. Call the helper functions from a Notebook cell. 4. Optional: Import, test and debug your project in Scala IDE for Eclipse. REQUIREMENTS To complete these steps you need to: * be familiar with the scala language and jupyter notebooks . * download scala runtime 2.10.4 . * download homebrew . * download scala sbt (simple build tool). CREATE A SCALA PROJECT USING SBT There are multiple build frameworks you can use to build Apache® Spark™ projects. For example, Maven is popular with enterprise-build engineers. For this tutorial, we chose SBT because setup is fast and it's easy to work with. The following steps guide you through creation of a new project. Or, you can directly download the code from this Github repository : 1. Open a terminal or command line window. cd to the directory that contains your development project and create a directory named helloSpark : mkdir helloSpark && cd helloSpark 2. Create the recommended directory layout for projects built by Maven or SBT by entering these 3 commands: mkdir -p src/main/scala mkdir -p src/main/java mkdir -p src/main/resources 3. In the src/main/scala directory, create a subdirectory that corresponds to the package of your choice, like mkdir -p com/ibm/cds/spark/samples . Then, in that directory, create a new file called HelloSpark.scala and in your favorite editor, add the following content to it: package com.ibm.cds.spark.samples import org.apache.spark._ object HelloSpark { //main method invoked when running as a standalone Spark Application def main(args: Array[String]) { val conf = new SparkConf().setAppName(""Hello Spark"") val spark = new SparkContext(conf) println(""Hello Spark Demo. Compute the mean and variance of a collection"") val stats = computeStatsForCollection(spark) println("">>> Results: "") println("">>>>>>>Mean: "" + stats._1) println("">>>>>>>Variance: "" + stats._2) spark.stop() } //Helper method that builds a collection and returns its mean and variance def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int = 5 ): (Double, Double) = { val rdd = spark.parallelize( 1 until countPerPartitions * partitions, partitions ) (rdd.mean(), rdd.variance()) } } 4. Create your sbt build definition. To do so, in your project root directory, create a file called build.sbt and add the following code to it: name := ""helloSpark"" version := ""1.0"" scalaVersion := ""2.10.4"" libraryDependencies ++= { val sparkVersion = ""1.3.1"" Seq( ""org.apache.spark"" %% ""spark-core"" % sparkVersion, ""org.apache.spark"" %% ""spark-sql"" % sparkVersion, ""org.apache.spark"" %% ""spark-repl"" % sparkVersion ) } The libraryDependencies line tells sbt to download the specified spark components. In this example, we specify dependencies to spark-core, spark-sql, and spark-repl, but you can add more spark component dependencies. Just follow the same pattern, like: spark-mllib, spark-graphx, and so on. Read detailed documentation on sbt build definition . 5.
From the root directory of your project, run the following command: sbt update . This command uses Apache Ivy to compute all the dependencies and download them in your local machine at /.ivy2/cache directory. 6. Compile your source code by entering the following command: sbt compile 7. Package your compiled code as a jar by entering the following command: sbt package . You should see a file named hellospark 2.10-1.0.jar in your project root directory’s target/scala-2.10 directory. (Terminal tells you where it saved the package.) The namingconvention for the jar file is:projectName scala version-project version .jar hellospark 2.10-1.0 .jarDEPLOY YOUR CUSTOM LIBRARY JAR TO A JUPYTER NOTEBOOKWith your custom library built and packaged, you're ready to deploy it to aJupyter Notebook on Bluemix. 1. If you haven't already, sign up for Bluemix , IBM's open cloud platform for building, running, and managing applications. 2. In Bluemix, initiate the IBM Analytics for Apache Spark service. 1. In the top menu, click Catalog . 2. Under Data and Analytics , find Apache Spark . 3. Click to open it, and click Create . 3. Get the deployable jar on a publicly available url by doing one of the following: * Upload the jar into a github repository. Note the download URL. You'll use in Step 5 to deploy the jar into the IBM Analytics for Apache Spark Service. * Or, you can use our sample jar, which is pre-built and posted here on github . 4. Create a new Scala notebook. 1. In Bluemix, open your Apache Spark service. 2. If prompted, open an existing instance or create a new one. 3. Click New Notebook . 4. Enter a Name , and under Language select Scala . Click Create Notebook . 5. In the first cell, enter and run the following special command called AddJar to upload the jar to the IBM Analytics for Spark service. Insert the URL of your jar. %AddJar https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/hellospark_2.10-1.0.jar -f That % before AddJar is a special command, which is currently available, but may be deprecated in an upcoming release. We'll update this tutorial at that time. The -f forces the download even if the file is already in the cache. Now that you deployed the jar, you can call APIs from within the Notebook.CALL THE HELPER FUNCTIONS FROM A NOTEBOOK CELLIn the notebook, call the code from the helloSpark sample library. In a newcell, enter and run the following code:val countPerPartitions = 500000var partitions = 10val stats = com.ibm.cds.spark.samples.HelloSpark.computeStatsForCollection( sc, countPerPartitions, partitions)println(""Mean: "" + stats._1)println(""Variance: "" + stats._2)Final results in your Bluemix Jupyter Notebook look like this:OPTIONAL: IMPORT, TEST, AND DEBUG YOUR PROJECT IN SCALA IDE FOR ECLIPSEIf you want to get serious and import, test, and debug your project in a localdeployment of Apache® Spark™, follow these steps for working in Eclipse. 1. Download the Scala IDE for Eclipse . (Note that you can alternatively use the Intellij scala IDE but it's easier to follow this tutorial with Scala IDE for Eclipse) 2. Install sbteclipse (sbt plugin for Eclipse) with a simple edit to the plugins.sbt file, located in ~/.sbt/0.13/plugins/ (If you can't find this file, create it.) Read how to install . 3. Configure Scala IDE to run with Scala 2.10.4 Launch Eclipse and, from the menu, choose Scala IDE Preferences . Choose Scala Installations and click the Add button. Navigate to your scala 2.10.4 installation root directory, select the lib directory, and click Open . 
Name your installation (something like 2.10.4 ) and click OK . Click OK to close the dialog box. 4. Generate the eclipse artifacts necessary to import the project into Scala IDE for eclipse. Return to your Terminal or Command Line window. From your project's root directory use the following command: sbt eclipse . Once done, verify that .project and .classpath have been successfully created. 5. Return to Scala IDE, and from the menu, choose File Import . In the dialog that opens, choose General Existing Projects into Workspace . 6. Beside Select root directory , click the Browse button and navigate to the root directory of your project, then click Finish : 7. Configure the scala installation for your project. The project will automatically compile. On the lower right of the screen, on the Problems tab, errors appear, because you need to configure the scala installation for your project. To do so, right-click your project and select Scala Set the Scala installation . In the dialog box that appears, select 2.10.4 (or whatever you named your installation). Click OK and wait until the project recompiles. On the Problems tab, there are no errors this time. 8. Export the dependency libraries. (This will make it easier to create the launch configuration in the next step). Right-click on the helloSpark project, and select Properties . In the Properties dialog box, click Java Build Path . The Order and Export tab opens on the right. Click the Select All button and click OK : 9. Create a launch configuration that will start a spark-shell. 1. From the menu, choose Run Run Configurations . 2. Right-click Scala Application and select New . 3. In Project , browse to your helloSpark project and choose it. 4. In Main Class , type org.apache.spark.deploy.SparkSubmit 5. Click the Arguments tab, go to the Program Arguments box, and type: --class org.apache.spark.repl.Main spark-shell Then within VM Arguments type: -Dscala.usejavacp=true -Xms128m -Xmx800m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=64m 10. Click Run . 11. Configuration runs in the Console and completes with a scala prompt. Now you know how to run/debug a spark-shell from within your developmentenvironment that includes your project in the classpath, which can then becalled from the shell interpreter.You can also build a self-contained Apache® Spark™ Application and run it manually using spark-submit or scheduling. The sample code thatcomes with this tutorial is designed to run both as an Apache® Spark™Application and a reusable library. If you want to run/debug the applicationfrom within the Scala IDE, then you can follow the same steps as above, but inStep 9e, replace the call in the Program Arguments box with the fully qualifiedname of your main class, like --class com.ibm.cds.spark.samples.HelloSpark spark-shellSUMMARYYou just learned how to build your own library for Apache® Spark™ and share itvia Notebook on the cloud. You can also manage your project with the import,test, and debug features of Scala IDE for Eclipse.Next, move on to my Sentiment Analysis of Twitter Hashtags tutorial, which uses Apache Spark Streaming in combination with IBM Watson totrack how a conversation is trending on Twitter. In future tutorials, we'll diveinto more sample apps that cover more on Spark SQL, Spark Streaming, and otherpowerful components that Spark has to offer.© “Apache”, “Spark,” and “Apache Spark” are trademarks or registered trademarksof The Apache Software Foundation. 
All other brands and trademarks are theproperty of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Build a custom library for Apache® Spark™ and deploy it to a Jupyter Notebook.,Start Developing with Spark and Notebooks,Live,45 121,"REAL-TIME Q&A APP WITH RETHINKDB Matt Collins / September 13, 2016One of my first tasks here at IBM Cloud Data Services was to blog about my first impressions of RethinkDB . One thing that article briefly touched upon was RethinkDB’s unique ability of being able to push updates to your app as and when the data changes — making it a strong contender to be your database of choice when building real-time apps. This was something that got my attention back then, and it’s about time we revisited RethinkDB’s push functionality to see just how easy it is to build a real-time app. THE CHALLENGE As a Developer Advocate, a large part of my job is going out into the community and delivering talks on a range of topics. These talks invariably end with a Q&A session where I can clear up anything that was confusing or misunderstood. Live Q&As seem like a good use-case for building a real-time app: allow attendees to ask questions from their smartphone during the talk, and vote on which questions they want answered. We can then update the list of questions on-screen in real time, showing the most popular questions and any answers that surfaced during the talk. The live Q&A app we’ll build using Node.js and RethinkDB. TOOLS Node.js is what I am going to use to build this app. Node’s ability to deal with a large amount of concurrent connections is something that stands it in good stead when building real-time apps. Although it’s not something we need to consider in this post, it’s good to design for future scale. RethinkDB is our database. As mentioned above, we are going to be making use of the changefeeds functionality to push any changes to our app as and when they happen. We will also be looking at ReQL and how that works with Node.js. You can start up a free RethinkDB instance with Compose to get you going. For the front end we will be using Vue.js and Bootstrap . Vue is one of many Javascript Frameworks for building front ends. I prefer it to something like Angular, as I think Vue is easier to get up and running. Bootstrap, of course, is the popular HTML/CSS framework from Twitter. We still need to be able to get this data from our app to the front end, and to do that we will be using Socket.IO . This is a simple way to implement WebSockets in your Node.js app, but it will also take care of any cross-browser/platform issues for you. We will also use this Socket.IO component for Vue, so that we can easily incorporate Socket.IO into our Vue app. SET UP All the code below can be found in the rethinkdb-questions GitHub repository for this article. Clone the repo and npm install to get all of the dependencies and follow along below! app.js is the brains of our whole app. We are using Express to help us get up and running quickly. 
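To give a feel for how that file opens, here is a hedged sketch of its first few lines — the exact layout in the repository may differ, and the Compose host, port and authKey shown are placeholders rather than real values:
// dependencies: Express for routing and the official RethinkDB driver
var express = require('express');
var r = require('rethinkdb');
var app = express();

// connection object for a local RethinkDB instance...
var connection = { host: 'localhost', port: 28015 };

// ...or for a hosted Compose deployment (placeholders only)
// var connection = { host: 'aws-us-east-1-portal.0.dblayer.com', port: 10000, authKey: 'YOUR_AUTH_KEY' };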
The first portion of the file is just including dependencies and so on. Once we get to line 25 or so, we have some configuration to do. We need to define a connection object for RethinkDB. There are two connection objects defined in the code at the moment: one if you are using a local RethinkDB instance, and one if you are using a hosted instance via Compose. Uncomment the one you wish to use, and if you are using Compose, make sure you enter your connection details! On the topic of Compose connection details for RethinkDB, note that Compose's Deployment Overview gives you a proxy connection string that is similar to — yet subtly different from — the URL for the RethinkDB admin UI. It's the difference of a single . , so make sure you're using the right string for your host. We will use this connection object every time we create a RethinkDB connection. RETHINKDB AND REQL Before we dive into the code we should take a bit of time to talk about ReQL, the query language for RethinkDB. When you create a connection to RethinkDB, this is a permanent, socket-like connection that will stay open until it is closed by the application. This is useful for a couple of reasons: * It allows RethinkDB to return a cursor instead of a dataset to allow us to iterate through the data in an efficient manner * We can push updates down this connection using changefeeds We get this cursor by querying the database using ReQL, which is RethinkDB's native query language. It is designed to embed itself into your code — i.e., if you're building your app in JavaScript, your ReQL looks like JavaScript — so that it feels familiar and comfortable to the developer. It also fits into the standard coding patterns of whichever language you are using. It is important to know that even though the query looks like JavaScript (or Ruby, Python, etc.), none of the heavy lifting is being performed in JavaScript, or even by your app. What is happening in the background is that your ReQL expression is being compiled down into a query that RethinkDB understands, sent to the server, and then executed in a distributed fashion across the whole cluster — allowing for performant querying of large datasets. That being said, let's look at how we can use ReQL in our app. API ENDPOINTS We will start with the API, or the back end of the app. The API is going to provide the front end with the ability to get our questions data from the database, as well as add new questions and update existing questions. The API consists of a collection of Express routes that we will define, which will in turn query our RethinkDB instance. We won't cover how Express routing works today, however there is a very simple guide on the Express website that should help if you're unfamiliar. ADDING NEW QUESTIONS Having a real-time Q&A app is no good if we have no questions, so the first thing we need to do is create a way to add some. This is done using the POST /question endpoint. // Create a new question app.post(""/question"", bpJSON, bpUrlencoded, (req, res) => { var question = { question: req.body.question, score: 1, answer: """" } // connect, run the insert, then report success and close the connection r.connect(connection, function(err, conn) { r.table(""questions"").insert(question).run(conn, (err, cursor) => { res.json({ success: !err }) conn.close() }) }) }) We create our question object using the question parameter provided as part of the request. Then we connect to the database, build up our query in ReQL, and run the query.
The ReQL portion of this code is here: r.table(""questions"").insert(question) Let's take a minute to examine what this query is doing: * r is the RethinkDB namespace * We can tag on the table(""questions"") method (just like JavaScript, remember) to select our desired table * And then, we can use the insert(question) method to say we wish to insert a new document, passing in our question object. Simple, huh? Once the query completes, we return a simple JSON response, just to signify whether this request was successful or not, and close our connection to the database. It's important to close your RethinkDB connection, as you don't want unused, open connections consuming resources. GETTING QUESTION DATA Now that we have some questions, we probably want to be able to get them back, right? The GET /questions endpoint is designed to do just that – return all existing questions from the RethinkDB database in one go. // Get all questions app.get(""/questions"", (req, res) => { r.connect(connection, (err, conn) => { r.table(""questions"").run(conn, (err, cursor) => { // turn the cursor into an array, send it to the client, and close the connection cursor.toArray((err, results) => { res.json(results) conn.close() }) }) }) }) All we are doing here is connecting to RethinkDB, using ReQL to ask for everything from the questions table, and transforming the full dataset into an array which is then sent back to the client in the response. In the meantime, we close our connection to RethinkDB. Again, the ReQL portion of this code is here: // equivalent to SELECT * FROM questions r.table(""questions"") We previously touched upon the fact that we don't get a plain dataset back; instead, we receive a cursor. In this instance, we don't want a cursor. So in the Get all questions snippet we use the toArray() method to return our full dataset. UPDATING OUR QUESTIONS We mentioned before that we wanted our users to be able to vote on questions. We actually have two endpoints for voting: POST /upvote/:id and POST /downvote/:id , which will either add or subtract 1 from the score of a question. Here's what the ReQL looks like: // get by ID // update score to score+1 // default of 1 if no score set r.table(""questions"").get(req.params.id).update({ score: r.row(""score"").add(1).default(1) }) In English, we are saying: * Get a question by a specified ID * Update this question * Set the value of score to be score+1 * If score does not currently exist, then set it to 1. This is how we are upvoting a question. Similarly, we use .sub(1) in the downvote endpoint. Finally, a question is no use without an answer. We add answers using POST /answer/:id . This process is similar to changing the score of a question. All we need to do is find our question by its unique ID and update it to include an answer that is provided via this request: // get by ID // update answer r.table(""questions"").get(req.params.id).update({ answer: req.body.answer }) FRONT END Next, we need to define some routes for Express to allow us to access our app via a web front end. * GET / is the homepage and will be used to display our questions & answers * GET /answer is identical to the homepage, but will allow an administrator to answer questions Both of these endpoints will return index.html , which is where we will create our front end. Once jQuery has told us that our document is ready to go, we define our Vue app with the app variable. This is where we can define our data model on the client side, along with a bunch of methods we can call and handlers for our Socket.IO events.
app = new Vue({ el: '#app', // the HTML element that this Vue app relates to data: { questions: [] // our data model, an array of questions }, methods: { ... // a bunch of methods we have defined that can interact with our data model }, computed: { ... // some computed values that we can use }, sockets:{ ... // socket event handlers } }) After defining our app, we call the app.getQuestions() method, which will hit the GET /questions endpoint, retrieving all of our questions data. We can then store this data in our data model at app.questions . It is easy to define how we want our data to be displayed with Vue. The app that we defined above relates to everything that is defined inside div#app .
[div#app markup stripped during extraction: a button that toggles the ""ask a question"" form, the form itself, and a container div that houses the questions]
In here we have a button, which toggles a form that we can use to ask a new question. Below the form, we have another div that will house all of our questions. We can do some powerful things like iterate through our questions array to render a new div for each question. The example below is saying “iterate through the sortedQuestions computed array, and create a new div for each element, exposing this element as question “.
[Question list markup stripped during extraction: a div rendered for each element of sortedQuestions, exposing the current element as question]
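For reference, sortedQuestions is one of the computed values and getQuestions one of the methods that were collapsed in the Vue skeleton earlier. The real implementations live in the front-end scripts of the repository; the following is only a sketch, assuming questions are ordered by score and fetched with jQuery:
methods: {
    // ask the GET /questions endpoint for everything and store it in the data model
    getQuestions: function() {
        var self = this;
        $.getJSON('/questions', function(data) {
            self.questions = data;
        });
    }
},
computed: {
    // highest-scoring questions first (the ordering here is an assumption)
    sortedQuestions: function() {
        return this.questions.slice().sort(function(a, b) {
            return b.score - a.score;
        });
    }
}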
We can then use the special Vue handlebars notation to define the ID of this div like so id=""{{ question.id }}"" . At any point in this div we can refer to the question variable within the handlebars notation to refer to the current question in the array. This gives us access to other properties such as score , answer , and question to help us create our question template in HTML. We can also call the methods we defined in our app by binding them to events on our elements — for example, the upvote button binds a click handler: when the button is clicked, we call the upvote() method that is defined in our Vue app, passing the unique ID of this question as the only parameter. REAL-TIME UPDATES We have a number of methods defined in general.js and questions.js on the front end that make requests back to the API endpoints we discussed earlier in the article. We pass these requests using the jQuery ajax APIs. * getQuestions() calls GET /questions * askQuestion() handles the submission of the question form to POST /question * doUpvote() calls POST /upvote/:id * doDownvote() calls POST /downvote/:id * answerQuestion() handles the submission of the answer form to POST /answer/:id On closer inspection of these functions, you might notice that when we add or update a question, there is no code to react to the response from the API and update the front end. The reason is that we want to handle any updates to the data in real time, and we want all of our clients to respond to the same stimulus — i.e., an event from our app to tell us that the data has changed in the database. This approach helps us to manage state within our real-time app. How else could it work? Well, if we have multiple ways of updating the front end — i.e., after making an API call, or after receiving a WebSocket event — then there is a greater chance of introducing bugs or inconsistencies in our code, meaning that our clients could get out of sync. If every client is reacting to data changes from a single source, then there is a much greater chance of consistent results across the whole user base. Historically, computer systems have been built with a single source of data (a database!), but this doesn't necessarily work well with data-driven, real-time apps. Traditional databases are not designed for this use-case. To get such a database to work in this context, the developer has to use an antiquated approach such as polling (repeatedly asking the database for an update). Developers could also use message queues and additional infrastructure to help manage the flow of data. Neither of these solutions, however, scales well. Scaling data-driven web apps is the problem we're trying to solve by using RethinkDB and changefeeds. Speaking of changefeeds… CHANGEFEEDS & SOCKET.IO As mentioned previously, we'll use the RethinkDB changefeed feature to get updates from our database whenever a new question is added or updated. Towards the bottom of app.js you should see some code that creates our changefeed. r.connect(connection, (err, conn) => { r.table(""questions"").changes().run(conn, (err, cursor) => { // for each update emit the data via Socket.IO cursor.each((err, item) => { // new if (item.old_val === null && item.new_val !== null) { io.emit('new', item.new_val) } // deleted else if (item.old_val !== null && item.new_val === null) { io.emit('deleted', item.old_val) } // updated else if (item.old_val !== null && item.new_val !== null) { io.emit('updated', item.new_val) } }) }) }) The first thing to note here is that we are not closing the connection to RethinkDB — this is because we want to keep the connection open so that we can continue receiving updates.
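A note on the io object used in those emit() calls: it is the Socket.IO server instance created in app.js. The exact wiring in the repository isn't reproduced here, but it is assumed to follow the standard Socket.IO-with-Express pattern, roughly:
// attach Socket.IO to the same HTTP server that serves the Express app
var http = require('http').createServer(app);
var io = require('socket.io')(http);

// port is illustrative
http.listen(3000, function() {
    console.log('listening on *:3000');
});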
Using changefeeds is similar to doing a normal query in ReQL: you just attach the .changes() method to the end of your query. We still receive a cursor, and we can use that cursor to iterate over any incoming update events. Update events look like this: { new_val: { ... }, old_val: { ... } } The new_val is what is currently being stored in the database, whilst the old_val is what was previously there. If the old_val is null , then that indicates there was no previous value and this event represents an insert of a new document. If new_val is null , then that indicates a deletion. If both new_val and old_val have data, then this signifies an update has taken place. In the code example above, you can see that we are determining which event has occurred, and that we are using Socket.IO to emit() the relevant data to the front end via WebSockets. When emitting events via Socket.IO, the first parameter is the name of the event ( new , deleted , updated ) and the second parameter is the data you wish to send. Again, we will not examine Socket.IO here, but there is an easy-to-follow article on the official website that shows how to get started with Socket.IO and Express . In our front end HTML, we have included the Socket.IO client library and a Socket.IO component for Vue: We then configure Vue to use Socket.IO and define the location of the server: // Tell Vue to use Socket.io var socketUrl = `${location.protocol}//${location.hostname}${(location.port ? ':'+location.port: '')}`; Vue.use(VueSocketio, socketUrl); We can then define event handlers for Socket.IO that will listen for events. We have three events: * new – a question has been inserted into the database * updated – a question has been updated (either answered or voted on) * deleted – a question has been deleted from the database The handlers are defined in the sockets object of the Vue app we created at the beginning of the article. The handlers are simply functions that update the data model of our app to reflect the change in our questions data. Because of Vue’s data bindings, we don’t have to do anything else — the front end will automatically update to reflect the changed data! Now, whenever the data changes within the database these updates will be reflected, in real time, within our app. Pretty cool, huh? CONCLUSION So there we have it: from nothing to a real-time Q&A app in relatively little time! What have we learnt? Simply put, RethinkDB is a great place to start building your real-time apps. It provides a single source of data to power your apps that isn’t available from other database offerings without resorting to clumsy polling or complicated architecture to manage the flow of data within your app. RethinkDB doesn’t give you everything you need to create a real-time app front-to-back. If you want to update users in real time, you still need a way to get your updates to the front end, but RethinkDB does a lot of the heavy lifting for you in a clear and succinct way. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: javascript / Node.js / nodejs / RethinkDB Please enable JavaScript to view the comments powered by Disqus. 
blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",RethinkDB's push updates makes it great for real-time apps. Here's an example built for live Q&A sessions at conferences using RethinkDB changefeeds.,Q&A Voting App with RethinkDB,Live,46 123,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * Connect CONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: 
It’s That Easy! * Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Perform Predictive Analytics and SQL Pushdown * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * REST API * Load delimited data using the REST API and cURL * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API INSTALL IBM DATABASE CONVERSION WORKBENCH Jess Mantaro / July 22, 2015See how to download and install Database Conversion Workbench, Data Studio plugin for IBM dashDB. You can also read a transcript of this video RELATED LINKS * About IBM Data Conversion Workbench * Convert IBM PureData for Analytics to dashDB * Convert data from Oracle to dashDB Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM","Watch how to download and install Database Conversion Workbench, Data Studio plugin for IBM dashDB.",Install IBM Database Conversion Workbench,Live,47 126,"Data Science Experience Datasci X * Data Science Experience Datasci X * Data Works Sign In Sign UpDOCUMENTATION * All * Get started * Analyze data * Manage data * Get started * Quick overview * Set up projects and collaborate * Known issues * FAQs * Analyze data * Notebooks * Create notebooks: overview * Sample notebooks * Parts of a notebook * Install libraries and packages * Pixiedust packageManager * Load and access data in a notebook * Visualizations * Model visualizations * Pixiedust visualizations * Brunel visualizations * RStudio * Spark overview * Manage data * Catalogs * Create a catalog * Create data assets * Create catalog projects * Add and manage collaborators * Monitor data usage and user activity * Analyze streaming data from Kafka topics Get started with IBM Data Science ExperienceGET STARTED WITH IBM DATA SCIENCE EXPERIENCE Welcome to IBM Data Science Experience (DSX). Depending on the plan you chose, your environment is set up with one or more Apache Spark instance and 5 GB or more of object storage. PROJECTS AND NOTEBOOKS If you want to jump right in, you can create projects to collaborate with other data scientists and data engineers, create and share notebooks, data sets, and data connections, or use RStudio. * To start setting up projects and collaborating, see Set up projects and collaborate . * To work with notebooks in your projects, see Create notebooks: overview . * For RStudio, see RStudio overview . COMMUNITY You can also explore the community area for curated data sets, sample notebooks, articles, and tutorials, both to learn from and to use as starting points. Figure: A sample of community cards Whenever you want to return to your homepage, click the IBM Data Science Experience button. 
Learn more: * Quick overview * Known issues * FAQs * Contact * Privacy * Terms of Use",Learn to use IBM Data Science Experience.,Data Science Experience Documentation,Live,48 127,"Compose The Compose logo Articles Sign in Free 30-day trialGEOFILE: USING OPENSTREETMAP DATA IN COMPOSE POSTGRESQL - PART II Published Mar 30, 2017 geofile openstreetmap postgis GeoFile: Using OpenStreetMap Data in Compose PostgreSQL - Part IIGeoFile is a series dedicated to looking at geographical data, its features, and uses. In today's article, we're continuing our examination of OpenStreetMap data and walking through how to incorporate other data sources. We'll also look at using PostGIS to filter our data and to find places that are within or intersect a chosen polygon. In the last GeoFile article , we looked at how to import OpenStreetMap (OSM) data into Compose PostgreSQL and ran some queries to get the most popular cuisines in Seattle. We found that coffee shops were the most popular places in the city, and we provided a top ten list of which coffee companies have the most branches in Seattle. In this article, we'll be using the same OSM data in conjunction with the Seattle Police Department's 911 call data . We'll show you how to create tables and store this data in PostgreSQL using Sequelize , a Node.js ORM for relational databases. Then we'll look at locations, areas, and reasons for 911 calls using PostGIS and then viewing them all using OpenJUMP , an open-source GIS tool. Let's look at Sequelize and import some data into our PostgreSQL deployment ... SEQUELIZE ME Sequelize is a Node.js ORM that works with a number of relational databases out of the box. For our use case, Sequelize makes it easy to perform CRUD operations and create models for our data. In particular, for the data model that we'll be creating, it comes with a geometry data type that works well with GeoJSON and PostgreSQL. What we'll be doing with Sequelize is creating a table called emergency_calls and inserting GeoJSON documents from the SPD 911 call API. The information that we'll be gathering from the API is the incident id , longitude , latitude , event_clearance_group , and event_description . The event_clearance_group and event_description provide us with details about each 911 call incident. In addition to Sequelize, we'll be using the request Node.js library. This library will allow us to gather the GeoJSON documents from the SPD 911 call API and will help us insert the documents into PostgreSQL one at a time. To install Sequelize and request , we'll write the following in our terminal using NPM. npm install sequelize request --save After installing the packages, let's create a file called 911data.js . Within the file, we'll first require both the request and sequelize libraries we installed with NPM. const request = require('request'); const Sequelize = require('sequelize'); We then set up a variable url with the URL of the API and append to the URL $$app_token and include a custom token from data.seattle.gov . You'll have to apply for a token in order to not have download limits on your data. Next, we'll use Sequelize's $offset and $limit functions to limit the number of records we'll import to our database since there are more than 1.3 million records in the SPD 911 calls dataset. In order to get the latest 911 calls, we'll offset our data by 1.3 million rows and limit our data to only the last 100,000 rows. 
const url = ""https://data.seattle.gov/resource/pu5n-trf4.geojson?$$app_token=your_token&$offset=1300000 After that, we'll initialize a database connection using our Compose PostgreSQL connection string located on the Overview page under Credentials . At the end of the connection string, we'll change the database name from compose to osm since we're inserting the 911 call records into a table located within the OSM database. const sequelize = new Sequelize(""postgres://admin:mypass@aws-us-west-4-portal.0.dblayer.com:25223/osm""); Once that's done, we can set up a Sequelize model for our data. For this example, we'll keep it simple and only get the id and the event_group and event_description , which categorize and describe each 911 call. We'll also set up a column called geom that will automatically process our GeoJSON longitude and latitude coordinates into a PostGIS geometry object. Sequelize does this by using the PostGIS function ST_GeomFromGeoJSON behind the scenes. Since we're using GeoJSON data, the PostGIS geometry object coordinate reference system will be set to SRID 4236, but the geom column set up by Sequelize will have an SRID set to 0. We'll have to change this once we've inserted our data since SRID 4236 will not work with OSM data since it uses a different SRID - this is discussed further below. To set up a model for our data, we'll first define the model using Sequelize's define method. The first argument that the method takes is the table name we want to create. For our use case, the table will be named emergency_calls . The second argument of the function is an object that contains the data type, field name, and other constraints we want to put on our columns such as defining primary keys and allowing null values. const EmergencyCalls = sequelize.define('emergency_calls', { id: { type: Sequelize.STRING, field: 'id', primaryKey: true }, eventGroup: { type: Sequelize.TEXT, field: 'event_clearance_group' }, eventDescription: { type: Sequelize.TEXT, field: 'description' }, geom: { type: Sequelize.GEOMETRY('POINT'), field: 'geom', allowNull: false } }); The field is the name we assign to the PostgreSQL table column. The type is the data type we assign to the column using any appropriate Sequelize's data type . For columns what will store data as a string, we'll use Sequelize's STRING data type. For the geom column, Sequelize has a GEOMETRY data type that allows us to assign the type of geometry to a column. In our use case, it's ""POINT"" since the GeoJSON geometry type is also ""POINT"". Sequelize also allows us to assign the GEOMETRY data type a second parameter that is a SRID number. However, since our GeoJSON data does not contain information regarding the SRID, if we define the SRID of the column and our data doesn't match when inserting a record, we'll receive an error. Therefore, to solve this problem we will not set the SRID of the column initially and we'll go back and manually change the column later using PostGIS. Now that we have the model set up, we can initialize EmergencyCalls which will create the table. Here, we'll append the sync method to create the PostgreSQL table in the database and use the force option to drop the table if it exists. EmergencyCalls .sync({ force: true }) .then(() => {}) .catch(err = Once we have the model set up and the table created, we can start importing the data into the table. We'll do this using the request library which takes a URL and a callback that contains the GeoJSON data in the body . 
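Several of the snippets in this walkthrough arrive truncated in the extracted text: the url string above stops mid-query, the .sync() chain breaks off at ".catch(err =", and the request callbacks further down are cut at "(err, res, body) = i". The following is a minimal sketch of how those pieces plausibly fit together; the $limit=100000 parameter, the error handling, and the callback bodies are assumptions inferred from the surrounding prose, not code recovered from the original article.

const request = require('request');
const Sequelize = require('sequelize');

// Assumed completion of the truncated URL: the prose says to offset by 1.3 million
// rows and keep only the last 100,000, which suggests a $limit parameter.
const url = "https://data.seattle.gov/resource/pu5n-trf4.geojson" +
            "?$$app_token=your_token&$offset=1300000&$limit=100000";

const sequelize = new Sequelize("postgres://admin:mypass@aws-us-west-4-portal.0.dblayer.com:25223/osm");

// Same model as defined above.
const EmergencyCalls = sequelize.define('emergency_calls', {
  id: { type: Sequelize.STRING, field: 'id', primaryKey: true },
  eventGroup: { type: Sequelize.TEXT, field: 'event_clearance_group' },
  eventDescription: { type: Sequelize.TEXT, field: 'description' },
  geom: { type: Sequelize.GEOMETRY('POINT'), field: 'geom', allowNull: false }
});

EmergencyCalls
  .sync({ force: true })                  // drop and recreate the table
  .then(() => {
    // Fetch the GeoJSON feed and insert one row per feature.
    request(url, (err, res, body) => {
      if (err) return console.error(err);
      const data = JSON.parse(body);
      const jsonFeatures = data.features;
      for (let i = 0; i < jsonFeatures.length; i++) {
        EmergencyCalls.create({
          id: jsonFeatures[i].properties.cad_cdw_id,
          eventGroup: jsonFeatures[i].properties.event_clearance_group,
          eventDescription: jsonFeatures[i].properties.event_clearance_description,
          geom: jsonFeatures[i].geometry
        });
      }
    });
  })
  .catch(err => console.error(err));      // assumed error handler; the original is cut off here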
We'll assign a variable called data that parses the GeoJSON data using the JSON.parse method. Then we'll take the data and iterate over the GeoJSON features array: request(url, (err, res, body) = i Within the for-loop, we'll use the Sequelize create method to insert each 911 call record into our database. We'll only select the necessary information from the GeoJSON ""Properties"" and ""Geometry"" objects and put the results into keys we created from our Sequelize model. EmergencyCalls.create({ id: jsonFeatures[i].properties.cad_cdw_id, eventGroup: jsonFeatures[i].properties.event_clearance_group, eventDescription: jsonFeatures[i].properties.event_clearance_description, geom: jsonFeatures[i].geometry }); The full request looks like: request(url, (err, res, body) = i Once we have the code set up, just run node 911data.js and we'll see the table set up and all of our data being logged in the terminal window. After the data has been inserted, let's see what it looks like in PostgreSQL by logging into our OSM database. OUR 911 DATA AND OSM WITH POSTGIS Once we've logged into our PostgreSQL deployment and connected to our OSM database, we can view what Sequelize has inserted into our emergency_calls table. Using a SELECT query our table will contain documents that look something like this: SELECT * FROM emergency_calls LIMIT 1; Running the query gives us: id | event_clearance_group | description | geom | createdAt | updatedAt -------+--------------------------+--------------------+--------------------------------------------+----------------------------+---------------------------- 89778 | SUSPICIOUS CIRCUMSTANCES | SUSPICIOUS VEHICLE | 0101000000C58EC6A17E935EC040852348A5CC4740 | 2017-03-29 20:48:10.812+00 | 2017-03-29 20:48:10.812+00 Notice that the fields that we defined and the data have been created and inserted along with two other timestamp fields created by Sequelize. These timestamps show when a record has been inserted and updated in the table. If you don't want timestamps to be added, just add timestamps: false inside the Sequelize model. To view the data type of each column run \d emergency_calls . Table ""public.emergency_calls"" Column | Type | Modifiers -----------------------+--------------------------+----------- id | character varying(255) | not null event_clearance_group | text | description | text | geom | geometry(Point) | not null createdAt | timestamp with time zone | not null updatedAt | timestamp with time zone | not null Indexes: ""emergency_calls_pkey"" PRIMARY KEY, btree (id) Here we can see that the geom column has been assigned a geometry data type without an SRID even though there are geometry objects inserted in the column. Since our geom column contains GeoJSON data in the form of a geometry object, the coordinates of the data are automatically calculated using SRID 4326 even though the column doesn't have an SRID defined. If we decided to project the geom data onto our OSM map, however, the geom points would not align with the map because OSM uses SRID 3857. So how do we solve this issue? To solve the issue we'll have to transform our geom column to SRID 3857. To do this, we'll update the column by first setting the SRID of each value to SRID 4326 by using PostGIS's ST_SetSRID function. We'll then transform each value to SRID 3857 using PostGIS's ST_Transform function. 
This can be done by writing the following SQL statement: UPDATE emergency_calls SET geom = ST_Transform(ST_SetSRID(geom,4326),3857); After that's completed, we'll create an index on the geom column: CREATE INDEX idx_emergency_calls ON emergency_calls USING GIST (geom); Using OpenJUMP, we now can view each of the points on the Seattle OSM map. Now that we have some 911 call points, an OSM map, and other OSM data, let's do some querying ... POSTGIS QUERIES In the last article, we looked at some restaurant data using OSM's hstore data column to find the top ten cuisines in Seattle. We then found out that coffee was the most popular ""cuisine"" and found the top ten coffee shops in the city. Let's take a closer look at the map this time by focusing on one particular area of Seattle called Capitol Hill. We'll start out by selecting the area and getting its coordinates. A useful tool to draw and get the coordinates of a polygon on a map is geojson.io . All we have to do is zoom on Seattle and draw a polygon around the area we want. It will automatically provide us with the coordinates of the polygon in GeoJSON in the right sidebar. Once we have the coordinates, we can use PostGIS's function ST_GeomFromText to define the shape type, its coordinates, and the SRID. In this case, since the coordinates derive from a GeoJSON object, the default SRID is 4326. This is what the function with our coordinates will look like: ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326) The value returned from ST_GeomFromText will have to be transformed so that it can be viewed correctly on the OSM map like the geom column coordinates in our emergency_calls table. To do that, we'll again use the ST_Transform function and assign it SRID 3857: ST_Transform(ST_GeomFromText(..., 4326), 3857) Once we have this setup, we can use OpenJUMP and select Run Datastore Query from the File menu. Once we select Run Datastore Query , it will open a window to create a new map layer. We first select or create a new connection to our OSM database. Then we type in the name of our layer and then write the SQL query: SELECT ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857) FROM planet_osm_polygon; Running this query we will see the polygon appear on the map. To get all the restaurants and their names that are only within the polygon, we'll use the PostGIS function ST_Contains , which selects only objects that are contained within a defined geometry. A simple way to understand how this function works is to view it as: ST_Contains(shape_to_search_in, objects_to_find_in_the_shape) Therefore, when we write this query to find all the OSM points inside the polygon, we'd write: SELECT amenity, name, way FROM planet_osm_point WHERE ST_Contains(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), way) AND amenity = 'restaurant' GROUP BY amenity, name, way; Within ST_Contains , the first geometry we add is the polygon that we want to search within. 
In this case, it's the same polygon that we created and transformed to SRID 3857. The second geometry way is the geometry column of planet_osm_point which are the OSM points we want to return if they are contained inside the polygon. Additionally, we select only the restaurant amenities so that we only get the restaurants and filter out the other points. Running this query provides us with five results: amenity | name | way ------------+-----------------------+---------------------------------------------------- restaurant | 611 Supreme | 0101000020110F000014AE4719F1F869C185EB51B86C0D5741 restaurant | Bill's Off Broadway | 0101000020110F00008FC2F5E0DBF869C1C3F528EC6B0D5741 restaurant | Fogan Cocina Mexicana | 0101000020110F00003333339BF6F869C114AE47D1760D5741 restaurant | Raygun Lounge | 0101000020110F000048E17A7402F969C11F85EB716B0D5741 restaurant | Yo! Zushi | 0101000020110F00006666660EC3F869C1295C8FD2870D5741 Using our 911 call data, we could create a more complex example using ST_Contains to show the number of 911 calls that took place near these restaurants. To so that, what we'd want to show is the name of each restaurant, the reason for the 911 call, and the distance between the 911 call event and the restaurant. Another constraint that we might add is that the 911 call has to be within a radius of 30 meters from the restaurant. This will filter out calls that are further away. This query would look similar to the following: SELECT p.name AS restaurant, e.event_clearance_group AS activity, ST_Distance(p.way, e.geom) AS distance FROM planet_osm_point AS p, emergency_calls AS e WHERE p.amenity = 'restaurant' AND e.event_clearance_group Running the query gives us 71 results like: restaurant | activity | distance -----------------------+--------------------------+------------------ 611 Supreme | ASSAULTS | 21.6197191748894 ... Fogan Cocina Mexicana | MENTAL HEALTH | 29.6527188363329 ... Raygun Lounge | SUSPICIOUS CIRCUMSTANCES | 28.8634577346799 ... Yo! Zushi | FALSE ALARMS | 27.4377722361284 The first thing to notice is that we have two ST_Contains functions being used. Each one indicates that the restaurant and the 911 call should be contained within the polygon. What's also noticeable is the other PostGIS queries that we added: ST_Distance and ST_DWithin . ST_Distance provides the distance between one geometry and the other. In this case, it shows us the distance between the 911 call and a restaurant. The function ST_DWithin returns true if two geometries are within a specified distance of each other. So, above, we are indicating in the WHERE clause that each restaurant and 911 call have to be within 30 meters of each other. Another interesting function provided by PostGIS is ST_Instersects , which is useful for when you want to see what shapes intersect with another. This function is helpful especially when we want to find roads that intersect with our polygon. Like the function ST_Contains , this function first takes the geometry of the polygon that we are searching within, and then the geometry column of the table that contains the shapes we want if they intersect with our polygon. 
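Note that the distance query above is cut off in the extracted text right after "AND e.event_clearance_group". Based on the explanation that follows it (two ST_Contains checks against the same Capitol Hill polygon plus a 30-meter ST_DWithin constraint), it plausibly reads something like the sketch below; the IS NOT NULL condition and the exact ordering of the clauses are assumptions.

SELECT p.name AS restaurant,
       e.event_clearance_group AS activity,
       ST_Distance(p.way, e.geom) AS distance
FROM planet_osm_point AS p, emergency_calls AS e
WHERE p.amenity = 'restaurant'
  AND e.event_clearance_group IS NOT NULL  -- assumed; the original clause is truncated here
  AND ST_Contains(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), p.way)
  AND ST_Contains(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), e.geom)
  AND ST_DWithin(p.way, e.geom, 30);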
A query selecting all the roads that interest with our polygon would look like the following: SELECT name, way FROM planet_osm_roads WHERE ST_Intersects(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), way) GROUP BY name, way; We first provide the coordinates of the polygon that we transformed. Then we provide the geometry column way from the planet_osm_roads table, which contains the geometries of all the roads on our map. When running the query, we'll find that there are five streets that intersect the polygon. name ----------------------------------- Broadway University Link Northbound East Pine Street University Link Southbound Seattle Streetcar First Hill Line SO MUCH MORE ... In this article, we looked at how to import and use another dataset with our OSM data. In addition, we looked at using PostGIS functions in order to modify our new dataset in order to work with OSM and to select only a portion of that data to query. While we haven't covered all of PostGIS's capabilities, this basic overview will help you start combining your own data with OSM and start using PostGIS and PostgreSQL for all your GIS needs. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Stephen Monroe Abdullah Alger is a content creator at Compose. Moved from academia to the forefront of cloud technology. Love this article? Head over to Abdullah Alger ’s author page and keep reading.RELATED ARTICLES Mar 16, 2017GEOFILE: USING OPENSTREETMAP DATA IN COMPOSE POSTGRESQL - PART I GeoFile is a series dedicated to looking at geographical data, its features, and uses. In today's article, we're going to int… Abdullah Alger Dec 15, 2016GEOFILE: POSTGIS AND RASTER DATA GeoFile is a series dedicated to looking at geographical data, its features, and uses. In this article, we'll look at raster… Abdullah Alger Oct 17, 2016GEOFILE: EVERYTHING IN THE RADIUS WITH POSTGIS GeoFile is a series dedicated to looking at geographical data, its features and uses. In this article, we build upon our last… Abdullah Alger Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company",We'll also look at using PostGIS to filter our data and to find places that are within or intersect a chosen polygon.,GeoFile: Using OpenStreetMap Data in Compose PostgreSQL - Part II,Live,49 129,"Follow Sign in / Sign up Home About Insight Data Science Data Engineering Health Data AI Never miss a story from Insight Data , when you sign up for Medium. Learn more Never miss a story from Insight Data Get updates Get updates Sebastien Dery Blocked Unblock Follow Following I don’t know what I’m doing; but then neither do you so it’s all good. Master of Layers, Protector of the Graph, Wielder of Knowledge. 
#OpenScience Oct 16 -------------------------------------------------------------------------------- GRAPH-BASED MACHINE LEARNING: PART I COMMUNITY DETECTION AT SCALE During the seven-week Insight Data Engineering Fellows Program recent grads and experienced software engineers learn the latest open source technologies by building a data platform to handle large, real-time datasets. Sebastien Dery (now a Data Science Engineer at Yewno ) discusses his project on community detection on large datasets. -------------------------------------------------------------------------------- #tltr : Graph-based machine learning is a powerful tool that can easily be merged into ongoing efforts. Using modularity as an optimization goal provides a principled approach to community detection. Local modularity increment can be tweaked to your own dataset to reflect interpretable quantities. This is useful in many scenarios, making it a prime candidate for your everyday toolbox.Many important problems can be represented and studied using graphs — social networks, interacting bacterias, brain network modules, hierarchical image clustering and many more. If we accept graphs as a basic means of structuring and analyzing data about the world, we shouldn’t be surprised to see them being widely used in Machine Learning as a powerful tool that can enable intuitive properties and power a lot of useful features. Graph-based machine learning is destined to become a resilient piece of logic, transcending a lot of other techniques. See more in this recent blog post from Google Research This post explores the tendencies of nodes in a graph to spontaneously form clusters of internally dense linkage (hereby termed “community”); a remarkable and almost universal property of biological networks. This is particularly interesting knowing that a lot of information can be extrapolated from a node’s neighbor (e.g. think recommendation system, respondent analysis, portfolio clustering). So how can we extract this kind of information? Community Detection aims to partition a graph into clusters of densely connected nodes, with the nodes belonging to different communities being only sparsely connected. Graph analytics concerns itself with the study of nodes (depicted as disks) and their interactions with other nodes (lines). Community Detection aims to classify nodes by their “clique”.“ Is it the same as clustering? ” * Short answer: Yes . * Long answer: For all intents and purposes, yes it is . So why shouldn’t I just use my good old K-Means? You absolutely should, unless your data and requirements don’t work well with that algorithm’s assumptions, namely: 1. K number of clusters 2. Sum of Squared Error (SSE) as the right optimization cost 3. All variable have the same variance 4. The variance of the distribution of each attribute is spherical For a more in-depth look click here . First off, let’s drop this idea of SSE and choose a more relevant notation of what we’re looking for: the internal versus external relationships between nodes of a community. Let’s discuss the notion of modularity. where: nc is the number of communities; lc number of edges within; dc sum of vertex degree; and m the size of the graph (number of edges). We will be using this equation as a global metric of goodness during our search for an optimal partitioning. In a nutshell: Higher score will be given to a community configuration offering higher internal versus external linkage.So all I have to do is optimize this and we’re done, right? 
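The modularity formula referred to just above appears to have been an image that did not survive extraction; only its variable definitions remain. Given those definitions (n_c communities, l_c edges inside community c, d_c the summed degree of its vertices, and m the total number of edges), the standard form it almost certainly corresponds to is

Q = \sum_{c=1}^{n_c} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^{2} \right]

which rewards partitions whose communities contain more internal edges than their degrees alone would predict, matching the "higher internal versus external linkage" criterion described above.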
A major problem in the theoretical formulation of this optimization scheme is that we need an all-knowing knowledge of the graph topology (geometric properties and spatial relations). This is rather, let’s say, intractable . Apparently we can’t do any better than to try all possible subsets of the vertices and check to see which, if any, form communities. The problem of finding the largest clique in a graph is thus said to be NP-hard . However, several algorithms have been proposed over the years to find reasonably good partitions in reasonable amounts of time, each with its own particular flavor. This post focuses on a specific family of algorithms called agglomerative . These algorithms work very simply by collecting (or merging) nodes together. This has a lot of advantages since it typically only requires a knowledge of first degree neighbors and small incremental merging steps , to bring the global solution towards stepwise equilibriums. You might point out that the modularity metric gives a global perspective on the state of the graph and not a local indicator. So, how does this translate to the small local increment that I just mentioned? The basic approach does indeed consists of iteratively merging nodes that optimize a local modularity so let’s go ahead and define that as well: where ∑ in is the sum of weighted links inside C, ∑ tot sum of weighted links incident to nodes in C, k i sum of weighted links incident to node i , k i, in sum of weighted links going from i to nodes in C and m a normalizing factor as the sum of weighted links for the whole graph. (Sorry, Medium doesn’t allow subscript and superscript)This is part of the magic for me as this local optimization function can easily be translated to an interpretable metric within the domain of your graph. For example, * Community Strength: Sum of Weighted Link within a community. * Community Popularity: Sum of Weighted Link incident to nodes within a specific community. * Node Belonging: Sum of Weighted Link from a node to a community. There’s also nothing stopping from adding more terms to the previous equation that are specific to your dataset. In other words, the weighted links can be a function of the type of nodes computed on-the-fly (useful if you’re dealing with a multidimensional graph with various types of relationships and nodes). Example of converging iterations before the Compress phaseNow that we’re all set with our optimization function and local cost, the typical agglomerative strategy consists of two iterative phases ( Transfer and Compress ). Assuming a weighted network of N nodes, we begin by assigning a different community to each node of the network. 1. Transfer : For each node i, consider its neighbors j and evaluate the gain in modularity by swapping c_i for c_j . The greedy process transfers the node into the neighboring community, maximizing the gain in modularity (assuming the gain is positive). If no positive gain is possible, the node i stays in its original community. This process is applied to all nodes until no individual move can improve the modularity (i.e. a local maxima of modularity is attained — a state of equilibrium). 2. Compress : building a new network whose nodes are the communities found during the first phase; a process termed compression (see Figure below). To do so, edge weights between communities are computed as the sum of the internal edges between nodes in the corresponding two communities. Agglomerative process: Phase one converges to a local equilibrium of local modularity. 
Phase two consist in compressing the graph for the next iteration, thus reducing the number of nodes to consider and incidentally computation time as well.Now the tricky part: as this is a greedy algorithm , you’ll have to define a stopping criteria based on your case scenario and the data at hand. How to define this criteria? It can be a lot of things: a maximum number of iterations, a minimum modularity gain during the transfer phase, or any other relevant piece of information related to your data that would inform you that it needs to stop. Still not sure when to stop ? Just make sure you save every intermediate step of the iterative process somewhere, let the optimization run until there’s only one node left in your graph, and then look back at your data! The interesting part is that by keeping track of each step, you also profit from a hierarchical view of your communities which can be further explored and leveraged. In a follow up post, I will discuss how we can achieve this on a distributed system using Spark GraphX , part of my project while at the Insight Data Engineering Fellows Program . [0803.0476] Fast unfolding of communities in large networks Abstract: We propose a simple method to extract the community structure of large networks. Our method is a heuristic… arxiv.org -------------------------------------------------------------------------------- Want to learn Spark, machine learning with graphs, and other big data tools from top data engineers in Silicon Valley or New York? The Insight Data Engineering Fellows Program is a free 7-week professional training where you can build cutting edge big data platforms and transition to a career in data engineering at top teams like Facebook, Uber, Slack and Squarespace. Learn more about the program and apply today . Big Data Data Science Machine Learning Social Network Analysis Insight Data Engineering 4 Blocked Unblock Follow FollowingSEBASTIEN DERY I don’t know what I’m doing; but then neither do you so it’s all good. Master of Layers, Protector of the Graph, Wielder of Knowledge. #OpenScience FollowINSIGHT DATA Insight Fellows Program —Your bridge to careers in Data Science and Data Engineering.",Community Detection at Scale,Graph-based machine learning,Live,50 131,"* Free 7-Day Crash Course * Blog * Masterclass MODERN MACHINE LEARNING ALGORITHMS: STRENGTHS AND WEAKNESSES EliteDataScience 0 Comments May 16, 2017 Share Google Linkedin TweetIn this guide, we’ll take a practical, concise tour through modern machine learning algorithms. While other such lists exist, they don’t really explain the practical tradeoffs of each algorithm, which we hope to do here. We’ll discuss the advantages and disadvantages of each algorithm based on our experience. Categorizing machine learning algorithms is tricky, and there are several reasonable approaches; they can be grouped into generative/discriminative, parametric/non-parametric, supervised/unsupervised, and so on. For example, Scikit-Learn’s documentation page groups algorithms by their learning mechanism . This produces categories such as: * Generalized linear models * Support vector machines * Nearest neighbors * Decision trees * Neural networks * And so on… However, from our experience, this isn’t always the most practical way to group algorithms. That’s because for applied machine learning, you’re usually not thinking, “boy do I want to train a support vector machine today!” Instead, you usually have an end goal in mind, such as predicting an outcome or classifying your observations. 
Therefore, we want to introduce another approach to categorizing algorithms, which is by machine learning task. NO FREE LUNCH In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem, and it’s especially relevant for supervised learning (i.e. predictive modeling). For example, you can’t say that neural networks are always better than decision trees or vice-versa. There are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem , while using a hold-out “test set” of data to evaluate performance and select the winner. Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn't bust out a shovel and start digging. MACHINE LEARNING TASKS This is Part 1 of this series. In this part, we will cover the ""Big 3"" machine learning tasks, which are by far the most common ones. They are: 1. Regression 2. Classification 3. Clustering In Part 2 (coming soon), we will cover more situational tasks, such as: 1. Feature Selection 2. Feature Extraction 3. Density Estimation 4. Anomaly Detection Two notes before continuing: * We will not cover domain-specific adaptations, such as natural language processing. * We will not cover every algorithm. There are too many to list, and new ones pop up all the time. However, this list will give you a representative overview of successful contemporary algorithms for each task. 1. REGRESSION Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores. Regression tasks are characterized by labeled datasets that have a numeric target variable . In other words, you have some ""ground truth"" value for each observation that you can use to supervise your algorithm. Linear Regression 1.1. (REGULARIZED) LINEAR REGRESSION Linear regression is one of the most common algorithms for the regression task. In its simplest form, it attempts to fit a straight hyperplane to your dataset (i.e. a straight line when you only have 2 variables). As you might guess, it works well when there are linear relationships between the variables in your dataset. In practice, simple linear regression is often outclassed by its regularized counterparts (LASSO, Ridge, and Elastic-Net). Regularization is a technique for penalizing large coefficients in order to avoid overfitting , and the strength of the penalty should be tuned. * Strengths: Linear regression is straightforward to understand and explain, and can be regularized to avoid overfitting. In addition, linear models can be updated easily with new data using stochastic gradient descent . * Weaknesses: Linear regression performs poorly when there are non-linear relationships. They are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming. * Implementations: Python / R 1.2. REGRESSION TREE (ENSEMBLES) Regression trees (a.k.a. decision trees) learn in a hierarchical fashion by repeatedly splitting your dataset into separate branches that maximize the information gain of each split. 
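To make the regression discussion concrete, here is a minimal sketch in scikit-learn (the Python implementation these sections link to) contrasting a regularized linear model with a tree ensemble on a toy non-linear dataset; the dataset, hyperparameters, and scores are purely illustrative assumptions, not results from this article.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Toy data with a non-linear relationship (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regularized linear regression: alpha controls the penalty strength.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Tree ensemble: often strong out-of-the-box on non-linear relationships.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# A hold-out test set, as recommended above, is what actually decides the winner.
print("Ridge R^2:        ", r2_score(y_test, ridge.predict(X_test)))
print("Random forest R^2:", r2_score(y_test, forest.predict(X_test)))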
This branching structure allows regression trees to naturally learn non-linear relationships. Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees. We won't go into their underlying mechanics here, but in practice, RF's often perform very well out-of-the-box while GBM's are harder to tune but tend to have higher performance ceilings. * Strengths: Decision trees can learn non-linear relationships, and are fairly robust to outliers. Ensembles perform very well in practice, winning many classical (i.e. non-deep-learning) machine learning competitions. * Weaknesses: Unconstrained, individual trees are prone to overfitting because they can keep branching until they memorize the training data. However, this can be alleviated by using ensembles. * Implementations: Random Forest - Python / R , Gradient Boosted Tree - Python / R 1.3. DEEP LEARNING Deep learning refers to multi-layer neural networks that can learn extremely complex patterns. They use ""hidden layers"" between inputs and outputs in order to model intermediary representations of the data that other algorithms cannot easily learn. They have several important mechanisms, such as convolutions and drop-out, that allows them to efficiently learn from high-dimensional data. However, deep learning still requires much more data to train compared to other algorithms because the models have orders of magnitudes more parameters to estimate. * Strengths: Deep learning is the current state-of-the-art for certain domains, such as computer vision and speech recognition. Deep neural networks perform very well on image, audio, and text data, and they can be easily updated with new data using batch propagation. Their architectures (i.e. number and structure of layers) can be adapted to many types of problems, and their hidden layers reduce the need for feature engineering. * Weaknesses: Deep learning algorithms are usually not suitable as general-purpose algorithms because they require a very large amount of data. In fact, they are usually outperformed by tree ensembles for classical machine learning problems. In addition, they are computationally intensive to train, and they require much more expertise to tune (i.e. set the architecture and hyperparameters). * Implementations: Python / R 1.4. HONORABLE MENTION: NEAREST NEIGHBORS Nearest neighbors algorithms are ""instance-based,"" which means that that save each training observation. They then make predictions for new observations by searching for the most similar training observations and pooling their values. These algorithms are memory-intensive, perform poorly for high-dimensional data, and require a meaningful distance function to calculate similarity. In practice, training regularized regression or tree ensembles are almost always better uses of your time. 2. CLASSIFICATION Classification is the supervised learning task for modeling and predicting categorical variables. Examples include predicting employee churn, email spam, financial fraud, or student letter grades. As you'll see, many regression algorithms have classification counterparts. The algorithms are adapted to predict a class (or class probabilities) instead of real numbers. Logistic Regression 2.1. (REGULARIZED) LOGISTIC REGRESSION Logistic regression is the classification counterpart to linear regression. Predictions are mapped to be between 0 and 1 through the logistic function , which means that predictions can be interpreted as class probabilities. 
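As a quick, hedged illustration of that probabilistic output, here is a minimal sketch using scikit-learn (the Python implementation this section links to); the synthetic data and the C value are arbitrary choices made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# C is the inverse of the regularization strength: smaller C means a stronger penalty.
clf = LogisticRegression(C=1.0).fit(X, y)

# predict_proba returns the class probabilities described above,
# i.e. values squashed into [0, 1] by the logistic function.
print(clf.predict_proba(X[:3]))
print(clf.predict(X[:3]))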
The models themselves are still ""linear,"" so they work well when your classes are linearly separable (i.e. they can be separated by a single decision surface). Logistic regression can also be regularized by penalizing coefficients with a tunable penalty strength. * Strengths: Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent. * Weaknesses: Logistic regression tends to underperform when there are multiple or non-linear decision boundaries. They are not flexible enough to naturally capture more complex relationships. * Implementations: Python / R 2.2. CLASSIFICATION TREE (ENSEMBLES) Classification trees are the classification counterparts to regression trees. They are both commonly referred to as ""decision trees"" or by the umbrella term ""classification and regression trees (CART)."" * Strengths: As with regression, classification tree ensembles also perform very well in practice. They are robust to outliers, scalable, and able to naturally model non-linear decision boundaries thanks to their hierarchical structure. * Weaknesses: Unconstrained, individual trees are prone to overfitting, but this can be alleviated by ensemble methods. * Implementations: Random Forest - Python / R , Gradient Boosted Tree - Python / R 2.3. DEEP LEARNING To continue the trend, deep learning is also easily adapted to classification problems. In fact, classification is often the more common use of deep learning, such as in image classification. * Strengths: Deep learning performs very well when classifying for audio, text, and image data. * Weaknesses: As with regression, deep neural networks require very large amounts of data to train, so it's not treated as a general-purpose algorithm. * Implementations: Python / R 2.4. SUPPORT VECTOR MACHINES Support vector machines (SVM) use a mechanism called kernels , which essentially calculate distance between two observations. The SVM algorithm then finds a decision boundary that maximizes the distance between the closest members of separate classes. For example, an SVM with a linear kernel is similar to logistic regression. Therefore, in practice, the benefit of SVM's typically comes from using non-linear kernels to model non-linear decision boundaries. * Strengths: SVM's can model non-linear decision boundaries, and there are many kernels to choose from. They are also fairly robust against overfitting, especially in high-dimensional space. * Weaknesses: However, SVM's are memory intensive, trickier to tune due to the importance of picking the right kernel, and don't scale well to larger datasets. Currently in the industry, random forests are usually preferred over SVM's. * Implementations: Python / R 2.5. NAIVE BAYES Naive Bayes (NB) is a very simple algorithm based around conditional probability and counting. Essentially, your model is actually a probability table that gets updated through your training data. To predict a new observation, you'd simply ""look up"" the class probabilities in your ""probability table"" based on its feature values. It's called ""naive"" because its core assumption of conditional independence (i.e. all input features are independent from one another) rarely holds true in the real world. * Strengths: Even though the conditional independence assumption rarely holds true, NB models actually perform surprisingly well in practice, especially for how simple they are. 
They are easy to implement and can scale with your dataset. * Weaknesses: Due to their sheer simplicity, NB models are often beaten by models properly trained and tuned using the previous algorithms listed. * Implementations: Python / R 3. CLUSTERING Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset. Examples include customer segmentation, grouping similar items in e-commerce, and social network analysis. Because clustering is unsupervised (i.e. there's no ""right answer""), data visualization is usually used to evaluate results. If there is a ""right answer"" (i.e. you have pre-labeled clusters in your training set), then classification algorithms are typically more appropriate. K-Means 3.1. K-MEANS K-Means is a general purpose algorithm that makes clusters based on geometric distances (i.e. distance on a coordinate plane) between points. The clusters are grouped around centroids, causing them to be globular and have similar sizes. This is our recommended algorithm for beginners because it's simple, yet flexible enough to get reasonable results for most problems. * Strengths: K-Means is hands-down the most popular clustering algorithm because it's fast, simple, and surprisingly flexible if you pre-process your data and engineer useful features. * Weaknesses: The user must specify the number of clusters, which won't always be easy to do. In addition, if the true underlying clusters in your data are not globular, then K-Means will produce poor clusters. * Implementations: Python / R 3.2. AFFINITY PROPAGATION Affinity Propagation is a relatively new clustering technique that makes clusters based on graph distances between points. The clusters tend to be smaller and have uneven sizes. * Strengths: The user doesn't need to specify the number of clusters (but does need to specify 'sample preference' and 'damping' hyperparameters). * Weaknesses: The main disadvantage of Affinity Propagation is that it's quite slow and memory-heavy, making it difficult to scale to larger datasets. In addition, it also assumes the true underlying clusters are globular. * Implementations: Python / R 3.3. HIERARCHICAL / AGGLOMERATIVE Hierarchical clustering, a.k.a. agglomerative clustering, is a suite of algorithms based on the same idea: (1) Start with each point in its own cluster. (2) For each cluster, merge it with another based on some criterion. (3) Repeat until only one cluster remains and you are left with a hierarchy of clusters. * Strengths: The main advantage of hierarchical clustering is that the clusters are not assumed to be globular. In addition, it scales well to larger datasets. * Weaknesses: Much like K-Means, the user must choose the number of clusters (i.e. the level of the hierarchy to ""keep"" after the algorithm completes). * Implementations: Python / R 3.4. DBSCAN DBSCAN is a density based algorithm that makes clusters for dense regions of points. There's also a recent new development called HDBSCAN that allows varying density clusters. * Strengths: DBSCAN does not assume globular clusters, and its performance is scalable. In addition, it doesn't require every point to be assigned to a cluster, reducing the noise of the clusters (this may be a weakness, depending on your use case). * Weaknesses: The user must tune the hyperparameters 'epsilon' and 'min_samples,' which define the density of clusters. DBSCAN is quite sensitive to these hyperparameters. 
* Implementations: Python / R PARTING WORDS We've just taken a whirlwind tour through modern algorithms for the ""Big 3"" machine learning tasks: Regression, Classification, and Clustering. In Part 2 (coming soon), we will look at algorithms for more situational tasks, such as Dimensionality Reduction (i.e. Feature Selection or Extraction), Density Estimation, and Anomaly Detection. However, we want to leave you with a few words of advice based on our experience: 1. First... practice, practice, practice. Reading about algorithms can help you find your footing at the start, but true mastery comes with practice. As you work through projects and/or competitions, you'll develop practical intuition, which unlocks the ability to pick up almost any algorithm and apply it effectively. 2. Second... master the fundamentals. There are dozens of algorithms we couldn't list here, and some of them can be quite effective in specific situations. However, almost all of them are some adaptation of the algorithms on this list, which will provide you a strong foundation for applied machine learning. 3. Finally, remember that better data beats fancier algorithms. In applied machine learning, algorithms are commodities because you can easily switch them in and out depending on the problem. However, effective exploratory analysis, data cleaning, and feature engineering can significantly boost your results. If you'd like to learn more about the applied machine learning workflow and how to efficiently train professional-grade models, we invite you to sign up for our free 7-day email crash course . For more over-the-shoulder guidance, we also offer a comprehensive masterclass that further explains the intuition behind many of these algorithms and teaches you how to apply them to real-world problems. Share Google Linkedin TweetLEAVE A RESPONSE CANCEL REPLY Name* Email* Website* Denotes Required Field RECOMMENDED READING * Modern Machine Learning Algorithms: Strengths and Weaknesses * The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All * The 5 Levels of Machine Learning Iteration * R vs Python for Data Science: Summary of Modern Advances * Python Machine Learning Tutorial, Scikit-Learn: Wine Snob Edition Copyright © 2016 · EliteDataScience.com · All Rights Reserved * Home * Terms of Service * Privacy Policy","Get to know the ML landscape through this practical, concise overview of modern machine learning algorithms. Plus, we'll discuss the tradeoffs of each.",Modern Machine Learning Algorithms,Live,51 132,"* United States IBM® * Site map Search IBM Developer Advocacy * Services * Set Up a Secure Gateway * Cloudant * Migrate CSV data to dashDB * Migrate PureData for Analytics Data to dashDB * Migrate Data with the Lift Data Load API * Compose * Spark * dashDB * IBM Graph * Data Connect * Lift * BigInsights on Cloud * Watson Analytics * DB2 on Cloud * DataStage on Cloud * Master Data Management on Cloud * Informix on Cloud * Blog * Showcases * Search Resources * Events Services to get , build , and analyze data on the ibm cloud Set Up a Secure GatewayLearn how to set up a secure gateway as the first step to migrating your data to dashDB using IBM Bluemix Lift. You can also… CloudantA fully-managed NoSQL database as a service (DBaaS) built from the ground up to scale globally, run non-stop, and handle a wide variety of data… Migrate CSV data to dashDBLearn how to migrate your CSV data to dashDB using IBM Bluemix Lift. 
You can also read a transcript of this video Read the migration… Migrate PureData for Analytics Data to dashDBLearn how to migrate data from IBM PureData for Analytics to dashDB using IBM Bluemix Lift. You can also read a transcript of this video… Migrate Data with the Lift Data Load APIThe IBM Bluemix Lift Data Load API allows you to perform your migration from on-premises sources to targets on the cloud. The IBM Bluemix Lift… ComposeProduction-ready hosting for the following databases: MongoDB with SSL, Elasticsearch, RethinkDB, PostgreSQL, Redis, etcd, and RabbitMQ. SparkAnalytics for Apache Spark provides fast, in-memory, distributed analytics processing of large data sets. dashDBTrue business intelligence comes from the ability to glean insights from your data. To get them, you need a place where you can combine data… IBM GraphIBM Graph is an easy-to-use, fully managed graph database service for storing, querying, and visualizing data points, their connections, and properties. IBM Graph is based… Data ConnectData Connect is a cloud-based data refinery that transforms raw data into relevant and actionable information. Find data, shape it, and deliver it to applications… LiftMigrate data from on-premises to the cloud quickly and securely. BigInsights on CloudIBM BigInsights on Cloud provides Hadoop-as-a-service on IBM’s SoftLayer global cloud infrastructure. It offers the performance and security of an on-premises deployment without the cost… Watson AnalyticsWatson Analytics offers you the benefits of advanced analytics without the complexity. A smart data discovery service available on the cloud, it guides data exploration,… DB2 on CloudIBM DB2 on Cloud offering provides a database on IBM’s SoftLayer® global cloud infrastructure. It offers customers the rich features of an on-premise DB2 deployment… DataStage on CloudIBM DataStage on Cloud provides IBM InfoSphere DataStage on the IBM SoftLayer global cloud infrastructure. It offers the rich features of the on-premises DataStage deployment… Master Data Management on CloudIBM Master Data Management on Cloud provides IBM Master Data Management Advanced Edition on IBM Softlayer global cloud infrastructure. It offers the rich features of… Informix on CloudToday’s businesses are embracing the virtualization and automation of cloud computing to decrease costs and increase the deliverables of their IT departments. The time-tested characteristics… Search Topic Advanced Search Language Technology Powered by the Simple Search Service i What's This?The most popular Topics, Technologies and Languages are determined by the Simple Search Service - a microservice that lets you quickly create a faceted search engine. See what else IBM can do for you. Learn More about the Simple Search Service CloudDataServices Labs Open Menu * * Services * Back to Navigation * Watson Analytics * Migrate Data with the Lift Data Load API * Informix on Cloud * Migrate PureData for Analytics Data to dashDB * Set Up a Secure Gateway * Blog * Showcases * Search resources * Back to Navigation * Events NEW VIDEOS! HOW TO BUILD AN APP USING IBM GRAPH -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Lauren Schaefer 12/14/16Lauren Schaefer Learn More Recent Posts * New videos! How to build an app using IBM Graph Watch how to build a storefront web app with IBM Graph. * What’s all the hoopla about graph databases? 
Learn why you'd want to use a Graph database and see how to get started. What do you do when you need to take a simple, static website and turn it into an online storefront with personalized recommendations? You create a new graph database using IBM Graph, and you start coding! Over the last few months, I’ve been doing just that. While I’ve been busy coding, I’ve been documenting my progress in videos. So, sit back, relax, and enjoy my video playlist! If you’d like to try my demo app yourself, visit http://laurenslovelylandscapegraph.mybluemix.net . You can get your own copy of the code here . Or better yet, you can deploy the app to Bluemix with the simple click of a button so you can have your own running copy of the app: I’m currently building the recommendation engine for the app. Follow me on Twitter for updates: @Lauren_Schaefer . Happy graphing! * Graph",Watch how to build a storefront web app with IBM Graph.,Build an app using IBM Graph,Live,52 135,"Jake Shelley, PM on IBM Watson Data Platform, Nov 15 -------------------------------------------------------------------------------- INTRODUCING STREAMS DESIGNER Starting today, users will be able to access Streams Designer through the Watson Data Platform. Streams Designer is a brand new IDE for building applications using real time data. WHAT IS STREAMS DESIGNER? Building real-time applications can be intimidating. Streams Designer makes the process easy and accessible by allowing you to simply drag and drop operators to shape, model, and transform your data as it flows from inputs to outputs. Streams Designer will allow new users to get their feet wet building real-time applications without having to dive deep into complex libraries and tools. Existing users will also love how quickly they can build and test new flows. WHAT’S NEW IN THE SERVICE? Here are a couple of highlights of the functionality being delivered in Streams Designer. * Drag and drop interface for real-time applications: Streams Designer promises to make real-time analysis more accessible. You can drag and drop operators onto a canvas and connect them to create a pipeline for your data to flow through. Streams Designer offers a drag and drop IDE to create real-time applications * Monitor your flow in real time: While your flow is running, Streams Designer provides a dashboard for you to monitor the throughput of events as they pass through operators. You can also see the events and their attributes as they pass from operator to operator. You can quickly determine the health and status of your flow without having to check outputs and logs. Monitor the health and status of your flow in the real-time dashboard * Handle common streaming use cases with a constantly growing list of operators: Today you can create flows that leverage models, filter by geofences, and aggregate clickstream data. Use the getting started wizard to set up a flow using a template. The team is continuously working on new operators and use cases, so if you don’t see what you need today, let us know and we’ll get on it!
-------------------------------------------------------------------------------- HOW DO I GET STARTED? Getting started is easy and free . If you don’t have a Watson Data Platform account, sign up here . After you finish registering, select Streams Designer from the Tools menu or add it to an existing project. The team is incredibly excited to open up Streams Designer to a wider audience. We’ve come a long way, but there is a lot more coming! Look for updates in this blog. If you’d like more information about IBM Streaming Analytics you can find it here . * Real Time Analytics * Streaming Analytics * IBM One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingJAKE SHELLEY PM on IBM Watson Data Platform FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Starting today, users will be able to access Streams Designer through the Watson Data Platform. Streams Designer is a brand new IDE for building applications using real time data. ",Introducing Streams Designer,Live,53 138,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe ×BLOGS 8 WAYS TO TURN DATA INTO VALUE WITH APACHE SPARK MACHINE LEARNING Post Comment October 18, 2016 by Alex Liu Chief Data Scientist, Analytics Services, IBM Follow me on LinkedInEven as Apache Spark becomes increasingly easy to use, it is also becoming organizations’ go-to solution for executing big data computations. Not surprisingly, then, more companies than ever are adopting Spark. BUILDING AN ANALYTICS OPERATING SYSTEM When Databricks looked into 900 organizations’ use of Apache Spark in July 2016, an even clearer picture emerged. Spark played an essential role in building real-time streaming use cases for more than half (51%) of respondents, and 82% said the same when asked about advanced analytics. Similarly, use of Spark’s machine learning capabilities for production purposes jumped from 13% in 2015 to 18% in 2016. Within the computing community, increasing numbers of corporations, IBM among them, have helped enhance the capabilities of Spark. In particular, IBM backs Spark as the “analytics operating system” and accordingly has become one of the top contributors to Spark 2.0.0, as well as one of the biggest contributors to Spark’s machine learning capabilities . Data compiled by the IBM WW Competitive and Product Strategy Team. In the wake of much favorable media attention paid to Spark, many corporations have adopted Spark on paper—or have at least downloaded it with an eye to future use. Yet only a fraction have actually used Spark, let alone implemented it as their core analytics platform. TURNING DATA INTO VALUE THROUGH MACHINE LEARNING In the modern business environment, implementation of any platform, Apache Spark or not, requires practical justifications. Accordingly, the foundation for any serious Spark adoption is, as always, Spark’s power to turn data into value. 
Drawing on my own consulting experience as well as on some of my own research , I’ll share eight ways of using Spark’s machine learning capabilities to turn data into value. 1. OBTAIN A HOLISTIC VIEW OF BUSINESS In today's competitive world, many corporations work hard to gain a holistic view or a 360 degree view of customers, for many of the key benefits as outlined by data analytics expert Mr. Abhishek Joshi . In many cases, a holistic view was not obtained, partially due to the lack of capabilities to organize huge amount of data and then to analyze them. But Apache Spark’s ability to compute quickly while using data frames to organize huge amounts of data can help researchers quickly develop analytical models that provide a holistic view of the business, adding value to related business operations. To realize this value, however, an analytical process, from data cleaning to modeling, must still be completed. 2. ENHANCE FRAUD DETECTION WITH TIMELY UPDATES To avoid losing millions or even billions of dollars to the ever-changing fraudulent schemes that plague the modern financial landscape, banks must use fraud detection models that let them quickly adopt new data and update their models accordingly. The machine learning capabilities offered by Apache Spark can help make this possible. 3. USE HUGE AMOUNTS OF DATA TO ENHANCE RISK SCORING For financial organizations, even tiny improvements to risk scoring can bring huge profits merely by avoiding defaults. In particular, the addition of data can help heighten the accuracy of risk scoring, allowing financial institutions to predict default. Although adding data can be a very challenging prospect from the standpoint of traditional credit scoring, Apache Spark can simplify the risk scoring process. 4. AVOID CUSTOMER CHURN BY RETHINKING CHURN MODELING Losing customers means losing revenue. Not surprisingly, then, companies strive to detect potential customer churn through predictive modeling, allowing them to implement interventions aimed at retaining customers. This might sound easy, but it can actually be very complicated: Customers leave for reasons that are as divergent as the customers themselves are, and products and services can play an important, but hidden, role in all this. What’s more, merely building models to predict churn for different customer segments—and with regard to different products and services—isn’t enough; we must also design interventions, then select the intervention judged most likely to prevent a particular customer from departing. Yet even doing this requires the use of analytics to evaluate the results achieved—and, eventually, to select interventions from an analytical standpoint. Amid this morass of choices, Apache Spark’s distributed computing capabilities can help solve previously baffling problems. 5. DEVELOP MEANINGFUL PURCHASE RECOMMENDATIONS Recommendations for purchases of products and services can be very powerful when made appropriately, and they have become expected features of e-commerce platforms, with many customers relying on recommendations to guide their purchases. Yet developing recommendations at all means developing recommendations for each customer—or, at the very least, for small segments of customers. Apache Spark can make this possible by offering the distributed computing and streaming analytics capabilities that have become invaluable tools for this purpose. 6. 
DRIVE LEARNING BY AVOIDING STUDENT ATTRITION AND PERSONALIZING LEARNING Big data is no longer solely the province of business—it has come to play a central role in education, particularly as universities seek to combat student churn, including by providing personalized education. In the modern educational environment, a combination of Apache Spark–based student churn modeling and recommendation systems can add significant value, both material and nonmaterial, to educational institutions. 7. HELP CITIES MAKE DATA-DRIVEN DECISIONS Pursuant to laws and regulations enacted at various levels of government, US cities are increasingly making their collected data publicly available—the data.gov portal is a well-known example. Certainly, as seen in New York , the open data thus disseminated is an important enabler of data-driven decision making at the municipal level. But US cities are only just beginning to generate value in this way, partly because of the difficulties of organizing this mass of data in easily used forms and the challenge of applying suitable predictive models. However, as we’ve already observed in open data meetups, including an IBM-sponsored meetup in Glendale , Apache Spark and other open-source tools, such as R, are indeed helping municipalities derive increasing value from open data. 8. PRODUCE SUITABLE CUSTOMER SEGMENTATIONS USING TELECOMMUNICATIONS DATA Many giant telecommunications companies, in the United States as well as around the world, have collected huge amounts of data, some of which they make available to their partners and customers. But using this data to create value often remains a significant challenge: The data is stored using special formats and chiefly comprises text, not numeric, information—and that’s apart from any special data issues that may arise, including those involving missing cases or missing content. Fortunately, Apache Spark, when used together with R and IBM SPSS, can help companies work effectively with special data formats while handling special data issues and providing modeling algorithms suited for work with both numbers and text—bringing software solutions together to offer additional ways of creating value. For more information about these ways of using Apache Spark, including detailed plans of action, check out my book Apache Spark Machine Learning Blueprints , available on Amazon. Reflecting IBM’s focus on Apache Spark, the machine learning capabilities of Apache Spark will be a main focus at the IBM Insight at World of Watson 2016 conference , scheduled for 24–27 October in Las Vegas. I hope to see you there, where I’ll be joining my colleagues. Look out for me at select events and in the IBM bookstore for a chance to meet up at one of my book signings. Follow @IBMBigData Topics: Analytics , Big Data Education , Big Data Use Cases , Data Scientists , Hadoop Tags: Apache Spark , churn , counterfraud , data analytics , data science , e-commerce , education , Finance , fraud , IBM SPSS , machine learning , Public Sector , R , risk , segmentation , telecommunicationsRELATED CONTENT PODCAST DATA SCIENCE EXPERT INTERVIEW: DEZ BLANCHFIELD, CRAIG BROWN, DAVID MATHISON, JENNIFER SHIN AND MIKE TAMIR PART 2 Take a peek at the future of data science in this discussion with five thought leaders in the data analytics industry, the second installment of a two-part interview recorded at the IBM Insight at World of Watson 2016 conference. 
Listen to Podcast Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike Tamir part 1 Blog Calling all TM1 users: Your next on-premises planning solution is here Video Dez Blanchfield's predictions based on what he learned at World of Watson 2016 Podcast Cyber Beat Live: Can analytics and cognitive computing stop cyber criminals? Blog Accessing the power of R through a robust statistical analysis tool Podcast Finance in Focus: Meet Watson—your new surveillance officer Video Insurers: Isn't it time to go beyond traditional views of policyholders relations? Video IBM Incentive Compensation Management: Improve sales results and operational efficiencies Blog The cognitive level of surveillance for financial institutions Video Dez Blanchfield's top 3 takeaways from World of Watson 2016 Video Recommender System with Elasticsearch: Nick Pentreath & Jean-François Puget Video Hyperparameter optimization: Sven Hafeneger View the discussion thread. IBM * Site Map * Privacy * Terms of Use * 2014 IBM FOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes More * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes SearchEXPLORE BY TOPIC: Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Sales Performance Management Content Analytics Customer Analytics Entity Analytics Financial Performance Management Insight Services Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Presentation Calling all IBM TM1 users! There’s a new on-premises solution in town Podcast The unusual suspects in cyber warfareMORE Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Presentation Calling all IBM TM1 users! 
There’s a new on-premises solution in town Podcast The unusual suspects in cyber warfare Blog Calling all TM1 users: Your next on-premises planning solution is here Presentation 8 innovative ideas for data architects Video Dez Blanchfield's predictions based on what he learned at World of Watson 2016 Blog Internet of Things: A continuum of change with opportunities galore Presentation Calling all IBM TM1 users! There’s a new on-premises solution in town Blog Calling all TM1 users: Your next on-premises planning solution is here Presentation 8 innovative ideas for data architectsMORE Blog Internet of Things: A continuum of change with opportunities galore Presentation Calling all IBM TM1 users! There’s a new on-premises solution in town Blog Calling all TM1 users: Your next on-premises planning solution is here Presentation 8 innovative ideas for data architects Blog Accessing the power of R through a robust statistical analysis tool Video Insurers: Isn't it time to go beyond traditional views of policyholders relations? Video IBM Incentive Compensation Management: Improve sales results and operational efficiencies Podcast The unusual suspects in cyber warfare Podcast Cyber Beat Live: Can analytics and cognitive computing stop cyber criminals? Podcast Finance in Focus: Meet Watson—your new surveillance officer Video Insurers: Isn't it time to go beyond traditional views of policyholders relations?MORE Podcast The unusual suspects in cyber warfare Podcast Cyber Beat Live: Can analytics and cognitive computing stop cyber criminals? Podcast Finance in Focus: Meet Watson—your new surveillance officer Video Insurers: Isn't it time to go beyond traditional views of policyholders relations? Blog The cognitive level of surveillance for financial institutions Blog Dynamic duo: Big data and design thinking Video Data streams in telecom: Koen Dejonghe Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike... Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike...MORE Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike... Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike... 
Presentation 8 innovative ideas for data architects Video Dez Blanchfield's predictions based on what he learned at World of Watson 2016 Blog Accessing the power of R through a robust statistical analysis tool * Home * Explore By Topic * Use Cases * All * Acquire, Grow & Retain Customers * Create New Business Models * Improve IT Economics * Manage Risk * Optimize Operations & Reduce Fraud * Transform Financial Processes * Industries * All * Banking * Consumer Products * Education * Energy & Utilities * Government * Healthcare & Life Sciences * Industrial * Insurance * Media & Entertainment * Retail * Telecommunications * Analytics * All * Content Analytics * Customer Analytics * Entity Analytics * Social Media Analytics * Technology * All * Business Intelligence * Cloud Database * Data Governance * Data Warehouse * Database Management Systems * Data Science * Hadoop & Spark * Internet of Things * Predictive Analytics * Streaming Analytics * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chat * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Big Data & Analytics Heroes * For Developers * Events * Upcoming Events * Webcasts * Twitter Chat * Meetups * Around The Web * About Us * Contact Us * Search Site",Discover eight ways that Apache Spark’s machine learning capabilities are driving the modern business.,8 ways to turn data into value with Apache Spark machine learning,Live,54 141,"PREDICT FLIGHT DELAYS WITH APACHE SPARK MLLIB, FLIGHTSTATS, AND WEATHER DATA David Taieb / August 4, 2016Flight delays are an inconvenience. Wouldn’t it be great to predict how likely a flight is to be delayed? You could remove uncertainty and let travelers plan ahead. Usually, the weather is to blame for delays. So I’ve crafted an analytics solution based on weather data and past flight performance. This solution takes weather data from IBM Insights for Weather and combines it with flight history from flightstats.com to build a predictive model that can forecast delays. To load and combine all this data, we use our Simple Data Pipe open source tool to move it into a NoSQL Cloudant database. Then I use Spark MLLib to train predictive models using supervised learning algorithms and cross-validate them. ABOUT PREDICTIVE MODELING To create a solution that can make accurate predictions, we need to tease meaningful information out of our data to craft a predictive model that can make guesses about future events. We do this using our historical weather and flight data, which we divvy up into 3 parts: * the training set helps discover potentially predictive variables and relationships between them. * the test set assesses the strength of these relationships and improves them, shaping our model. * Finally the blind set validates the model. Here’s the iterative flow: SET UP A FLIGHTSTATS ACCOUNT We get our historical data from flightstats.com, so you’ll need to create an account to get access to their data sets. Save Time! If you don’t feel like walking through flightstats account setup. but want to understand the analytics, you can use a sample database I created. Skip ahead to the Create Spark Instance section to set up the app. 1. Sign up for a free developer account at FlightStats.com . 2. Fill out the form and monitor email for confirmation link (access to APIs may take up to 24 hours). 3. 
Once you get your access confirmation email, go to https://developer.flightstats.com/admin/applications and copy your Application ID and Application Key (you will need them in a few minutes). Tip: While you’re here, you can also explore the flightstats APIs: – https://developer.flightstats.com/api-docs-scheduledFlights/v1 – https://developer.flightstats.com/api-docs/airports/v1 CREATE A SPARK INSTANCE 1. Login to Bluemix (or sign up for a free trial) . 2. Create a new space. If you’ve been working in Bluemix already, create a new space to have a separate, clean working area for new apps and services. On the upper left of your Bluemix dashboard, click + Create a Space and name it flightpredict or whatever you want and click Create . 3. On your Bluemix dashboard, click Work with Data . Click New Service . Find and click Apache Spark then click Choose Apache Spark . Click Create . Click the New Instance button. DEPLOY SIMPLE DATA PIPE The Simple Data Pipe is a handy data movement tool our team created to help you get and combine JSON data for use where you need it. The fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too. Using my sample credentials? In that case, you don’t need to import data with the pipe. Feel free to read and understand, but then skip ahead to: Create an IPython Notebook . If you would rather deploy manually , or have any issues, refer to the readme . When deployment is done, leave this Deployment Succeeded page open. You’ll return here in a minute. ADD INSIGHTS FOR WEATHER SERVICE To work its magic, the flight predict connector that we’re about to install needs weather data. So add IBM’s Insights for Weather service now, by following these steps: 1. Open a new browser window or tab, and in Bluemix, go to the top menu, and click Catalog . 2. In the Search box, type Weather , then click the Insights for Weather tile. 3. Under app , click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. 4. In Selected plan choose Premium plan to ensure you’ll have enough authorized API calls to try out this app. Not ready to lay down your credit card? If you want to understand this tutorial, without stepping through all installations and data loads, you can follow along using our sample data. Just skip ahead to Create an IPython Notebook and run the notebook without changing any credentials. 5. Click Create . 6. If you’re prompted to restage your app, do so by clicking Restage . INSTALL FLIGHTSTATS CONNECTOR I created a custom connector for the Simple Data Pipe app that loads and combines historical flight data from flightstats.com with weather data from IBM Insights for Weather. Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry . 1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button. 2. Click the package.json file to open it. 3. Edit the package.json file to add the following line to the dependencies list:""simple-data-pipe-connector-flightstats"": ""*"" Tip: Be sure to end the line above your new line with a comma and follow proper JSON syntax. 4. From the menu, choose File Save . 5. Press the Deploy app button and wait for the app to deploy again. LOAD THE DATA We’ll load 2 sets of data, an initial set of flight data from 10 major airports, and a test set, that the connector prepares for you. LOAD INITIAL DATA SET 1. 
Launch simple data pipe in one of the following ways: * In the code editor where your redeployed, go to the toolbar and click the Open button for your simple data pipe app. * Or, in Bluemix, go to the top menu and click Dashboard , then on your Simple Data Pipe app tile, click the Open URL button. 2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe . 3. Click the Type dropdown list, and choose Flight Stats .When you added a Flightstats connector earlier, you added the option you’re choosing now. 4. In Name , enter training (or anything you want). 5. If you want, enter a Description . 6. Click Save and continue . 7. Enter the Flightstats App ID and App Key you copied when you set up your FlightStats account. 8. Click Connect to FlightStats . You see a You’re connected confirmation message. 9. Click Save and continue . 10. On the Filter Data screen, click the dropdown arrow and select Mega SubSet from 10 busiest airports . Then click Save and continue . 11. Click Skip , to bypass scheduling. 12. Click Run now . View your progress: If you want, you can see the data load in-process. In a separate browser tab or window, open or return to Bluemix. Open your Simple Data Pipe app, go the menu on the left, and click Logs . When the data’s done loading, you see a Pipe Run complete! message. LOAD TEST SET Create a new pipe again to load test data. 1. In your Simple Data Pipe app, click Create a new Pipe . 2. In the Type dropdown, select Flight Stats . 3. In Name enter test . 4. If you want, enter a Description . 5. Click Save and Continue . 6. Enter the Flightstats App ID and App Key you copied when you set up your FlightStats account. 7. Click Connect to FlightStats . You see a You’re connected confirmation message. 8. Click Save and continue . 9. On the Filter Data screen, click the dropdown arrow and select Test set . Then click Save and continue . CREATE AN IPYTHON NOTEBOOK Shortcuts: If you’ve opted to use my sample credentials, go through the following steps to create the notebook and run its commands. If you want to skip these notebook creation steps too, you can follow the rest of this tutorial by viewing this prebuilt notebook on Github: https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb Create a notebook on Bluemix: 1. Go to your Bluemix dashboard and open your Spark service. 2. Click the Notebooks button. 3. Click the New Notebook button. 4. Click the From URL tab. 5. Name it whatever you want and enter the following in the Notebook URL field: https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb INSTALL PYTHON PACKAGE AND ADD SERVICE CREDENTIALS Here, we install the Python Library I created, which lets you write code inline within notebook cells and encapsulate helper APIs within the Python package. This package helps keep our notebook short and performs most of the hard work. ( See this library on GitHub .) 1. 
Run the first cell of the notebook, which contains the following command: sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/training.py"") sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/run.py"") import training #module contains apis to train the models import run #module contains apis to run the models Tip: An alternative method to install the package (not recommended for use in this tutorial) is to use pip: !pip install --user --exists-action=w --egg git+https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git#egg=flightPredict Compare these 2 ways of using helper Python packages – SparkContext.addPyFile . Easy addition of python module file, supports multiple module files via zip format, and recommended during development where frequent code changes occur. – egg distribution package: pip install from PyPi server or file server (like GitHub) . Persistent install across sessions, and recommended in production. ADD CREDENTIALS Before your new notebook can work with flight and weather data, it needs access. To grant it, add your Cloudant and Weather service credentials to the notebook. Using my sample credentials? Skip ahead to Step 4 and confirm that you see the following values: cloudantHost: dtaieb.cloudant.com cloudantUserName: weenesserliffircedinvers cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5 weatherUrl: https://4b88408f-11e5-4ddc-91a6-fbd442e84879:p6hxeJsfIb@twcservice.mybluemix.net 1. In Bluemix, open your app’s dashboard. 2. In the menu on the left, click Environment Variables . 3. Copy credentials for Cloudant and Weather Insights. 4. Return to your notebook, and in the second cell, paste in your credentials, replacing the ones there. (If you’re just following along in the notebook, leave existing credentials in place.) 5. Run that cell to import python modules the notebook uses and to connect to services. TRAIN THE MACHINE LEARNING MODELS 1. Load training set in Spark SQL DataFrame. Within the next cell, make sure the training dbName is your dbname from Cloudant. (To find it, go to your Simple Data Pipe app dashboard, click the Cloudant tile, then click Launch . The Cloudant dashboard shows your dbname.) Then run the following code: dbName=""pycon_flightpredict_training_set"" %time cloudantdata = training.loadDataSet(dbName,""training"") %time cloudantdata.printSchema() %time cloudantdata.count() 2. Visualize classes in scatter plot.Run the next 3 cells to plot delays based on factors like temperature, pressure, and wind speed. These plots are good first step to check distribution and possibly identify patterns. 3. Load the training data as an RDD of LabeledPoint.Run the following code to Spark SQL connector to load data into a DataFrame. trainingData = training.loadLabeledDataRDD(""training"") trainingData.take(5) 4. Train multiple classification models. Here we apply several machine-learning classification algorithms. To ensure accuracy of our predictions, we test the following different methods, and use cross-validation to choose the best one. Run the next few cells to train: * Logistic Regression Mode * NaiveBayes Model * Decision Tree Model * Random Forest Model TEST THE MODELS 1. Load test dataMake sure your dbname is the test database name from Cloudant (check your Cloudant dashboard as you did in the preceding section). 
Then run the following code: dbTestName=""pycon_flightpredict_test_set"" testCloudantdata = training.loadDataSet(dbTestName,""test"") testCloudantdata.count() 2. Run Accuracy metricsRun the next cell to compare the performance of the models. 3. Run the next few cells to get confusion matrixes for each model. While the metrics table we just created can tell us which model performs well overall, the confusion matrixes let us see the performance of individual classes (like Delayed less than 2 hrs ) and help us decide if we need more training data or if we need to change classes or other variables. 4. Plot the distribution of your data with Histograms Run the code in cell 15 to refine classifications and see a bar chart. Each bar is a bin (group of data points). You can specify different numbers of bins to examine data distribution and identify outliers. This info, combined with the confusion matrix results, helps you quickly uncover issues with your data. Then you can fix them and create a better predictive model. If you see an extremely long tail here (lots of bins that yield few results), you may have a data distribution issue, which you could solve by tweaking your classes. For example, this graph prompted me to change Delayed more than 4 hours and Delayed less than 2 hours to shorter increments of: Delayed less than 13 minutes , Delayed between 13-41 minutes , and Delayed more than 41 minutes . Doing so improved accuracy and helped us include the most meaningful results in our model. 5. Customize the training handler. Run the cell beneath the bar chart to provide new classification and add day of departure as a new feature. This code also re-builds the models, re-computes accuracy metrics. RUN THE MODELS Now our predictive model is in place! Our app is working with enough accuracy to let flyers enter flight details and see the likelihood of a delay. Run the final cell. If you want, replace the flight details (in red) with info on an upcoming flight of yours and run it again to see if you’ll make it on time. CONCLUSION Predictive modeling is an art form and an intensely iterative process. It requires substantial data sets and a fast, flexible way to test and tweak approaches. Simple Data Pipe let us load the pertinent data into Cloudant. From there, we used IBM Analytics for Apache Spark to create a notebook for analysis and modeling. You saw how flexible a Python notebook can be. Using it in combination with APIs in my Python package let us leverage Spark MLLIB to train predictive models and cross-validate fast and effectively. Feel free to play with this code and extend it. For example, a great improvement for deploying this app in production, would be to create a custom card for Google Now that automatically notifies a mobile user of impending flight delays and then proposes alternative flight routes using Freebird. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Build a Machine Learning model with Apache Spark MLLib to predict flight delays based on weather data and past performance.,"Predict Flight Delays with Apache Spark MLLib, FlightStats, and Weather Data",Live,55 143,"INTRODUCING THE SIMPLE AUTOCOMPLETE SERVICE Glynn Bird / May 31, 2016We have all seen auto-complete on web forms. The field label is Town . We start typing “M” then “a” and before we know it, a pull-down list has appeared suggesting some words that begin with the letters we’ve typed: The more characters we type, the smaller the list gets and we can click on the correct town name at any time. Websites build such tools in one of two ways: 1. The entire data set is transferred to the web page and autocomplete happens within the browser 2. No data is transferred to the browser; each keypress triggers a search for matching items on a server-side API The first solution is best for small data sets, but when the list of possible values is larger (say hundreds, thousands, or even millions of options), then the client-server approach is much more efficient. It is this second scenario that the Simple Autocomplete Service is built to cover. WHAT IS THE SIMPLE AUTOCOMPLETE SERVICE? The Simple Autocomplete Service is a Node.js web app built with the Express framework that lets you upload multiple data sets to a cloud service which then operates a fast and efficient autocomplete API. Later in this article, we’dr is: it uses a Redis in-memory database to store and index the data. Here are some example API calls from a deployed Simple Autocomplete Service instance that has been locked down so that it is now read-only: * https://simple-autocomplete-service.mybluemix.net/api/countries?term=bo * https://simple-autocomplete-service.mybluemix.net/api/presidents?term=W Notice how the urls show the two individual data sets that have been uploaded ( countries and presidents ). The search string that you want to lookup is supplied as a term parameter. The API can then be plumbed into a webpage to provide auto-complete on a form control. You can run the application locally in conjunction with a local Redis instance or deploy to the IBM Bluemix platform-as-a-service with a connected Redis by Compose service. INSTALLATION Click this button to deploy the app to Bluemix, IBM’s cloud development platform. If you don’t yet have a Bluemix account, you’ll be prompted you to sign up. (You’ll also find this button in the Simple Autocomplete Service source code repository .) Upon deployment, you’ll get an error saying that deployment failed. No worries! It didn’t really. It just requires Redis. Click the APP DASHBOARD button and click your new Simple Autocomplete Service to open it. To add Redis: 1. In a new browser tab, head over to https://www.compose.io/ and sign up for an account there. 2. Hit Create Deployment then choose Redis and wait for a cluster to be created for you. 3. 
On the Getting Started page that appears, click reveal your password and leave this page open. You’ll come back for these Redis credentials in a moment. 4. Head back to Bluemix and where you have your Simple Autocomplete Service open. 5. Click ADD A SERVICE OR API and choose Redis by Compose . 6. Enter your credentials as follows: * For Username enter only x * In Password enter your Redis service password. * For Public hostname/Port enter the string that appears in the TCP Connection String box after the @ character, replacing the : character with / as illustrated: When you enter these credentials, your completed form looks something like this: 7. Click Create . 8. When prompted, click Restage . When the app is done staging, click its URL to launch and see the service in action. UPLOADING DATA Find or create a file with your own data. It should be a plain text file with one text string per line, like: William Mary John From the menu on the left, click Create an index , enter an Index name , and click the Upload button. Scroll up to Current Indexes and in a few seconds you see your new index in the list. Try a few auto-completes by typing letters in the Test box. You can add as many indexes as you need (or until you run out of Redis memory). The Simple Autocomplete Service is really an API service. You can try the API call directly in a new browser window, just visit the URL of this form: https://MYAPP.mybluemix.net/api/MYINDEX?term=a replacing MYAPP with your application domain and MYINDEX with the name you chose when you created the index. LOCKING DOWN THE SERVICE When you’re happy with your data, you can lock down the Simple Autocomplete Service so that it becomes a read-only API. Simply add a custom environment variable to your Bluemix app called “LOCKDOWN” with a value of “true”. Your application will restart and only the autocomplete API will function. INTEGRATING WITH YOUR OWN FORMS The Simple Autocomplete Service is CORS-enabled, so it should be simple to plumb it into your own forms. If you have an HTML page with jQuery and jQueryUI in, you can create an auto-complete form with a few lines of code:
THE ANATOMY OF THE SIMPLE AUTOCOMPLETE SERVICE FIRST PRINCIPLES Redis is chosen as the database for this task because it stores its data in memory (it is flushed to disk periodically). In-memory databases are extremely fast and the auto-complete use-case requires high performance because the use of the web form will expect a speedy reponse to the keypresses they make. The heart of our autocomplete service is the data that is uploaded. Any text file containing one line per value should be fine e.g. . . . Mabel Mabelle Mable Mada Madalena Madalyn Maddalena Maddi Maddie . . . One solution to find matches from this data is to store the values in a list and scan every member for matches when performing an autocomplete request. This solution is fine for small data sets but as it involves scanning the whole collection from top to bottom to establish a list of matches it becomes increasingly inefficient as the data size increases. It is said to have a O(N) complexity, because the effort required to perform the search increases linearly with the size of the data set (N). In a blog post from 2010 the creator of Redis, Salvatore Sanfilippo, discusses a more efficient solution which involves pre-calculating the possible search strings and placing them into an “sorted set” data structure in Redis. Sorted sets are usually used for ordering keys by value (e.g. a high-score table), but in this case it keeps our candidate search strings in alphabetical order. The solution outlined in the blog post is used in a slightly modified form in the Simple Autocomplete Service , with our sorted set containing keys made up of combinations of possible letter combinations: . . ""m"" ""ma"" ""mab"" ""mabe"" ""mabel*Mabel"" ""mabell"" ""mabelle*Mabelle"" ""mabl"" ""mable*Mable"" . . Some features of the data to notice: * this index occupies more space than a simple list of the complete values * the keys are stored in alphabetical order * the keys are lowercased and filtered for punctuation before saving for a predictable, case-sensitive match * at the end of each sequence of keys we store the unaltered original key using the notation mabelle*Mabelle , with the original unfiltered string placed after the asterisk. This allows the service to access the original string in its original case. * the keys are not repeated – there is only one key for “ma” despite several names starting with “ma” to save space in the index * the method of storage is most efficient on large data sets with lots of repetition at the starts of words IMPORTING THE DATA The Simple Autocomplete Service adds strings to the Redis database using the ZADD command to create a sorted set: ZADD myindex 0 ""m"" ZADD myindex 0 ""ma"" ZADD myindex 0 ""mab"" ZADD myindex 0 ""mabe"" ZADD myindex 0 ""mabel*Mabel"" The zero in the syntax above is the score of the sorted set. We set all the strings to have the same score so that only alphabetical ordering takes place. QUERYING THE DATA When we wish to find the auto-complete solutions for the string ma , we need to find our way to ma in our Redis index and then retrieve a number of keys that occur after that point in the index. In Redis, we use two queries to do this 1. ZRANK to find the place in the index that matches our search string 2. ZRANGE 75 to find the 75 lines that occur in the index from that point on. 75 is number hard-coded into the service to return a reasonable number of solutions to the query. e.g. 
ZRANK myindex ma (integer) 7429 ZRANGE myindex 7429 7504 1) ""ma"" 2) ""mab"" 3) ""mab*Mab"" 4) ""mabe"" 5) ""mabel"" 6) ""mabel*Mabel"" 7) ""mabell"" 8) ""mabelle*Mabelle"" 9) ""mabl"" 10) ""mable*Mable"" 11) ""mad"" 12) ""mada"" 13) ""mada*Mada"" 14) ""madal"" 15) ""madale"" 16) ""madalen"" 17) ""madalena*Madalena"" 18) ""madaly"" 19) ""madalyn*Madalyn"" 20) ""madd"" . . . The service only keeps the keys with an asterisk in the middle (the complete answers) and then returns those values to the user: [""Mab"",""Mabel"",""Mabelle"",""Mable""] As the index is stored in order by Redis, the ZRANK function is an O(log n) operation, meaning that its complexity only increases in proportion to the logarithm of the data size (N). The ZRANGE query is similarly efficient so the amount of work required to perform a search using the ZRANK/ZRANGE technique remains almost constant whatever the data size. How many strings would we need have in our text file before the ZRANK/ZRANGE solution out-performs scanning a linear list? The answer is less than 100. It’ the indexed solution wins in all but the very simplest cases. HOMEWORK As it stands, the Simple Autocomplete Service only matches strings that begin with search phrase. What if I wanted to match on the second word of a phrase? Imagine I indexed actors names: Molly Ringwald Judd Nelson Paul Gleason Anthony Michael Hall Ally Sheedy Emilio Estevez I want autocomplete to work when I type “A..L” as well as if I type “S..H”. That would involve indexing additional data a al all ally ally s ally sh ally she ally shee ally sheed ally sheedy*Ally Sheedy s sh shee sheed sheedy*Ally Sheedy The index would be bigger in this case, but it should work. If anyone would like to modify the source code repository and send me a pull request, I’d be happy to incorporate this as an option. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Easily add autocomplete to your web form fields. Simply upload your data set using this cloud service then use its fast and efficient autocomplete API.,Introducing the Simple Autocomplete Service,Live,56 147,"WILL WOLF DATA SCIENCE THINGS AND THOUGHTS ON THE WORLD * About * Archive * RSS * EN * ES TRANSFER LEARNING FOR FLIGHT DELAY PREDICTION VIA VARIATIONAL AUTOENCODERS WILL WOLF May 8, 2017In this work, we explore improving a vanilla regression model with knowledge learned elsewhere. As a motivating example, consider the task of predicting the number of checkins a given user will make at a given location. 
Our training data consist of checkins from 4 users across 4 locations in the week of May 1st, 2017 and looks as follows: user_id location checkins 1 a 3 1 b 6 2 c 7 2 d 2 3 a 1 3 c 4 4 b 9 4 d 4We'd like to predict how many checkins user 3 will make at location b in the coming week. How well will our model do? While each user_id might represent some unique behavior - e.g. user 3 sleeps late yet likes going out for dinner - and each location might represent its basic characteristics - e.g. location b is an open-late sushi bar - this is currently unbeknownst to our model. To this end, gathering this metadata and joining it to our training set is a clear option. If quality, thorough, explicit metadata are available, affordable and practical to acquire, this is likely the path to pursue. If not, we'll need to explore a more creative approach. How far can we get with implicit metadata learned from an external task? TRANSFER LEARNING ¶ Transfer learning allows us to use knowledge acquired in one task to improve performance in another. Suppose, for example, that we've been tasked with translating Portuguese to English and are given a basic phrasebook from which to learn. After a week, we take a lengthy test. A friend of ours - a fluent Spanish speaker who knows nothing of Portuguese - is tasked the same. Who gets a better score? PREDICTING FLIGHT DELAYS ¶ The goal of this work is to predict flight delays - a basic regression task. The data comprise 6,872,294 flights from 2008 via the United States Department of Transportation's Bureau of Transportation Statistics . I downloaded them from stat-computing.org . Each row consists of, among other things: DayOfWeek , DayofMonth , Month , ScheduledDepTimestamp (munged from CRSDepTime ), Origin , Dest and UniqueCarrier (airline), and well as CarrierDelay , WeatherDelay , NASDelay , SecurityDelay , LateAircraftDelay - all in minutes - which we will sum to create total_delay . We'll consider a random sample of 50,000 flights to make things easier. (For a more in-depth exploration of these data, please see this project's repository .) ROUTES, AIRPORTS ¶ While we can expect DayOfWeek , DayofMonth and Month to give some seasonal delay trends - delays are likely higher on Sundays or Christmas, for example - the Origin and Dest columns might suffer from the same pathology as user_id and location above: a rich behavioral indicator represented in a crude, ""isolated"" way. (A token in a bag-of-words model, as opposed to its respective word2vec representation, gives a clear analogy.) How can we infuse this behavioral knowledge into our original task? AN AUXILIARY TASK ¶ In 2015, I read a particularly-memorable blog post entitled Towards Anything2Vec by Allen Tran. Therein, Allen states: Like pretty much everyone, I'm obsessed with word embeddings word2vec or GloVe. Although most of machine learning in general is based on turning things into vectors, it got me thinking that we should probably be learning more fundamental representations for objects, rather than hand tuning features. Here is my attempt at turning random things into vectors, starting with graphs. In this post, Allen seeks to embed nodes - U.S. patents, incidentally - in a directed graph into vector space by predicting the inverse of the path-length to nodes nearby. 
To me, this (thus-far) epitomizes the ""data describe the individual better than they describe themself:"" while we could ask the nodes to self-classify into patents on ""computing,"" ""pharma,"" ""materials,"" etc., the connections between these nodes - formal citations, incidentally - will capture their ""true"" subject matters (and similarities therein) better than the authors ever could. Formal language, necessarily, generalizes. OpenFlights contains data for over ""10,000 airports, train stations and ferry terminals spanning the globe"" and the routes between. My goal is to train a neural network that, given an origin airport and its latitude and longitude, predicts the destination airport, latitude and longitude. This network will thereby ""encode"" each airport into a vector of arbitrary size containing rich information about, presumably, the diversity and geography of the destinations it services: its ""place"" in the global air network. Surely, a global hub like Heathrow - a fact presumably known to our neural network, yet unknown to our initial dataset with one-hot airport indices - has longer delays on Christmas than than a two-plane airstrip in Alaska. Crucially, we note that while our original (down-sampled) dataset contains delays amongst 298 unique airports, our auxiliary routes dataset comprises flights amongst 3186 unique airports. Notwithstanding, information about all airports in the latter is distilled into vector representations then injected into the former; even though we might not know about delays to/from Casablanca Mohammed V Airport (CMN), latent information about this airport will still be intrinsically considered when predicting delays between other airports to/from which CMN flies. DATA PREPARATION ¶ Our flight-delay design matrix $X$ will include the following columns: DayOfWeek , DayofMonth , Month , ScheduledDepTimestamp , Origin , Dest and UniqueCarrier . All columns will be one-hotted for simplicity. (Alternatively, I explored mapping each column to its respective value_counts() , i.e. X.loc[:, col] = X[col].map(col_val_counts) , which led to less agreeable convergence.) Let's get started. 
In [1]:fromabcimportABCMeta,abstractmethodfromIPython.displayimportIFrame,SVGimportosimportsysroot_dir=os.path.join(os.getcwd(),'..')sys.path.append(root_dir)importfeatherfromgmplotimportgmplotimportmatplotlib.pyplotaspltimportnumpyasnpimportpandasaspdimportseabornassnsfromsklearn.metricsimportmean_squared_errorasmean_squared_error_scikitfromsklearn.model_selectionimporttrain_test_splitfromsklearn.preprocessingimportMinMaxScaler,StandardScaler%matplotlib inline sns.set(style='darkgrid') In [2]:importkeras.backendasKfromkeras.layersimportBatchNormalization,Dense,Dropout,Embedding,Flatten,Input,LayerasKerasLayerfromkeras.layers.mergeimportconcatenate,dotfromkeras.lossesimportmean_squared_errorfromkeras.modelsimportModelfromkeras.optimizersimportAdamfromkeras.regularizersimportl2fromkeras.utils.vis_utilsimportmodel_to_dotfromkeras_tqdmimportTQDMNotebookCallback In [3]:FLIGHTS_PATH='../data/flights-2008-sample.feather'# build X, yflights=feather.read_dataframe(FLIGHTS_PATH)X=flights[['DayOfWeek','DayofMonth','Month','ScheduledDepTimestamp','Origin','Dest','UniqueCarrier']].copy()y=flights['total_delay'].copy()# one-hotone_hot_matrices=[]forcolinfilter(lambdacol:col!='ScheduledDepTimestamp',X.columns):one_hot_matrices.append(pd.get_dummies(X[col]))one_hot_matrix=np.concatenate(one_hot_matrices,axis=1)X=np.hstack([X['ScheduledDepTimestamp'].values.reshape(-1,1),one_hot_matrix])# normalizeX=StandardScaler().fit_transform(X)y=np.log(y+1).values In [4]:TEST_SIZE=int(X.shape[0]*.4)X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=TEST_SIZE,random_state=42)X_val,X_test,y_val,y_test=train_test_split(X_test,y_test,test_size=int(TEST_SIZE/2),random_state=42)print('Dataset sizes:')print(' Train: {}'.format(X_train.shape))print(' Validation: {}'.format(X_val.shape))print(' Test: {}'.format(X_test.shape)) Dataset sizes: Train: (30000, 657) Validation: (10000, 657) Test: (10000, 657) FLIGHT-DELAY MODELS ¶ Let's build two baseline models with the data we have. Both models have a single ReLU output and are trained to minimize the mean squared error of the predicted delay via stochastic gradient descent. ReLU was chosen as an output activation because delays are both bounded below at 0 and bi-modal. I considered three separate strategies for predicting this distribution. 1. Train a network with two outputs: total_delay and total_delay == 0 (Boolean). Optimize this network with a composite loss function: mean squared error and binary cross-entropy, respectively. 2. Train a ""poor-man's"" hierarchical model: a logistic regression to predict total_delay == 0 and a standard regression to predict total_delay . Then, compute the final prediction as a thresholded ternary, e.g. y_pred = np.where(y_pred_lr > threshhold, 0, y_pred_reg) . Train the regression model with both all observations, and just those where total_delay > 0 , and see which works best. 3. Train a single network with a ReLU activation. This gives a reasonably elegant way to clip our outputs below at 0, and mean-squared-error still tries to place our observations into the correct mode (of the bimodal output distribution; this said, mean-squared-error may try to ""play it safe"" and predict between the modes). I chose Option #3 because it performed best in brief experimentation and was the simplest to both fit and explain. 
In [5]:classBaseEmbeddingModel(metaclass=ABCMeta):defcompile(self,optimizer,loss,*args,**kwargs):self.model.compile(optimizer,loss)defsummary(self):returnself.model.summary()deffit(self,*args,**kwargs):returnself.model.fit(*args,**kwargs)defpredict(self,X):returnself.model.predict(X)@abstractmethoddef_build_model(self):passclassSimpleRegression(BaseEmbeddingModel):def__init__(self,input_dim:int,λ:float):'''Initializes the model parameters. Args: input_dim : The number of columns in our design matrix. λ : The regularization strength to apply to the model's dense layers. '''self.input_dim=input_dimself.λ=λself.model=self._build_model()def_build_model(self):input=Input((self.input_dim,),dtype='float32')dense=Dense(144,activation='relu',kernel_regularizer=l2(self.λ))(input)output=Dense(1,activation='relu',name='regression_output',kernel_regularizer=l2(self.λ))(dense)returnModel(input,output)classDeeperRegression(BaseEmbeddingModel):def__init__(self,input_dim:int,λ:float,dropout_p:float):'''Initializes the model parameters. Args: input_dim : The number of columns in our design matrix. λ : The regularization strength to apply to the model's dense layers. dropout_p : The percentage of units to drop in the model's dropout layer. '''self.input_dim=input_dimself.λ=λself.dropout_p=dropout_pself.model=self._build_model()def_build_model(self):input=Input((self.input_dim,),dtype='float32',name='input')dense=Dense(144,activation='relu',kernel_regularizer=l2(self.λ))(input)dense=Dense(144,activation='relu',kernel_regularizer=l2(self.λ))(dense)dense=Dropout(self.dropout_p)(dense)dense=Dense(72,activation='relu',kernel_regularizer=l2(self.λ))(dense)dense=Dense(16,activation='relu',kernel_regularizer=l2(self.λ))(dense)output=Dense(1,activation='relu',name='regression_output')(dense)returnModel(input,output) In [6]:deffit_flight_model(model,X_train,y_train,X_val,y_val,epochs,batch_size=256):returnmodel.fit(x=X_train,y=y_train,batch_size=batch_size,epochs=epochs,validation_data=(X_val,y_val),verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)])defprepare_history_for_plot(history):'''Arrange the model's `history` into a ""long"" DataFrame to enable more convenient plotting. Args: history (keras.callbacks.History) : a Keras `history` object. '''results=pd.DataFrame({'train':history.history['loss'],'val':history.history['val_loss'],})results_long=pd.melt(results)results_long.columns=['dataset','loss']results_long['epoch']=2*history.epochresults_long['subject']=1returnresults_longdefplot_model_fit(history):'''Plot the training loss vs. the validation loss. Args: history (keras.callbacks.History) : a Keras `history` object. 
'''results=prepare_history_for_plot(history)plt.figure(figsize=(11,7))sns.tsplot(data=results,time='epoch',value='loss',condition='dataset',unit='subject')plt.title('Training Loss by Epoch',fontsize=13) SIMPLE REGRESSION ¶ In [7]:LEARNING_RATE=.0001simple_reg=SimpleRegression(input_dim=X.shape[1],λ=.05)simple_reg.compile(optimizer=Adam(lr=LEARNING_RATE),loss='mean_squared_error')simple_reg_fit=fit_flight_model(simple_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(simple_reg_fit) DEEPER REGRESSION ¶ In [8]:deeper_reg=DeeperRegression(input_dim=X.shape[1],λ=.03,dropout_p=.2)deeper_reg.compile(optimizer=Adam(lr=.0001),loss='mean_squared_error')deeper_reg_fit=fit_flight_model(deeper_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(deeper_reg_fit) TEST SET PREDICTIONS ¶ In [9]:y_pred_simple=simple_reg.model.predict(X_test).ravel()y_pred_deeper=deeper_reg.model.predict(X_test).ravel()mse_simple=mean_squared_error_scikit(y_test,y_pred_simple)mse_deeper=mean_squared_error_scikit(y_test,y_pred_deeper)print('Mean squared error, simple regression: {}'.format(mse_simple))print('Mean squared error, deeper regression: {}'.format(mse_deeper)) Mean squared error, simple regression: 2.331459019628268 Mean squared error, deeper regression: 2.3186310632259204 LEARNING AIRPORT EMBEDDINGS ¶ We propose two networks through which to learn airport embeddings: a dot product siamese network, and a variational autoencoder . DOT PRODUCT SIAMESE NETWORK ¶ This network takes as input origin and destination IDs, latitudes and longitudes. It gives as output a binary value indicating whether or not a flight-route between these airports exists. The airports DataFrame gives the geographic metadata. The routes DataFrame gives positive training examples for our network. To build negative samples, we employ, delightfully, ""negative sampling."" NEGATIVE SAMPLING ¶ routes gives exlusively (origin, dest, exists = 1) triplets. To create triplets where exists = 0 , we simply build them ourself: (origin, fake_dest, exists = 0) . It's that simple. Inspired by word2vec's approach to an almost identical problem, I pick fake_dest 's based on the frequency with which they occur in the dataset - more frequent samples being more likely to be selected - via: $$P(a_i) = \frac{ {f(a_i)}^{3/4} }{\sum_{j=0}^{n}\left( {f(a_j)}^{3/4} \right) }$$where $a_i$ is an airport. To choose a fake_dest for a given origin , we first remove all of the real dest 's, re-normalize $P(a)$, then take a multinomial draw. For a more complete yet equally approachable explanation, please see Goldberg and Levy . For an extremely thorough review of related methods, see Sebastian Ruder's On word embeddings - Part 2: Approximating the Softmax . VARIATIONAL AUTOENCODER ¶ DISCRIMINATIVE MODELS ¶ The previous network is a discriminative model: given two inputs origin and dest , it outputs the conditional probability that exists = 1 . While discriminative models are effective in distinguishing between output classes, they don't offer an idea of what data look like within each class itself. To see why, let's restate Bayes rule for a given input $x$: $$P(Y\vert x) = \frac{P(x\vert Y)P(Y)}{P(x)} = \frac{P(x, Y)}{P(x)}$$Discriminative classifiers jump directly to estimating $P(Y\vert x)$ without modeling its component parts $P(x, Y)$ and $P(x)$. 
Instead, as the intermediate step, they simply compute an unnormalized joint distribution $\tilde{P}(x, Y)$ and a normalizing ""partition function."" The following then gives the model's predictions for the same reason that $\frac{.2}{1} = \frac{3}{15}$: $$P(Y\vert x) = \frac{P(x, Y)}{P(x)} = \frac{\tilde{P}(x, Y)}{\text{partition function}}$$This is explained much more thoroughly in a previous blog post: Deriving the Softmax from First Principles . GENERATIVE MODELS ¶ Conversely, a variational autoencoder is a generative model: instead of jumping directly to the conditional probability of all possible outputs given a specific input, they first compute the true component parts: the joint probability distribution over data and inputs alike, $P(X, Y)$, and the distribution over our data, $P(X)$. The joint probability can be rewritten as $P(X, Y) = P(Y)P(X\vert Y)$: as such, generative models tell us the distribution over classes in our dataset, as well as the distribution of inputs within each class. Suppose we are trying to predict t-shirt colors with a 3-feature input; generative models would tell us: ""30% of your t-shirts are green - typically produced by inputs near x = [1, 2, 3] ; 40% are red - typically produced by inputs near x = [10, 20, 30] ; 30% are blue - typically produced by inputs near x = [100, 200, 300] . This is in contrast to a discriminative model which would simply compute: given an input $x$, your output probabilities are: $\{\text{red}: .2, \text{green}: .3, \text{blue}: .5\}$. To generate new data with a generative model, we draw from $P(Y)$, then $P(X\vert Y)$. To make predictions, we solicit $P(Y), P(x\vert Y)$ and $P(x)$ and employ Bayes rule outright. MANIFOLD ASSUMPTION ¶ The goal of both autoencoders is to discover underlying ""structure"" in our data: while each airport can be one-hot encoded into a 3186-dimensional vector, we wish to learn a, or even the, reduced space in which our data both live and vary. This concept is well understood through the ""manifold assumption,"" explained succinctly in this CrossValidated thread : Imagine that you have a bunch of seeds fastened on a glass plate, which is resting horizontally on a table. Because of the way we typically think about space, it would be safe to say that these seeds live in a two-dimensional space, more or less, because each seed can be identified by the two numbers that give that seed's coordinates on the surface of the glass. Now imagine that you take the plate and tilt it diagonally upwards, so that the surface of the glass is no longer horizontal with respect to the ground. Now, if you wanted to locate one of the seeds, you have a couple of options. If you decide to ignore the glass, then each seed would appear to be floating in the three-dimensional space above the table, and so you'd need to describe each seed's location using three numbers, one for each spatial direction. But just by tilting the glass, you haven't changed the fact that the seeds still live on a two-dimensional surface. So you could describe how the surface of the glass lies in three-dimensional space, and then you could describe the locations of the seeds on the glass using your original two dimensions. In this thought experiment, the glass surface is akin to a low-dimensional manifold that exists in a higher-dimensional space : no matter how you rotate the plate in three dimensions, the seeds still live along the surface of a two-dimensional plane. 
In other words, the full spectrum of that which characterizes an airport can be described by just a few numbers. Varying one of these numbers - making it larger or smaller - would result in an airport of slightly different ""character;"" if one dimension were to represent ""global travel hub""-ness, a value of $-1000$ along this dimension might give us that hangar in Alaska. In the context of autoencoders (and dimensionality reduction algorithms), ""learning 'structure' in our data"" means nothing more than finding that ceramic plate amidst a galaxy of stars . GRAPHICAL MODELS ¶ Variational autoencoders do not have the same notion of an ""output"" - namely, ""does a route between two airports exist?"" - as our dot product siamese network. To detail this model, we'll start near first principles with probabilistic graphical models with our notion of the ceramic plate in mind: Coordinates on the plate detail airport character; choosing coordinates - say, [global_hub_ness = 500, is_in_asia = 500] - allows us to generate an airport. In this case, it might be Seoul. In variational autoencoders, ceramic-plate coordinates are called the ""latent vector,"" denoted $z$. The joint probability of our graphical model is given as: $$P(z)P(x\vert z) = P(z, x)$$Our goal is to infer the priors that likely generated these data via Bayes rule: $$P(z\vert x) = \frac{P(z)P(x\vert z)}{P(x)}$$The denominator is called the evidence ; we obtain it by marginalizing the joint distribution over the latent variables: $$P(x) = \int P(x\vert z)P(z)dz$$Unfortunately, this asks us to consider all possible configurations of the latent vector $z$. Should $z$ exist on the vertices of a cube in $\mathbb{R}^3$, this would not be very difficult; should $z$ be a continuous-valued vector in $\mathbb{R}^{10}$, this becomes a whole lot harder. Computing $P(x)$ is problematic. VARIATIONAL INFERENCE ¶ In fact, we could attempt to use MCMC to compute $P(z\vert x)$; however, this is slow to converge. Instead, let's compute an approximation to this distribution then try to make it closely resemble the (intractable) original. In this vein, we introduce variational inference , which ""allows us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function)."" 1 Let's choose our approximating distribution as simple, parametric and one we know well: the Normal (Gaussian) distribution. Were we able to compute $P(z\vert x) = \frac{P(x, z)}{P(x)}$, it is instrinsic that $z$ is contingent on $x$; when building our own distribution to approximate $P(z\vert x)$, we need to be explicit about this contingency: different values for $x$ should be assumed to have been generated by different values of $z$. 
Let's write our approximation as follows, where $\lambda$ parameterizes the Gaussian for a given $x$: $$q_{\lambda}(z\vert x)$$Finally, as stated previously, we want to make this approximation closely resemble the original; the KL divergence quantifies their difference: $$KL(q_{\lambda}(z\vert x)\Vert P(z\vert x)) = \int{q_{\lambda}(z\vert x)\log\frac{q_{\lambda}(z\vert x)}{P(z\vert x)}dz}$$Our goal is to obtain the argmin with respect to $\lambda$: $$q_{\lambda}^{*}(z\vert x) = \underset{\lambda}{\arg\min}\ \text{KL}(q_{\lambda}(z\vert x)\Vert P(z\vert x))$$Expanding the divergence, we obtain: $$ \begin{align*} KL(q_{\lambda}(z\vert x)\Vert P(z\vert x)) &= \int{q_{\lambda}(z\vert x)\log\frac{q_{\lambda}(z\vert x)}{P(z\vert x)}dz}\\ &= \int{q_{\lambda}(z\vert x)\log\frac{q_{\lambda}(z\vert x)P(x)}{P(z, x)}dz}\\ &= \int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x) -\log{P(z, x)} + \log{P(x)}}\bigg)dz}\\ &= \int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(z, x)}}\bigg)dz + \log{P(x)}\int{q_{\lambda}(z\vert x)dz}\\ &= \int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(z, x)}}\bigg)dz + \log{P(x)} \cdot 1 \end{align*} $$As such, since only the left term depends on $\lambda$, minimizing the entire expression with respect to $\lambda$ amounts to minimizing this term. Incidentally, the opposite (negative) of this term is called the ELBO , or the ""evidence lower bound."" To see why, let's plug the ELBO into the equation above and solve for $\log{P(x)}$: $$\log{P(x)} = ELBO(\lambda) + KL(q_{\lambda}(z\vert x)\Vert P(z\vert x))$$In English: ""the log of the evidence is at least the lower bound of the evidence plus the divergence between our true posterior $P(z\vert x)$ and our (variational) approximation to this posterior $q_{\lambda}(z\vert x)$."" Since the left term above is the opposite of the ELBO, minimizing this term is equivalent to maximizing the ELBO. Let's restate the equation and rearrange further: $$ \begin{align*} ELBO(\lambda) &= -\int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(z, x)}}\bigg)dz\\ &= -\int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(x\vert z)} - \log{P(z)}}\bigg)dz\\ &= -\int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} - \log{P(z)}}\bigg)dz + \log{P(x\vert z)}\int{q_{\lambda}(z\vert x)dz}\\ &= -\int{q_{\lambda}(z\vert x)\log{\frac{q_{\lambda}(z\vert x)}{P(z)}}dz} + \log{P(x\vert z)} \cdot 1\\ &= \log{P(x\vert z)} -KL(q_{\lambda}(z\vert x)\Vert P(z)) \end{align*} $$Our goal is to maximize this expression, or minimize the opposite: $$-\log{P(x\vert z)} + KL(q_{\lambda}(z\vert x)\Vert P(z))$$In machine learning parlance: ""minimize the negative log likelihood of our data (generated via $z$) plus the divergence between the distribution (ceramic plate) of $z$ and our approximation thereof."" See what we did? FINALLY, BACK TO NEURAL NETS ¶ The variational autoencoder consists of an encoder network and a decoder network. ENCODER ¶ The encoder network takes as input $x$ (an airport) and produces as output $z$ (the latent ""code"" of that airport, i.e. its location on the ceramic plate). As an intermediate step, it produces multivariate Gaussian parameters $(\mu_{x_i}, \sigma_{x_i})$ for each airport. These parameters are then plugged into a Gaussian $q$, from which we sample a value $z$. The encoder is parameterized by a weight matrix $\theta$. DECODER ¶ The decoder network takes as input $z$ and produces $P(x\vert z)$: a reconstruction of the airport vector (hence, autoencoder). 
It is parameterized by a weight matrix $\phi$. LOSS FUNCTION ¶ The network's loss function is the sum of the mean squared reconstruction error of the original input $x$ and the KL divergence between the true distribution of $z$ and its approximation $q$. Given the reparameterization trick (next section) and another healthy scoop of algebra, we write this in Python code as follows: '''`z_mean` gives the mean of the Gaussian that generates `z``z_log_var` gives the log-variance of the Gaussian that generates `z``z` is generated via: z = z_mean + K.exp(z_log_var / 2) * epsilon = z_mean + K.exp( log(z_std)**2 / 2 ) * epsilon = z_mean + K.exp( (2 * log(z_std) / 2 ) * epsilon = z_mean + K.exp( log(z_std) ) * epsilon = z_mean + z_std * epsilon'''kl_loss_numerator=1+z_log_var-K.square(z_mean)-K.exp(z_log_var)kl_loss=-0.5*K.sum(kl_loss_numerator,axis=-1)defloss(x,x_decoded):returnmean_squared_error(x,x_decoded)+kl_loss REPARAMETERIZATION TRICK ¶ When back-propagating the network's loss to $\theta$ , we need to go through $z$ — a sample taken from $q_{\theta}(z\vert x)$. Trivially, this sample is a scalar; intuitively, its derivative should be non-zero. In solution, we'd like the sample to depend not on the stochasticity of the random variable, but on the random variable's parameters . To this end, we employ the ""reparametrization trick"" , such that the sample depends on these parameters deterministically . As a quick example, this trick allows us to write $\mathcal{N}(\mu, \sigma)$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. Drawing samples this way allows us to propagate error backwards through our network. AUXILIARY DATA ¶ ROUTES ¶ In [10]:# import routesroutes_cols=['airline','airline_id','origin','origin_id','dest','dest_id','codeshare','stops','equipment']routes=pd.read_csv('../data/routes.csv',names=routes_cols,usecols=['origin','dest'])routes['exists']=1# how many unique routes are there? by how many airlines are they flown?unique_routes=routes.groupby(['origin','dest']).count()print('There are {} unique routes.'.format(unique_routes.shape[0]))unique_routes.sort_values(by='exists',ascending=False).head(20).T There are 37595 unique routes. 
Out[10]: origin ORD ATL ORD HKT HKG CAN DOH ATL AUH BKK JFK MIA LHR ATL KGL MSY MCT CNX CDG dest ATL ORD MSY BKK BKK HGH BAH MIA MCT HKG LHR ATL JFK LAX DFW EBB JFK AUH BKK JFK exists 20 19 13 13 12 12 12 12 12 12 12 12 12 11 11 11 11 11 11 11 In [11]:# compute airport frequencies for negative samplingall_airports=routes['origin'].tolist()+routes['dest'].tolist()airport_counts=pd.Series(all_airports).value_counts()airport_probs=(airport_counts**.75)/(airport_counts**.75).sum() In [12]:defcompute_unique_negative_dests(airport,routes=routes):returnroutes[routes['origin']!=airport]['dest'].unique()defdraw_negative_samples(n,neg_dest_probs):samples_mode=np.infwhilesamples_mode>=.75*n:negative_sample_idxs=np.random.multinomial(n,neg_dest_probs)samples_mode=negative_sample_idxs.max()ifn>=4else-np.infnegative_samples=[]fordest,countinzip(neg_dest_probs.index,negative_sample_idxs):ifcount>0:negative_samples+=count*[dest]returnnegative_samples In [13]:# append `routes` with negative samplesnegative_sample_dfs=[]fori,airportinenumerate(set(routes['origin'])):n_routes=len(routes[routes['origin']==airport])negative_dests=compute_unique_negative_dests(airport)negative_dest_probs=airport_probs[negative_dests]/airport_probs[negative_dests].sum()negative_samples=draw_negative_samples(n_routes,negative_dest_probs)df=pd.DataFrame({'origin':airport,'dest':negative_samples,'exists':0})negative_sample_dfs.append(df)negative_routes=pd.concat(negative_sample_dfs,axis=0)routes=pd.concat([routes,negative_routes]) AIRPORTS ¶ In [14]:# import airportsairports_cols=['Airport ID','Name','City','Country','IATA','ICAO','Latitude','Longitude','Altitude','Timezone','DST','Tz database time zone','Type','Source']airports=pd.read_csv('../data/airports.csv',names=airports_cols,usecols=['Name','IATA','Latitude','Longitude','Altitude'],index_col=['IATA'])# join origin and destination airport metadata to `routes`origin_airports=airports.copy()origin_airports.columns=['origin_name','origin_latitude','origin_longitude','origin_altitude']dest_airports=airports.copy()dest_airports.columns=['dest_name','dest_latitude','dest_longitude','dest_altitude']routes=routes\ .join(origin_airports,on='origin')\ .join(dest_airports,on='dest')\ .dropna()\ .reset_index(drop=True)# map airport names to a unique indexdelall_airportsall_airports=routes['origin'].tolist()+routes['dest'].tolist()unique_airports=set(all_airports)airport_to_id={airport:indexforindex,airportinenumerate(unique_airports)}routes['origin_id']=routes['origin'].map(airport_to_id)routes['dest_id']=routes['dest'].map(airport_to_id) In [15]:# build X_routes, y_routesgeo_cols=['origin_latitude','origin_longitude','dest_latitude','dest_longitude']X_r=routes[['origin_id','dest_id']+geo_cols].copy()y_r=routes['exists'].copy()X_r.loc[:,geo_cols]=StandardScaler().fit_transform(X_r[geo_cols])# split training, test datatest_size=X_r.shape[0]//3val_size=test_size//2X_train_r,X_test_r,y_train_r,y_test_r=train_test_split(X_r,y_r,test_size=test_size,random_state=42)X_val_r,X_test_r,y_val_r,y_test_r=train_test_split(X_test_r,y_test_r,test_size=val_size,random_state=42)print('Dataset sizes:')print(' Train: {}'.format(X_train_r.shape))print(' Validation: {}'.format(X_val_r.shape))print(' Test: {}'.format(X_test_r.shape)) Dataset sizes: Train: (87630, 6) Validation: (21907, 6) Test: (21907, 6) DOT PRODUCT EMBEDDING MODEL ¶ To start, let's train our model with a single latent dimension then visualize the results on the world map. 
In [16]:N_UNIQUE_AIRPORTS=len(unique_airports)classDotProductEmbeddingModel(BaseEmbeddingModel):def__init__(self,embedding_size:int,λ:float,n_unique_airports=N_UNIQUE_AIRPORTS):'''Initializes the model parameters. Args: embedding_size : The desired number of latent dimensions in our embedding space. λ : The regularization strength to apply to the model's dense layers. '''self.n_unique_airports=n_unique_airportsself.embedding_size=embedding_sizeself.λ=λself.model=self._build_model()def_build_model(self):# inputsorigin=Input(shape=(1,),name='origin')dest=Input(shape=(1,),name='dest')origin_geo=Input(shape=(2,),name='origin_geo')dest_geo=Input(shape=(2,),name='dest_geo')# embeddingsorigin_embedding=Embedding(self.n_unique_airports,output_dim=self.embedding_size,embeddings_regularizer=l2(self.λ),name='origin_embedding')(origin)dest_embedding=Embedding(self.n_unique_airports,output_dim=self.embedding_size,embeddings_regularizer=l2(self.λ))(dest)# dot productdot_product=dot([origin_embedding,dest_embedding],axes=2)dot_product=Flatten()(dot_product)dot_product=concatenate([dot_product,origin_geo,dest_geo],axis=1)# dense layerstanh=Dense(10,activation='tanh')(dot_product)tanh=BatchNormalization()(tanh)# outputexists=Dense(1,activation='sigmoid')(tanh)returnModel(inputs=[origin,dest,origin_geo,dest_geo],outputs=[exists]) In [17]:dp_model=DotProductEmbeddingModel(embedding_size=1,λ=.0001)dp_model.compile(optimizer=Adam(lr=.001),loss='binary_crossentropy')SVG(model_to_dot(dp_model.model).create(prog='dot',format='svg')) Out[17]: G 5275051792 origin: InputLayer 4974217368 origin_embedding: Embedding 5275051792->4974217368 5275052128 dest: InputLayer 5274577440 embedding_1: Embedding 5275052128->5274577440 5274577552 dot_1: Dot 4974217368->5274577552 5274577440->5274577552 5274449512 flatten_1: Flatten 5274577552->5274449512 5273038296 concatenate_1: Concatenate 5274449512->5273038296 5275052464 origin_geo: InputLayer 5275052464->5273038296 5275052744 dest_geo: InputLayer 5275052744->5273038296 5272987184 dense_6: Dense 5273038296->5272987184 5272987968 batch_normalization_1: BatchNormalization 5272987184->5272987968 5272758368 dense_7: Dense 5272987968->5272758368 In [18]:dp_model_fit=dp_model.fit(x=[X_train_r['origin_id'],X_train_r['dest_id'],X_train_r[['origin_latitude','origin_longitude']].as_matrix(),X_train_r[['dest_latitude','dest_longitude']].as_matrix(),],y=y_train_r,batch_size=256,epochs=10,validation_data=([X_val_r['origin_id'],X_val_r['dest_id'],X_val_r[['origin_latitude','origin_longitude']].as_matrix(),X_val_r[['dest_latitude','dest_longitude']].as_matrix()],y_val_r),verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)])plot_model_fit(dp_model_fit) VISUALIZE EMBEDDINGS ¶ To visualize results, we'll: 1. Compose a list of unique origin airports. 2. Extract the learned (1-dimensional) embedding for each. 3. Scale the results to $[0, 1]$. 4. Use the scaled embedding as a percentile-index into a color gradient. Here, we've chosen the colors of the rainbow: low values are blue/purple, and high values are orange/red. 
In [19]:# compose DataFrame of unique originssubset_cols=['origin_id','origin','origin_latitude','origin_longitude']unique_origins=routes.drop_duplicates(subset=subset_cols).reset_index(drop=True)unique_origins=unique_origins[subset_cols]unique_origins.columns=['origin_id','origin','latitude','longitude']defget_dp_embeddings(dp_model,unique_origins=unique_origins):'''Returns the origin airport embeddings from the dot-product embedding model, *aligned with the index of `unique_origins`*. '''origin_embeddings=dp_model.model.get_layer(name='origin_embedding').get_weights()[0]returnorigin_embeddings[unique_origins['origin_id'].values]unique_origins['embedding']=get_dp_embeddings(dp_model) In [20]:MARKERS=['#9400D3','#BA55D3','#1E90FF','#9ACD32','#FFFF00','#FFA500','#FF6347']defprepare_colors(intensities:pd.Series,palette=MARKERS):'''Indexes scale-less color intensities into HTML color codes with respect to a given palette. Args: intensities : A Pandas Series containing values with which to scale the color palette. These values do not need to be scaled to a specific interval. palette : A list of HTML color codes (loosely) spanning the principal colors of the rainbow. Returns: list : HTML color codes. '''palette_matrix=np.array(palette)intensities=intensities.values.reshape(-1,1)percentiles=MinMaxScaler().fit_transform(intensities).ravel()percentiles-=1e-5get_percentile_marker=lambdaperc:palette[int(perc*len(palette))]returnpd.Series(percentiles)\ .map(get_percentile_marker)\ .tolist()WORLD_COORDS=[21.2770321,5.0159425,3]defplot_embeddings_on_world_map(unique_origins_df:pd.DataFrame,output_path:str,world_coords=WORLD_COORDS):'''Plots each unique origin airport on the world map, colored by its 1-dimensional network embedding. Darker colors indicate a larger embedding value. Args: unique_origins_df : A Pandas DataFrame containing at least the following columns: `latitude`, `longitude`, `embedding`. output_path : The path to which to write the HTML map file. world_coords : Respectively, the latitude, longitude, and 'zoom factor' appended to the Google Maps query string so as to focus the map on the entire world. Returns: None : Instead, writes an HTML map file to `output_path`. '''unique_origins_df['color']=prepare_colors(unique_origins_df['embedding'])gmap=gmplot.GoogleMapPlotter(*world_coords,markers_base_path='',api_key=os.environ['GOOGLE_MAPS_API_KEY'])gmap.scatter(lats=unique_origins_df['latitude'].tolist(),lngs=unique_origins_df['longitude'].tolist(),color=unique_origins_df['color'].tolist(),marker=True)gmap.draw(output_path) In [21]:plot_embeddings_on_world_map(unique_origins,output_path='../figures/dp_model_map.html') In [38]:# visit the URL for a full-screen view 👇DOT_PRODUCT_EMBED_VIZ_S3_PATH='https://willwolf-public.s3.amazonaws.com/transfer-learning-flight-delays/dp_model_map.html'IFrame(DOT_PRODUCT_EMBED_VIZ_S3_PATH,width=1000,height=800) Out[38]:VARIATIONAL AUTOENCODER ¶ In [23]:classVariationalLayer(KerasLayer):def__init__(self,output_dim:int,epsilon_std=1.):'''A custom ""variational"" Keras layer that completes the variational autoencoder. Args: output_dim : The desired number of latent dimensions in our embedding space. 
'''self.output_dim=output_dimself.epsilon_std=epsilon_stdsuper().__init__()defbuild(self,input_shape):self.z_mean_weights=self.add_weight(shape=(input_shape[1],self.output_dim),initializer='glorot_normal',trainable=True)self.z_mean_bias=self.add_weight(shape=(self.output_dim,),initializer='zero',trainable=True,)self.z_log_var_weights=self.add_weight(shape=(input_shape[1],self.output_dim),initializer='glorot_normal',trainable=True)self.z_log_var_bias=self.add_weight(shape=(self.output_dim,),initializer='zero',trainable=True)super().build(input_shape)defcall(self,x):z_mean=K.dot(x,self.z_mean_weights)+self.z_mean_biasz_log_var=K.dot(x,self.z_log_var_weights)+self.z_log_var_biasepsilon=K.random_normal(shape=K.shape(z_log_var),mean=0.,stddev=self.epsilon_std)kl_loss_numerator=1+z_log_var-K.square(z_mean)-K.exp(z_log_var)self.kl_loss=-0.5*K.sum(kl_loss_numerator,axis=-1)returnz_mean+K.exp(z_log_var/2)*epsilondefloss(self,x,x_decoded):returnmean_squared_error(x,x_decoded)+self.kl_lossdefcompute_output_shape(self,input_shape):return(input_shape[0],self.output_dim) In [24]:classVariationalAutoEncoderEmbeddingModel(BaseEmbeddingModel):def__init__(self,embedding_size:int,dense_layer_size:int,λ:float,n_unique_airports=N_UNIQUE_AIRPORTS):'''Initializes the model parameters. Args: embedding_size : The desired number of latent dimensions in our embedding space. λ : The regularization strength to apply to the model's dense layers. '''self.embedding_size=embedding_sizeself.dense_layer_size=dense_layer_sizeself.λ=λself.n_unique_airports=n_unique_airportsself.variational_layer=VariationalLayer(embedding_size)self.model=self._build_model()def_build_model(self):# encoderorigin=Input(shape=(self.n_unique_airports,),name='origin')origin_geo=Input(shape=(2,),name='origin_geo')dense=concatenate([origin,origin_geo])dense=Dense(self.dense_layer_size,activation='tanh',kernel_regularizer=l2(self.λ))(dense)dense=BatchNormalization()(dense)variational_output=self.variational_layer(dense)encoder=Model([origin,origin_geo],variational_output,name='encoder')# decoderlatent_vars=Input(shape=(self.embedding_size,))dense=Dense(self.dense_layer_size,activation='tanh',kernel_regularizer=l2(self.λ))(latent_vars)dense=Dense(self.dense_layer_size,activation='tanh',kernel_regularizer=l2(self.λ))(dense)dense=BatchNormalization()(dense)dest=Dense(self.n_unique_airports,activation='softmax',name='dest',kernel_regularizer=l2(self.λ))(dense)dest_geo=Dense(2,activation='linear',name='dest_geo')(dense)decoder=Model(latent_vars,[dest,dest_geo],name='decoder')# end-to-endencoder_decoder=Model([origin,origin_geo],decoder(encoder([origin,origin_geo])))returnencoder_decoder In [25]:vae_model=VariationalAutoEncoderEmbeddingModel(embedding_size=1,dense_layer_size=20,λ=.003)vae_model.compile(optimizer=Adam(lr=LEARNING_RATE),loss=[vae_model.variational_layer.loss,'mean_squared_logarithmic_error'],loss_weights=[1.,.2])SVG(model_to_dot(vae_model.model).create(prog='dot',format='svg')) Out[25]: G 5274278880 origin: InputLayer 5304745880 encoder: Model 5274278880->5304745880 5274277200 origin_geo: InputLayer 5274277200->5304745880 5330847224 decoder: Model 5304745880->5330847224 In [26]:# build VAE training, test 
setsone_hot_airports=np.eye(N_UNIQUE_AIRPORTS)X_train_r_origin=one_hot_airports[X_train_r['origin_id']]X_val_r_origin=one_hot_airports[X_val_r['origin_id']]X_test_r_origin=one_hot_airports[X_test_r['origin_id']]X_train_r_dest=one_hot_airports[X_train_r['dest_id']]X_val_r_dest=one_hot_airports[X_val_r['dest_id']]X_test_r_dest=one_hot_airports[X_test_r['dest_id']]print('Dataset sizes:')print(' Train: {}'.format(X_train_r_origin.shape))print(' Validation: {}'.format(X_val_r_origin.shape))print(' Test: {}'.format(X_test_r_origin.shape)) Dataset sizes: Train: (87630, 3186) Validation: (21907, 3186) Test: (21907, 3186) In [27]:vae_model_fit=vae_model.fit(x=[X_train_r_origin,X_train_r[['origin_latitude','origin_longitude']].as_matrix()],y=[X_train_r_dest,X_train_r[['dest_latitude','dest_longitude']].as_matrix()],batch_size=1024,epochs=5,validation_data=([X_val_r_origin,X_val_r[['origin_latitude','origin_longitude']].as_matrix()],[X_val_r_dest,X_val_r[['dest_latitude','dest_longitude']].as_matrix()],),verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)])plot_model_fit(vae_model_fit) VISUALIZE ¶ In [28]:defget_vae_embeddings(vae_model,unique_origins=unique_origins):'''Returns the origin airport embeddings from the variational autoencoder embedding model, *aligned with the index of `unique_origins`*. '''encoder_inputs=[one_hot_airports[unique_origins['origin_id']],unique_origins[['latitude','longitude']].as_matrix()]returnvae_model.model.get_layer('encoder').predict(encoder_inputs)unique_origins['embedding']=get_vae_embeddings(vae_model)plot_embeddings_on_world_map(unique_origins,output_path='../figures/vae_model_map.html') In [39]:# visit the URL for a full-screen view 👇VAE_EMBED_VIZ_S3_PATH='https://willwolf-public.s3.amazonaws.com/transfer-learning-flight-delays/vae_model_map.html'IFrame(VAE_EMBED_VIZ_S3_PATH,width=1000,height=800) Out[39]:FINALLY, TRANSFER THE LEARNING ¶ Retrain both models with 20 latent dimensions, then join the embedding back to our original dataset. 
In [30]:# dot product embeddingEMBEDDING_SIZE=20dp_model=DotProductEmbeddingModel(embedding_size=EMBEDDING_SIZE,λ=.0001)dp_model.compile(optimizer=Adam(lr=LEARNING_RATE),loss='binary_crossentropy')dp_model_fit=dp_model.fit(x=[X_r['origin_id'],X_r['dest_id'],X_r[['origin_latitude','origin_longitude']].as_matrix(),X_r[['dest_latitude','dest_longitude']].as_matrix(),],y=y_r,batch_size=256,epochs=5,verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)]) In [31]:# variational autoencoder embeddingvae_model=VariationalAutoEncoderEmbeddingModel(embedding_size=EMBEDDING_SIZE,dense_layer_size=30,λ=.003)vae_model.compile(optimizer=Adam(lr=LEARNING_RATE),loss=[vae_model.variational_layer.loss,'mean_squared_logarithmic_error'],loss_weights=[1.,.2])X_r_origin=one_hot_airports[X_r['origin_id']]X_r_dest=one_hot_airports[X_r['dest_id']]vae_model_fit=vae_model.fit(x=[X_r_origin,X_r[['origin_latitude','origin_longitude']].as_matrix()],y=[X_r_dest,X_r[['dest_latitude','dest_longitude']].as_matrix()],batch_size=1024,epochs=5,verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)]) EXTRACT EMBEDDINGS, CONSTRUCT JOINT DATASET ¶ In [32]:# get dot product, variational autoencoder embeddingsdp_embeddings=get_dp_embeddings(dp_model)vae_embeddings=get_vae_embeddings(vae_model)assertdp_embeddings.shape==vae_embeddings.shape,'Embedding matrices are of unequal size'# create names for embedding columnsn_embedding_dims=dp_embeddings.shape[1]dp_embedding_cols=['dp_dim_{}'.format(d)fordinrange(n_embedding_dims)]vae_embedding_cols=['vae_dim_{}'.format(d)fordinrange(n_embedding_dims)]embedding_cols=dp_embedding_cols+vae_embedding_cols# create an embeddings DataFrameembeddings_df=pd.DataFrame(data=np.concatenate([dp_embeddings,vae_embeddings],axis=1),columns=embedding_cols,index=unique_origins['origin']) In [33]:# construct joint datasetdelflights,X,yflights=feather.read_dataframe(FLIGHTS_PATH)X=flights[['DayOfWeek','DayofMonth','Month','ScheduledDepTimestamp','Origin','Dest','UniqueCarrier']].copy()y=flights['total_delay'].copy()X=X\ .join(embeddings_df,on='Origin',sort=False)\ .join(embeddings_df,on='Dest',rsuffix='_Dest',sort=False)\ .drop(['Origin','Dest'],axis=1)\ .fillna(0)# one-hotone_hot_matrices=[]embedding_cols=[colforcolinX.columnsif'_dim_'incol]column_filter=lambdacol:col!='ScheduledDepTimestamp'andcolnotinembedding_colsforcolinfilter(column_filter,X.columns):one_hot_matrices.append(pd.get_dummies(X[col]))one_hot_matrix=np.concatenate(one_hot_matrices,axis=1)X=np.concatenate([X[embedding_cols+['ScheduledDepTimestamp']],one_hot_matrix],axis=1)# normalizeX=StandardScaler().fit_transform(X)y=np.log(y+1).values In [34]:X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=TEST_SIZE,random_state=42)X_val,X_test,y_val,y_test=train_test_split(X_test,y_test,test_size=int(TEST_SIZE/2),random_state=42)print('Dataset sizes:')print(' Train: {}'.format(X_train.shape))print(' Validation: {}'.format(X_val.shape))print(' Test: {}'.format(X_test.shape)) Dataset sizes: Train: (30000, 151) Validation: (10000, 151) Test: (10000, 151) TRAIN ORIGINAL MODELS ¶ In [35]:simple_reg=SimpleRegression(input_dim=X.shape[1],λ=.05)simple_reg.compile(optimizer=Adam(lr=.0005),loss='mean_squared_error')simple_reg_fit=fit_flight_model(simple_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(simple_reg_fit) In 
[36]:deeper_reg=DeeperRegression(input_dim=X.shape[1],λ=.03,dropout_p=.2)deeper_reg.compile(optimizer=Adam(lr=.0001),loss='mean_squared_error')deeper_reg_fit=fit_flight_model(deeper_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(deeper_reg_fit) In [37]:y_pred_simple=simple_reg.model.predict(X_test).ravel()y_pred_deeper=deeper_reg.model.predict(X_test).ravel()mse_simple=mean_squared_error_scikit(y_test,y_pred_simple)mse_deeper=mean_squared_error_scikit(y_test,y_pred_deeper)print('Mean squared error, simple regression: {}'.format(mse_simple))print('Mean squared error, deeper regression: {}'.format(mse_deeper)) Mean squared error, simple regression: 2.3176028493805263 Mean squared error, deeper regression: 2.291221474968889 SUMMARY ¶ In fitting these models to both the original and ""augmented"" datasets, I spent time tuning their parameters — regularization strengths, amount of dropout, number of epochs, learning rates, etc. Additionally, the respective datasets are of different dimensionality. For these reasons, comparison between the two sets of models is clearly not ""apples to apples."" Notwithstanding, the airport embeddings do seem to provide a nice lift over our original one-hot encodings. Of course, their use is not limited to predicting flight delays: they can be used in any task concerned with airports. Additionally, these embeddings give insight into the nature of the airports themselves: those nearby in vector space can be considered as ""similar"" by some latent metric. To figure out what these metrics mean, though - it's back to the map. ADDITIONAL RESOURCES ¶ * Towards Anything2Vec * Deep Learning for Calcium Imaging * DeepWalk: Online Learning of Social Representations * Variational Autoencoder: Intuition and Implementation * Introducing Variational Autoencoders (in Prose and Code) * Variational auto-encoder for ""Frey faces"" using keras * Transfer Learning - Machine Learning's Next Frontier * Tutorial - What is a variational autoencoder? * A Beginner's Guide to Variational Methods: Mean-Field Approximation * Variational Autoencoder: Intuition and Implementation * CrossValidated - What is the manifold assumption in semi-supervised learning? * David Blei - Variational Inference * Edward - Variational Inference * On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes CODE ¶ The repository for this project can be found here . FOOTNOTES ¶ 1: A Beginner's Guide to Variational Methods: Mean-Field Approximation","In this work, we explore improving a vanilla regression model with knowledge learned elsewhere. ",Transfer Learning for Flight Delay Prediction via Variational Autoencoders,Live,57 152,"HOLDEN KARAU - BIGDATASV 2016 - #BIGDATASV - THECUBE SiliconANGLE
Published on Mar 31, 2016. 01. Holden Karau, IBM, Visits #theCUBE! (00:21) 02. Give Us An Update On Spark. (00:43) 03. Do The Hardcore Spark Developers Have To Main Stream It. (01:48) 04. There's A Lot Of Integration What Are Your Thoughts On That. (03:22) 05. Is Spark A Comparable Investment To Lynx. (04:32) 06. Give Me An Example Of The Magnitude Of Spark. (06:11) 07. Can You Give Us Examples Of Products That Are Moving To Spark. (07:24) 08. Who Is Policing The Algorithms. (08:26) 09. Where Are We In Machine Learning Put On The Process Of The Design And RunTime. (11:03) 10. Do We See Big Packet Apps Emerging For This Class Of Apps. (15:32) 11. What Is Your Take On The Status Of Machine Learning. (17:32) 12. Do You Have Another Book On The Horizon. (19:24) Track List created with http://www.vinjavideo.com --- --- Machine learning on machine learning software: It's closer than you think | #BigDataSV, by Amber Johnson | Mar 31, 2016. As the tech world pivots on game-changing applications, data scientists rise to the occasion. Such is the case with Holden Karau, principal software engineer of Big Data at IBM and coauthor of Learning Spark. When asked about the current renovations within Spark, Karau said she sees this time as an “opportunity to get rid of dead weight” by streamlining certain processes. For example, she cited getting functional and relational queries to talk to each other within Spark. Two areas of expansion include sequencing and machine learning. Karau noted another “massive expansion” in getting other applications to run on top of Spark during an interview with John Furrier (@furrier) and George Gilbert (@ggilbert41), cohosts of theCUBE from the SiliconANGLE Media team, during the BigDataSV 2016 event in San Jose, California, where theCUBE is celebrating #BigDataWeek, including news and events from the #StrataHadoop conference. The three self-described tech geeks discussed the advances within the Spark community since the bandwagon effect kicked in. Karau predicted that machine learning on machine learning software will arrive sooner than Gilbert's conservative five-year estimate. While she didn't give a specific time frame, Karau stated emphatically that it is “closer than five years.” How data science is changing software dynamics: Karau conferred with Furrier and Gilbert about several aspects of data science and how it is changing software dynamics. One side project in particular stood out. Karau is working on a Spark validator that will help with “policing quality” in regards to algorithms within pipeline models. Pipeline models present challenges regarding working at large scale while still being able to work with the Big Data interactively. When asked about getting data science to work on data science, Karau said the tech was “there-ish.” In addition, Karau is working with her coauthor, Rachel Warren, on a new book called High Performance Spark.
Karau spoke eloquently and candidly about sources of frustration in working with Spark pipeline issues, saying, “How do I save this damn thing?” However, when it comes to Spark, Karau literally wrote the book. @theCUBE #BigDataSV #StrataHadoop * CATEGORY * Science & Technology * LICENSE * Creative Commons Attribution license (reuse allowed)","01. Holden Karau, IBM, Visits #theCUBE!. (00:21) 02. Give Us An Update On Spark. (00:43) 03. Do The Hardcore Spark Developers Have To Main Stream It. (01:48)...",Advancements in the Spark Community,Live,58 155,"PUBLISHED IN AUTONOMOUS AGENTS — #AI Preetham V V #AI & #MachineLearning enthusiast. Author: Java Web Services / Internet Security & Firewalls.
VP, Brand Sciences & Products @inMobi #UltraRunner -------------------------------------------------------------------------------- HOW TO TAME THE VALLEY — HESSIAN-FREE HACKS FOR OPTIMIZING LARGE #NEURALNETWORKS Let's say you have the gift of flight (or you are riding a chopper). You are also a spy (like in the James Bond movies). You are given the topography of a long, narrow valley as shown in the image, and you are given a rendezvous point to meet a potential aide who has intelligence that is helpful for your objective. The only information you have about the rendezvous point is as follows: “Meet me at the lowest co-ordinate of ‘this long valley’ in 4 hours.” How do you go about finding the lowest co-ordinate? More to the point, how do you intend to find it within the stipulated time? Well, for complex Neural Networks with very large numbers of parameters, the error surface of the Neural Network looks very much like this long, narrow valley. Finding a “minima” in the valley can be quite tricky when you have such pathological curvature in your topography. Note: there are many posts written on second-order optimization hacks for Neural Networks. The reason I decided to write about it again is that most of them jump straight into complex Math without much explanation. Instead, I have tried to explain the Math briefly where possible and mostly point to detailed sources to learn from if you are not trained in the particular field of Math. This post shall be a bit longish due to that. In past posts, we used Gradient Descent algorithms while back-propagating, which helped us minimize the errors. You can find the techniques in the post titled “Backpropagation — How Neural Networks Learn Complex Behaviors”. LIMITATIONS OF GRADIENT DESCENT There is nothing fundamentally wrong with a Gradient Descent algorithm [or Stochastic Gradient Descent (SGD) to be precise]. In fact, we have shown that it is quite efficient for some of the Feed Forward examples we have used in the past. The problem with SGD arises when we have “Deep” Neural Networks which have more than one hidden layer, especially when the network is fairly large. Here are some illustrations of a non-monotonic error surface of a Deep Neural Network to get an idea. [Figures: Error Surface — 2, Error Surface — 2] Note that there are many minima and maxima in the illustrations. Let us quickly look at the weight update process in SGD. [Figure: SGD weight updates; a minimal sketch of this update appears below.] The problem with using SGD on such surfaces is as follows: * Since SGD is a first-order optimization method, it assumes that the error surface always looks like a plane (in the direction of descent, that is) and does not account for curvature. * When there is quadratic curvature, we apply some tricks to ensure that SGD does not just bounce off the surface, as shown in the weight update equation. * We control the momentum value using some pre-determined alpha and control the velocity by applying a learning rate epsilon. * The alpha and the epsilon buffer the speed and direction of SGD and slow down the optimization until we converge. We can only tune these hyper-parameters to get a good balance of speed versus effectiveness of SGD. But they still slow us down. * In large networks with pathological curvatures as shown in the illustration, tuning these hyper-parameters is quite challenging. * The error in SGD can suddenly start rising when you move in the direction of the gradient while traversing a long narrow valley. In fact, SGD can almost grind to a halt before it can make any progress at all.
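Since the “SGD weight updates” equation image did not survive, here is a minimal sketch of the momentum update the bullets above describe, using the alpha (momentum) and epsilon (learning rate) names from the text; compute_gradient is a hypothetical placeholder for the back-propagated gradient, not a function from the original post.

def sgd_momentum_step(w, velocity, compute_gradient, alpha=0.9, epsilon=0.01):
    # First-order information only: the local gradient of the error surface at w.
    grad = compute_gradient(w)
    # alpha dampens the previous direction; epsilon scales the new gradient step.
    # (Initialize velocity to zeros of the same shape as w before the first step.)
    velocity = alpha * velocity - epsilon * grad
    # Move the weights along the buffered descent direction.
    return w + velocity, velocity

Tuning alpha and epsilon is exactly the balancing act described above: too aggressive and the update bounces off the valley walls, too conservative and progress grinds to a halt.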
We need a better method to work with large or Deep Neural Networks. SECOND ORDER OPTIMIZATION TO THE RESCUE SGD is a first-order optimization method. First-order methods are methods that assume linear local curves; that is, we assume that we can apply linear approximations to solve equations. Some examples of first-order methods are as follows: * Gradient Descent * Sub-Gradient * Conjugate Gradient * Random co-ordinate descent There are also second-order methods, which consider the convexity or curvature of the equation and perform quadratic approximations. A quadratic approximation is an extension of a linear approximation, but it provides an additional variable to deal with, which helps create a quadratic surface around a point on the error surface. The key difference between first-order and second-order approximations is that, while the linear approximation provides a “plane” that is tangential to a point on the error surface, the second-order approximation provides a quadratic surface that hugs the curvature of the error surface. If you are new to quadratic approximations, I encourage you to check this Khan Academy lecture on Quadratic approximations. The advantage of a second-order method is that it does not ignore the curvature of the error surface. Because the curvature is being considered, second-order methods are considered to have better step-wise performance: * The full step of a second-order method points directly to the minima of a curvature (unlike first-order methods, which require multiple steps with a gradient calculation in each step). * Since a second-order method points to the minima of a quadratic curvature in one step, the only thing you have to worry about is how well the curve actually hugs the error surface. This is a good enough heuristic to deal with. * Working with the hyper-parameters, given this heuristic, becomes very efficient. The following are some second-order methods: * Newton's method * Quasi-Newton, Gauss-Newton * BFGS, (L)BFGS Let's take a look at Newton's method, which is a base method and a bit more intuitive compared to the others. YO! NEWTON, WHAT'S YOUR METHOD? Newton's Method, also called the Newton-Raphson Method, is an iterative approximation technique for finding the roots of a real-valued function. It is one of the base methods used in second-order convex optimization problems to approximate functions. Let's first look at Newton's method using the first derivative of a function. Let's say we have a function f(x) = 0, and we have some initial solution x_0 which we believe is sub-optimal. Then, Newton's method suggests we do the following: 1. Find the equation of the tangent line at x_0. 2. Find the point at which the tangent line cuts the x-axis and call this new point x_1. 3. Find the projection of x_1 onto the function f(x) = 0, which is also at x_1. 4. Now iterate again from step 1, replacing x_0 with x_1. Really, that simple. The caveat is that the method does not tell you when to stop, so we add a 5th step as follows: 5. If the change in x_n (the current value of x) is equal to or smaller than a threshold, then we stop. Here is the image that depicts the above: [Figure: Finding the optimal value of x using Newton's Method.] Here is an animation that shows the same: [animation credit] First-degree-polynomial, one dimension: here is the math for a function which is a first-degree polynomial in one dimension.
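The equation image that originally followed did not survive, so here is the standard Newton-Raphson update that the five steps above describe, as a small sketch; the example function at the bottom is an assumption chosen purely for illustration.

def newton_1d(f, f_prime, x0, tol=1e-8, max_iter=50):
    # Newton-Raphson in one dimension: x_{n+1} = x_n - f(x_n) / f'(x_n).
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / f_prime(x)   # where the tangent line at x cuts the x-axis
        if abs(x_new - x) <= tol:       # step 5: stop once the update falls below a threshold
            return x_new
        x = x_new
    return x

# Illustrative use: the positive root of f(x) = x**2 - 2, i.e. sqrt(2).
# newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)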
Second-degree-polynomial, one dimension: now we can work on the Newton approximation for a second-degree polynomial (second-order optimization) in one dimension, before we get to multiple dimensions. A second-degree polynomial is quadratic in nature and needs a second-order derivative to work with. To work with the second derivative of a function, let's use the Taylor approximation: f(x + delta_x) ≈ f(x) + f'(x) · delta_x + (1/2) · f''(x) · delta_x^2, and minimizing this quadratic in delta_x gives the Newton step delta_x = −f'(x) / f''(x). Second-degree-polynomial, multiple dimensions: suppose we are working on a second-degree polynomial with multiple dimensions; then we use the same Newton approach as above, but replace the first derivatives with a gradient and the second derivatives with a Hessian, giving the update x_{n+1} = x_n − [Hessian of f(x_n)]^(-1) · [gradient of f(x_n)]. A Hessian Matrix is a square matrix of second-order partial derivatives of a scalar function, which describes the local curvature of a multi-variable function. Specifically, in the case of a Neural Network, the Hessian is a square matrix with the number of rows and columns equal to the total number of parameters in the Neural Network. The Hessian for a Neural Network looks as follows: [Figure: Hessian Matrix of a Neural Network] WHY IS A HESSIAN-BASED APPROACH THEORETICALLY BETTER THAN SGD? Second-order optimization using Newton's method of iteratively finding the optimal x is a clever hack for optimizing the error surface because, unlike SGD, where you fit a plane at the point x_0 and then determine the step-wise jump, in second-order optimization we fit a tightly hugging quadratic curve at x_0 and directly find the minima of that curvature. This is supremely efficient and fast. But!!! Empirically though, can you now imagine computing a Hessian for a network with millions of parameters? Of course it gets very inefficient, as the amount of storage and computation required to calculate the Hessian is of quadratic order as well. So though in theory this is awesome, in practice it sucks. We need a Hack for the Hack! And the answer seems to lie in Conjugate Gradients. CONJUGATE GRADIENTS Actually, there are several quadratic approximation methods for a convex function. But the Conjugate Gradient Method works quite well for symmetric matrices which are positive-definite. In fact, Conjugate Gradients are meant to work with very large, sparse systems. Note that a Hessian is symmetric around the diagonal, the parameters of a Neural Network are typically sparse, and the Hessian of a Neural Network is positive-definite (meaning it only has positive Eigenvalues). Boy, are we in luck? If you need a thorough introduction to Conjugate Gradient Methods, go through the paper titled " An Introduction to the Conjugate Gradient Method Without the Agonizing Pain " by Jonathan Richard Shewchuk. I find it quite thorough and useful. I would suggest that you study the paper in your free time to get an in-depth understanding of Conjugate Gradients. The easiest way to explain the Conjugate Gradient (CG) is as follows: * CG descent is applicable to any quadratic form. * CG uses a step-size 'alpha' value similar to SGD, but instead of a fixed alpha, we find the alpha through a line search algorithm. * CG also needs a 'Beta', a scalar value that helps find the next direction, which is ""conjugate"" to the previous direction. You can check most of the hairy math behind the CG equations in the paper cited above.
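As a companion to the bullet points above, here is a minimal numpy sketch of conjugate gradient for solving Ax = b when A is symmetric and positive-definite. It follows the standard textbook iteration rather than any code from the original post; in the Hessian-free setting, A plays the role of the Hessian, and the product A @ p can be replaced by the Hessian-vector approximation discussed below, so the full matrix never has to be formed.

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    # Solve A x = b for a symmetric positive-definite A.
    x = np.zeros_like(b, dtype=float) if x0 is None else x0
    r = b - A @ x                # residual r_k
    p = r.copy()                 # first search direction p_k
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact line search for a quadratic form
        x = x + alpha * p                 # x_{k+1} = x_k + alpha_k * p_k
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # scalar that keeps the next direction conjugate
        p = r_new + beta * p
        r = r_new
    return x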
I shall jump directly to the conjugate gradient algorithm itself. For solving an equation Ax = b, we can use the standard iteration (the original post embedded the listing from Wikipedia as an image; the sketch above captures the same steps): * Here r_k is the residual, * p_k is the conjugate direction vector, and * x_{k+1} is iteratively updated from the previous value x_k plus the product of the step-size alpha_k and the conjugate vector p_k. Given that we know how to compute the Conjugate Gradient, let's look at the Hessian-free optimization technique. HESSIAN-FREE OPTIMIZATION ALGORITHM Now that we have understood the CG algorithm, let's look at the final clever hack that allows us to be free from the Hessian. CITATION: Hessian-free optimization is a technique adapted to Neural Networks by James Martens at the University of Toronto in a paper titled " Deep-Learning Via Hessian Free Optimization ". Let's start with a second-order Taylor expansion of a function: f(x + delta_x) ≈ f(x) + [gradient of f(x)] · delta_x + (1/2) · delta_x · [Hessian of f(x)] · delta_x. Here we need to find the best delta_x, then move to x + delta_x, and keep iterating until convergence. In other words, the steps involved in Hessian-free optimization are as follows: Algorithm: 1. Start with n = 0 and iterate. 2. Let x_n be some initial sub-optimal x_0, chosen randomly. 3. At the current x_n, given the Taylor expansion shown above, compute the gradient of f(x_n) and (implicitly) the Hessian of f(x_n). 4. Given the Taylor expansion, compute the step delta_x using the Conjugate Gradient algorithm and set x_{n+1} = x_n + delta_x. 5. Iterate steps 3–4 until the current x_n converges. The crucial insight: note that unlike in Newton's method, where the Hessian is needed to compute x_{n+1}, in the Hessian-free algorithm we do not need the full Hessian to compute x_{n+1}. Instead we use the Conjugate Gradient. Clever Hack: since the Hessian is only ever used multiplied by a vector, we just need an approximation of the Hessian-times-vector product, and we do NOT need the exact Hessian. Approximating the Hessian times a vector is far faster than computing the Hessian itself. Check the following reasoning. Take a look at the Hessian again: [Figure: Hessian Matrix of a Neural Network] Here, the i'th row contains partial derivatives of the form ∂²f / ∂x_i ∂x_j, where 'i' is the row index and 'j' is the column index. Hence the i'th entry of the product of the Hessian matrix and any vector v is the sum over j of ∂²f / ∂x_i ∂x_j multiplied by v_j. Using directional derivatives and finite differences, we can approximate this as Hv ≈ ([gradient of f(x + εv)] − [gradient of f(x)]) / ε for a small ε. In fact, a thorough explanation and technique for fast multiplication of a Hessian with a vector is available in the paper titled " Fast Exact Multiplication by the Hessian " by Barak A. Pearlmutter from Siemens Corporate Research. With this insight, we can completely skip the computation of the Hessian and just focus on approximating the Hessian-vector product, which tremendously reduces the computation and storage required. To understand the impact of the optimization technique: with this approach, instead of bouncing off the sides of the mountains like in SGD, you can actually move along the slope of the valley until you find a minima in the curvature. This is quite effective for very large Neural Networks or Deep Neural Networks with millions of parameters. Apparently, it's not easy to be a Spy… Machine Learning Artificial Intelligence Deep Learning Neural Networks Hessian Free Optimization
VP, Brand Sciences & Products @inMobi #UltraRunner FollowAUTONOMOUS AGENTS — #AI Notes of Artificial Intelligence and Machine Learning. × Don’t miss Preetham V V’s next story Blocked Unblock Follow Following Preetham V V",Let’s say you have the gift of flight (or you are riding a chopper). You are also a Spy (like in James Bond movies). You are given the…,How to tame the valley — Hessian-free hacks for optimizing large #NeuralNetworks – Autonomous Agents — #AI,Live,59 157,"RStudio Blog * Home * Subscribe to feed READR 1.0.0 August 5, 2016 in Packages readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv() , readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names. Install the latest version with: install.packages(""readr"") Releasing a version 1.0.0 was a deliberate choice to reflect the maturity and stability and readr, thanks largely to work by Jim Hester. readr is by no means perfect, but I don’t expect any major changes to the API in the future. In this version we: * Use a better strategy for guessing column types. * Improved the default date and time parsers. * Provided a full set of lower-level file and line readers and writers. * Fixed many bugs. COLUMN GUESSING The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Now column specifications are printing by default when you read from a file: mtcars2 <- read_csv(readr_example(""mtcars.csv"")) #> Parsed with column specification: #> cols( #> mpg = col_double(), #> cyl = col_integer(), #> disp = col_double(), #> hp = col_integer(), #> drat = col_double(), #> wt = col_double(), #> qsec = col_double(), #> vs = col_integer(), #> am = col_integer(), #> gear = col_integer(), #> carb = col_integer() #> ) The thought is that once you’ve figured out the correct column types for a file, you should make the parsing strict. You can do this either by copying and pasting the printed column specification or by saving the spec to disk: # Once you've figured out the correct types mtcars_spec <- write_rds(spec(mtcars2), ""mtcars2-spec.rds"") # Every subsequent load mtcars2 <- read_csv( readr_example(""mtcars.csv""), col_types = read_rds(""mtcars2-spec.rds"") ) # In production, you might want to throw an error if there # are any parsing problems. stop_for_problems(mtcars2) You can now also adjust the number of rows that readr uses to guess the column types with guess_max : challenge <- read_csv(readr_example(""challenge.csv"")) #> Parsed with column specification: #> cols( #> x = col_integer(), #> y = col_character() #> ) #> Warning: 1000 parsing failures. #> row col expected actual #> 1001 x no trailing characters .23837975086644292 #> 1002 x no trailing characters .41167997173033655 #> 1003 x no trailing characters .7460716762579978 #> 1004 x no trailing characters .723450553836301 #> 1005 x no trailing characters .614524137461558 #> .... ... ...................... .................. #> See problems(...) for more details. 
challenge <- read_csv(readr_example(""challenge.csv""), guess_max = 1500) #> Parsed with column specification: #> cols( #> x = col_double(), #> y = col_date(format = """") #> ) (If you want to suppress the printed specification, just provide the dummy spec col_types = cols() ) You can now access the guessing algorithm from R: guess_parser() will tell you which parser readr will select. guess_parser(""1,234"") #> [1] ""number"" # Were previously guessed as numbers guess_parser(c(""."", ""-"")) #> [1] ""character"" guess_parser(c(""10W"", ""20N"")) #> [1] ""character"" # Now uses the default time format guess_parser(""10:30"") #> [1] ""time"" DATE-TIME PARSING IMPROVEMENTS: The date time parsers recognise three new format strings: * %I for 12 hour time format:library(hms) parse_time(""1 pm"", ""%I %p"") #> 13:00:00 Note that parse_time() returns hms from the hms package, rather than a custom time class * %AD and %AT are “automatic” date and time parsers. They are both slightly less flexible than previous defaults. The automatic date parser requires a four digit year, and only accepts - and / as separators. The flexible time parser now requires colons between hours and minutes and optional seconds.parse_date(""2010-01-01"", ""%AD"") #> [1] ""2010-01-01"" parse_time(""15:01"", ""%AT"") #> 15:01:00 If the format argument is omitted in parse_date() or parse_time() , the default date and time formats specified in the locale will be used. These now default to %AD and %AT respectively. You may want to override in your standard locale() if the conventions are different where you live. LOW-LEVEL READERS AND WRITERS readr now contains a full set of efficient lower-level readers: * read_file() reads a file into a length-1 character vector; read_file_raw() reads a file into a single raw vector. * read_lines() reads a file into a character vector with one entry per line; read_lines_raw() reads into a list of raw vectors with one entry per line. These are paired with write_lines() and write_file() to efficient write character and raw vectors back to disk. OTHER CHANGES * read_fwf() was overhauled to reliably read only a partial set of columns, to read files with ragged final columns (by setting the final position/width to NA ), and to skip comments (with the comment argument). * readr contains an experimental API for reading a file in chunks, e.g. read_csv_chunked() and read_lines_chunked() . These allow you to work with files that are bigger than memory. We haven’t yet finalised the API so please use with care, and send us your feedback. * There are many otherbug fixes and other minor improvements. You can see a complete list in the release notes . A big thanks goes to all the community members who contributed to this release: @ antoine-lizee , @ fpinter , @ ghaarsma , @ jennybc , @ jeroenooms , @ leeper , @ LluisRamon , @ noamross , and @ tvedebrink . 
SHARE THIS: * Reddit * More * * Email * Facebook * * Print * Twitter * * LIKE THIS: Like Loading...RELATED SEARCH LINKS * Contact Us * Development @ Github * RStudio Support * RStudio Website * R-bloggers CATEGORIES * Featured * News * Packages * R Markdown * RStudio IDE * Shiny * shinyapps.io * Training * Uncategorized ARCHIVES * August 2016 * July 2016 * June 2016 * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * April 2015 * March 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * July 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * December 2013 * November 2013 * October 2013 * September 2013 * June 2013 * April 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * September 2012 * August 2012 * June 2012 * May 2012 * January 2012 * October 2011 * June 2011 * April 2011 * February 2011 EMAIL SUBSCRIPTION Enter your email address to subscribe to this blog and receive notifications of new posts by email. Join 19,780 other followers RStudio is an affiliated project of the Foundation for Open Access Statistics LEAVE A COMMENT Comments feed for this article LEAVE A REPLY CANCEL REPLY Enter your comment here...Fill in your details below or click an icon to log in: * * * * * Email (required) (Address never made public) Name (required) WebsiteYou are commenting using your WordPress.com account. ( Log Out / Change ) You are commenting using your Twitter account. ( Log Out / Change ) You are commenting using your Facebook account. ( Log Out / Change ) You are commenting using your Google+ account. ( Log Out / Change ) CancelConnecting to %s Notify me of new comments via email. Notify me of new posts via email. « Don’t miss Hadley Wickham’s Master R Workshop September 12 and 13 in NYCBlog at WordPress.com. Subscribe to feed. FollowFOLLOW “RSTUDIO BLOG” Get every new post delivered to your Inbox. Join 19,780 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:","readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv(), readr is mu…",readr 1.0.0,Live,60 165,"METRICS MAVEN: WINDOW FUNCTIONS IN POSTGRESQL Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published May 10, 2016Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this first article, we'll look at how to use window functions in PostgreSQL. POSTGRESQL WINDOW FUNCTIONS If you use PostgreSQL, you're probably already familiar with many of the common aggregate functions , such as COUNT() , SUM() , MIN() , MAX() , and AVG() . But you may not be familiar with window functions since they're touted as an advanced feature. Window functions aren't nearly as esoteric as they may seem, however. As the name implies, window functions provide a ""window"" into your data, letting you perform aggregations against a set of data rows according to specified criteria that match the current row. 
While they are similar to standard aggregations, there are also additional functions that can only be used through window functions (such as the RANK() function we'll demonstrate below). In some situations window functions can minimize the complexity of your query or even speed up the performance. Make note: window functions always use the OVER() clause so if you see OVER() you're looking at a window function. Once you get used to how the OVER() clause is formatted, where it fits in your queries, and the kind of results you can get, you'll soon start to see lots of ways to apply it. Let's dive in! OVER( ) Depending on the purpose and complexity of the window function you want to run, you can use OVER() all by itself or with a handful of conditional clauses. Let's start by looking at using OVER() all by itself. If the aggregation you want to run is to be performed across all the rows returned by the query and you don't need to specify any other conditions, then you can use the OVER() clause by itself. Here's an example of a simple window function querying a table in our Compose PostgreSQL database containing the United States Census data on estimated population : SELECT name AS state_name, popestimate2015 AS state_population, SUM(popestimate2015) OVER() AS national_population FROM population WHERE state � Notice that we're using a window function to sum the state populations over all the result rows (that's the OVER() you see in our query... yep, just that one little addition to an otherwise standard query). Returned, we get result rows for each state and their populations with also the population sum for the nation - that's the aggregation we performed with our window function: state_name | state_population | national_population -------------------------------------------------------------- Alabama | 4858979 | 324893002 Alaska | 738432 | 324893002 Arizona | 6828065 | 324893002 Arkansas | 2978204 | 324893002 California | 39144818 | 324893002 Colorado | 5456574 | 324893002 Connecticut | 3590886 | 324893002 Delaware | 945934 | 324893002 District of Columbia | 672228 | 324893002 Florida | 20271272 | 324893002 . . . . Consider how this compares to standard aggregation functions. Without the window function, the simplest thing we could do is return the national population by itself, like this, by summing the state populations: SELECT SUM(popestimate2015) AS national_population FROM population WHERE state � The problem is, we don't get any of the state level information this way. To get the same results as our window function, we'd have to do a sub-select as a derived table: SELECT name AS state_name, popestimate2015 AS state_population, x.national_population FROM population, ( SELECT SUM(popestimate2015) AS national_population FROM population WHERE state > 0 -- only state-level rows ) x WHERE state � Looks ugly in comparison, doesn't it? Using window functions, our query is much less complex and easier to understand. CONDITION CLAUSES In the above example, we looked at a simple window function without any additional conditions, but in many cases, you'll want to apply some conditions in the form of additional clauses to your OVER() clause. One is PARTITION BY which acts as the grouping mechanism for aggregations. The other one is ORDER BY which orders the results in the window frame (the set of applicable rows). 
So, besides the format of the returned rows as we reviewed above, the other obvious difference with window functions is how the syntax works in your queries: use the OVER() clause with an aggregate function (like SUM() or AVG() ) and/or with a specialized window function (like RANK() or ROW_NUMBER() ) in your SELECT list to indicate you're creating a window and apply additional conditions as necessary to the OVER() clause, such as using PARTITION BY (instead of the GROUP BY you may be used to for aggregation). Let's look at some specific examples. PARTITION BY PARTITION BY allows us to group aggregations according to the values of the specified fields. In our census data for estimated population, each state is categorized according to the division and region it belongs to. Let's partition first by region: SELECT name AS state_name, popestimate2015 AS state_population, region, SUM(popestimate2015) OVER(PARTITION BY region) AS regional_population FROM population WHERE state � Now we can see the population sum by region but still get the state level data: state_name | state_population | region | regional_population ------------------------------------------------------------------------- Alabama | 4858979 | South | 121182847 Alaska | 738432 | West | 76044679 Arizona | 6828065 | West | 76044679 Arkansas | 2978204 | South | 121182847 California | 39144818 | West | 76044679 Colorado | 5456574 | West | 76044679 Connecticut | 3590886 | Northeast | 56283891 Delaware | 945934 | South | 121182847 District of Columbia | 672228 | South | 121182847 Florida | 20271272 | South | 121182847 . . . . Let's add division: SELECT name AS state_name, popestimate2015 AS state_population, region, division, SUM(popestimate2015) OVER(PARTITION BY division) AS divisional_population FROM population WHERE state � Now we're looking at state-level data, broken out by region and division, with a population summary at the division level: state_name | state_population | region | division | divisional_population ------------------------------------------------------------------------------------------- Alabama | 4858979 | South | East South Central | 18876703 Alaska | 738432 | West | Pacific | 52514181 Arizona | 6828065 | West | Mountain | 23530498 Arkansas | 2978204 | South | West South Central | 39029380 California | 39144818 | West | Pacific | 52514181 Colorado | 5456574 | West | Mountain | 23530498 Connecticut | 3590886 | Northeast | New England | 14727584 Delaware | 945934 | South | South Atlantic | 63276764 District of Columbia | 672228 | South | South Atlantic | 63276764 Florida | 20271272 | South | South Atlantic | 63276764 . . . . ORDER BY As you've probably noticed in the previous queries, we're using ORDER BY in the usual way to order the results by the state name, but we can also use ORDER BY in our OVER() clause to impact the window function calculation. For example, we'd want to use ORDER BY as a condition for the RANK() window function since ranking requires an order to be established. Let's rank the states according to highest population: SELECT name AS state_name, popestimate2015 AS state_population, RANK() OVER(ORDER BY popestimate2015 desc) AS state_rank FROM population WHERE state � In this case, we've added ORDER BY popestimate2015 desc as a condition of our OVER() clause in order to describe how the ranking should be performed. 
Because we still have our ORDER BY name clause for our result set, though, our results will continue to be in state name order, but we'll see the populations ranked accordingly with California as the number 1 ranked based on its population: state_name | state_population | state_rank ----------------------------------------------------- Alabama | 4858979 | 24 Alaska | 738432 | 49 Arizona | 6828065 | 14 Arkansas | 2978204 | 34 California | 39144818 | 1 Colorado | 5456574 | 22 Connecticut | 3590886 | 29 Delaware | 945934 | 46 District of Columbia | 672228 | 50 Florida | 20271272 | 3 . . . . Let's combine our PARTITION BY and our ORDER BY window function clauses now to see the ranking of the states by population within each region. For this, we'll change our result-level ORDER BY name clause at the end to order by region instead so that it'll be clear how our window function works: SELECT name AS state_name, popestimate2015 AS state_population, region, RANK() OVER(PARTITION BY region ORDER BY popestimate2015 desc) AS regional_state_rank FROM population WHERE state � Our results: state_name | state_population | region | regional_state_rank ---------------------------------------------------------------------- Illinois | 12859995 | Midwest | 1 Ohio | 11613423 | Midwest | 2 Michigan | 9922576 | Midwest | 3 Indiana | 6619680 | Midwest | 4 Missouri | 6083672 | Midwest | 5 Wisconsin | 5771337 | Midwest | 6 Minnesota | 5489594 | Midwest | 7 Iowa | 3123899 | Midwest | 8 Kansas | 2911641 | Midwest | 9 Nebraska | 1896190 | Midwest | 10 South Dakota | 858469 | Midwest | 11 North Dakota | 756927 | Midwest | 12 New York | 19795791 | Northeast | 1 Pennsylvania | 12802503 | Northeast | 2 New Jersey | 8958013 | Northeast | 3 . . . . Here we can see that Illinois is the number 1 ranking state by population in the Midwest region and New York is number 1 in the Northeast region. So, we combined some conditions here, but what if we need more than one window function? Read on... NAMED WINDOW FUNCTIONS In queries where you are using the same window function logic for more than one returned field or where you need to use more than one window function definition, you can name them to make your query more readable. Here's an example where we've defined two windows functions. One, named ""rw"", partitions by region and the other, named ""dw"", partitions by division. We're using each one twice - once to calculate the population sum and again to calculate the population average. Our windows functions are defined and named using the WINDOW clause which comes after the WHERE clause in our query: SELECT name AS state_name, popestimate2015 AS state_population, region, SUM(popestimate2015) OVER rw AS regional_population, AVG(popestimate2015) OVER rw AS avg_regional_state_population, division, SUM(popestimate2015) OVER dw AS divisional_population, AVG(popestimate2015) OVER dw AS avg_divisional_state_population FROM population WHERE state � Since we didn't do any manipulation on the averages values yet, the numbers look a little crazy, but that can be easily cleaned up using ROUND() and CAST() if need be. Our purpose here is to demonstrate how to use multiple window functions and the results you'll get. 
Check it out: state_name | state_population | region | regional_population | avg_regional_state_population | division | divisional_population | avg_divisional_state_population ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Alabama | 4858979 | South | 121182847 | 7128402.764705882353 | East South Central | 18876703 | 4719175.750000000000 Alaska | 738432 | West | 76044679 | 5849590.692307692308 | Pacific | 52514181 | 10502836.200000000000 Arizona | 6828065 | West | 76044679 | 5849590.692307692308 | Mountain | 23530498 | 2941312.250000000000 Arkansas | 2978204 | South | 121182847 | 7128402.764705882353 | West South Central | 39029380 | 9757345.000000000000 California | 39144818 | West | 76044679 | 5849590.692307692308 | Pacific | 52514181 | 10502836.200000000000 Colorado | 5456574 | West | 76044679 | 5849590.692307692308 | Mountain | 23530498 | 2941312.250000000000 Connecticut | 3590886 | Northeast | 56283891 | 6253765.666666666667 | New England | 14727584 | 2454597.333333333333 Delaware | 945934 | South | 121182847 | 7128402.764705882353 | South Atlantic | 63276764 | 7030751.555555555556 District of Columbia | 672228 | South | 121182847 | 7128402.764705882353 | South Atlantic | 63276764 | 7030751.555555555556 Florida | 20271272 | South | 121182847 | 7128402.764705882353 | South Atlantic | 63276764 | 7030751.555555555556 . . . . Now that's an informative report of population metrics... and window functions made it easy! WRAPPING UP This article has given you a glimpse of the power of PostgreSQL window functions. We touched on the benefits of using window functions, looked at how they are different (and similar) to standard aggregation functions, and learned how to use them with various conditional clauses, walking through examples along the way. Now that you can see how window functions work, start trying them out by replacing standard aggregations with window functions in your queries. Once you get the hang of them you'll be hooked. In our next article we'll look at window framing options in PostgreSQL to give you even more control over how your window functions behave. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Deployments AWS DigitalOcean SoftLayer© 2016 Compose","If you use PostgreSQL, you're probably already familiar with many of the common aggregate functions, such as COUNT(), SUM(), MIN(), MAX(), and AVG(). But you may not be familiar with window functions since they're touted as an advanced feature. Window functions aren't nearly as esoteric as they may seem, however. 
Let's dive in!",Metrics Maven: Window Functions in PostgreSQL,Live,61 167,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe×BLOGSDATA VISUALIZATION PLAYBOOK: THE IMPORTANCE OF EXCLUDING UNNECESSARY DETAILSPost Comment December 2, 2015 by Jennifer Shin Topics: Big Data Technology Tags: big data , data analytics , data science , data scientist , data visualization , visualizationsAs the big data revolution gathers momentum, data scientists are working withlarger data sets than ever before—a trend that shows no sign of abating. Butwith ever larger data sets comes the temptation to include ever moreinformation, representing the data in all its glorious detail. Who, after all,can resist the temptation to flex some intellectual muscles by mastering trulycomplex data?But as tempting as visualizing every last detail can be, doing so can erectbarriers to understanding. A visualization that includes unnecessary informationcan overwhelm readers, obscuring the message and leaving its audience confused.Let’s explore a real-world scenario, stepping through the thought process thatgoes into designing an effective data visualization.BREAK DOWN THE DATAA foundation set up to fund environmental projects published an overview of thefunding it provided during 2013. In its original form, shown in Figure 1, thereport included a pie chart showing the distribution of grants across a range ofenvironmental issues.Figure 1: The share of funding distributed to 17 environmental issues during2013.EVALUATE YOUR VISUALIZATION’S USABILITYA first glance at the pie chart reveals nothing wildly amiss. The chartrepresents the data simply and directly, breaking down the distribution offunding in its legend. But a c loser look reveals certain flaws that can impede understanding: * Excluded information The chart supplies an exact figure for only 7 of the 17 issues—specifically, only for issues that received at least 7 percent of the overall funding. * Cumbersome design The many slices in the pie chart distract from the larger facts, requiring readers to match the color of each slice with a color in the legend to identify the issue described. * Confusing presentation The choice of colors does little to differentiate issue areas—for example, a reader could easily mistake “Air Quality” for “Rivers and Lakes” or fail to differentiate “Populations” from “Wildlife Biodiversity.”Figure 2: The level of funding distributed to each environmental issue during2013, both in dollars and as a percentage of total funding.FIND THE FOREST IN THE TREESThe organization intended the visualization to provide an overview of issueareas funded during 2013. T o boost the overview’s effectiveness, the designer grouped environmental issuesinto five categories, as depicted in Figure 2: “Environmental Policy,” “Climateand Energy,” “Natural Resources,” “Preservation and Biodiversity” and“Sustainable Development.” The designer then redesigned the pie chart around thenew categories, grouping slices as shown in Figure 3.Figure 3: The share of funding distributed to each issue during 2013, withindividual issues delineated but grouped into colored categories.PROVIDE A QUANTITATIVE OVERVIEWBut the visualization still contained unnecessary information. 
The designerstreamlined the legend as shown in Figure 4a, emphasizing the categories anddispensing with a complete list of issues. To further emphasize the categories,the designer removed the lines demarcating individual issues and supplied thepercentage of funding distributed to each category, as shown in Figure 4b. By categorizing and unifying individual issues, the new visualization providedan effective overview featuring quantitative information.Figure 4a: The share of funding distributed to each category during 2013 , with individual issues delineated but grouped into colored categories .Figure 4b: The percentage of funding distributed to each category during 2013.CREATE NEW LEVELS OF INSIGHTAfter streamlining the pie chart, the designer introduced a new level ofanalysis, segmenting the data by global region and creating pie charts to showthe worldwide distribution of grants. To obviate the need for another legend,the designer superimposed the pie charts on a world map, as shown in Figure 5.Figure 5: The regional share of funding distributed to each category within eachglobal region during 2013.DESIGN FOR YOUR AUDIENCEBefore you create a data visualization, tailor your message to your audience.Don’t overwhelm your audience with data, but also take care not to render thedata useless through oversimplification. You’ll want to create one kind ofvisualization when presenting to experts in the field, for example, but anotherwhen giving a high-level overview to a general audience. To learn more, d iscover how the IBM advanced analytics portfolio can help you find patterns in and derive insights from your data through visualexploration.Follow @IBMBigDataRELATED CONTENTPODCASTHOW IS OPEN SOURCE TRANSFORMING STREAMING ANALYTICS?Open source is a disruptor that never quits. It seems to be penetrating andtransforming every aspect of established data, analytics and applicationecosystems. In this podcast, recorded at IBM InterConnect 2016, listen to DavidTaieb, a cloud data services developer advocate at IBM, share his... Listen to Podcast Podcast Becoming a cognitive business Podcast InsightOut: Leveraging metadata and governance Blog What is Spark? Blog Internet of Things data access and the fear of the unknown Blog Spark: The operating system for big data analytics Blog Graph databases catch electronic con artists in the act Blog InsightOut: Metadata and governance Blog New IBM DB2 release simplifies deployment and key management Podcast How is open source transforming graph analytics? Blog What is Hadoop? 
Blog The rise of NoSQL databases Blog Bridging Spark analytics to cloud data servicesView the discussion thread.IBM * Site Map * Privacy * Terms of Use * 2014 IBMFOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesMore * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesSearchEXPLORE BY TOPIC:Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Content Analytics Customer Analytics Entity Analytics Financial Performance Management Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Blog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analyticsMOREBlog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Interactive All data all the time: How mobile technology informs travelerhabits Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Blog The secret to enhancing customer engagement Blog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa BodellMOREBlog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Podcast How is open source transforming streaming analytics? 
Blog Big data in healthcare: The secret to calculating total cost of care Podcast InsightOut: Leveraging metadata and governance Blog 6 simple ways to help fight crime with analytics Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive All data all the time: How mobile technology informs travelerhabitsMOREBlog 6 simple ways to help fight crime with analytics Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive All data all the time: How mobile technology informs travelerhabits Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Blog How to protect our PII and sensitive information from fraud Blog Big data in healthcare: The secret to calculating total cost of care Interactive Cognitive business starts with analytics Blog The secret to enhancing customer engagement Podcast How is open source transforming streaming analytics? Blog Big data in healthcare: The secret to calculating total cost of careMOREInteractive Cognitive business starts with analytics Blog The secret to enhancing customer engagement Podcast How is open source transforming streaming analytics? Blog Big data in healthcare: The secret to calculating total cost of care Podcast Becoming a cognitive business Podcast InsightOut: Leveraging metadata and governance Blog The LED lighting revolution * Home * Explore By Topic * Use Cases * All * Acquire, Grow & Retain Customers * Create New Business Models * Improve IT Economics * Manage Risk * Optimize Operations & Reduce Fraud * Transform Financial Processes * Industries * All * Banking * Consumer Products * Education * Energy & Utilities * Government * Healthcare & Life Sciences * Industrial * Insurance * Media & Entertainment * Retail * Telecommunications * Analytics * All * Content Analytics * Customer Analytics * Entity Analytics * Social Media Analytics * Technology * All * Business Intelligence * Cloud Database * Data Governance * Data Warehouse * Database Management Systems * Data Science * Hadoop & Spark * Internet of Things * Predictive Analytics * Streaming Analytics * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chat * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Big Data & Analytics Heroes * For Developers * Events * Upcoming Events * Webcasts * Twitter Chat * Meetups * Around The Web * About Us * Contact Us * Search Site",Find out how including too much information can neutralize your data visualization—and your message with it.,Data visualization: The importance of excluding unnecessary details,Live,62 168,"PostgreSQL is a powerhouse of SQL driven database power and Compose's PostgreSQL is all that with the power of Compose's cloud deployments. But before you can harness that power you need to create users to access your database. In this article, we're going to show you the quick way to do that and then introduce you to one of PostgreSQL's powerful tools. Let's begin, right after you've created your first PostgreSQL database.When you create a PostgreSQL deployment, there's only one role that has been created for the database and that's the admin role. ""Wait a minute"" you may be thinking ""I want a user not a role"". 
PostgreSQL has pulled in all the concepts of users, groups and permissions and turned them into one concept, roles. Roles can represent one or many users, one role can grant another role membership to grant privileges, roles can own tables or other database objects. Roles are PostgreSQL's swiss army knife of access control.Now the admin role that is created is not the database superuser. Thats a more restricted account that isn't remotely accessible. The admin role is, basically, the first user Compose creates on the PostgreSQL database and it has permission to create new databases and create new roles. This user role is also there so essential maintenance can be performed by Compose's own automated processes. You can use the admin role as your sole login to the database, or you can create your own roles. Let's first look at the admin role.When you look at your overview page for the database you'll see that information in the Connection info panel under Credentials. Or rather you won't because by default it is obscured.Below the credentials are the connection string you use to connect applications to your database and the command line you can use, if you have psql installed, to create an interactive command line into the database. Notice all the sensitive information is hidden in these too by marking where you would substitute a username or password.When you click the Show link – you'll be prompted for your Compose account password before anything is shown – the credentials, connection string and command line will all be populated with the admin role's credentials. You can use these as is in your applications if you wish. Make a mental note of the admin password then click the Hide link which appeared where the Show link to obscure that information.You may, though, want to create more roles to be used for particular purposes. It's not necessary to create a role-per-user or role-per-application. Experience tells us that the more roles you create, the more complex your access control will be and the harder to manage. With that in mind we suggest that you create one or two roles at most with appropriate capabilities and use them. The quickest way to create one of those roles is to use the data browser. click on the Browser button in the sidebar.We've talked about the data browser before, but since that article was published, its gained the ability to let you add and remove roles. The first thing the browser shows you are the databases currently configured and by default on Compose, your first database is called compose. You can create more databases here by clicking on Create Database in the top right, but for now we're going to work with the default compose database so click on that in the database list.And now we're into the table level view of the browser, or rather not, because this database doesn't have any tables yet. We'll get back to that. Right now we want to create a role, and we can do this by clicking on the Roles option in the sidebar.But, yes, we just want a user and if you look, you can see we are viewing Users, roles which are configured as database users. And there in the list is our admin user. You can see the power that user has because listed there are the three roles it is a member of - login so admin can log in, createdb so it can create databases and createrole so it can create new users and roles. 
We can make a user of our own here by clicking on Add User in the top right.This is where we can add user roles; we just fill in the blanks in the command line, so if we're making a user called fred and giving them a password 'drowssap' we put those in the appropriate fields.If we press Add role now though, the created role would just be an unprivileged user who could not login. Click on the Login button to add that privilege to the user. If you want the user to be able the create databases and roles, click the appropriate buttons too but remember in the world of privileges, less privilege usually means more security. When you are done, press Add role and the new user will be created. Apart from dropping the user, thats all you can do with the database browser but its enough to create new users. The next stage will be to connect to the database using the new user we created.PostgreSQL's command line is called psql and you probably don't have it as it is usually bundled with the PostgreSQL database system. This is a common occurrence with database software and it means you'll have to download and install the database software locally to get the official tools – the important part is you shouldn't run the database itself. You can find where to download binaries of PostgreSQL on the project's website. For discussion purposes, we'll set up Mac OS X. If you look on the Mac OS X packages page you'll find a number of options. There's an EnterpriseDB graphical installer and the Postgres.app GUI installer; we'll skip them as they are very much about getting the database itself running. There's also packages in the Fink, MacPorts and Homebrew package managers. At Compose, we're big fans of Homebrew because it works so well. You'll need to install Homebrew first, follow the instructions on the home page. Once that's installed, just run brew install postgresql...Most of the text there relates to how to configure the database server to start up so we can ignore that. At the end of the process, what's important is the psql binary is installed in /usr/local/bin. Now you can get to connecting. Head back to the Overview for your Compose PostgreSQL database and look for the Command Line connection string. Use that entire line, substituting in your new username. You'll be prompted for your password and then connected:Congratulations, you've just plugged into one of the most powerful command line tools for any database. At its most basic, you can type in Postgres SQL commands and see them executed. Let's create a table:The command like doesn't consider a command complete until it is terminated with a correctly placed semi-colon. The thing to keep an eye on it the prompt, specifically the character after the database name and before the >. When it's = it is the start of a new command. When it's - it means that this is a continuation of the previous line and when it's ( it means that what's been typed so far has opened a parentheses, but not yet closed it so a semi-colon entered before closing the parentheses would be an error.Here we are creating the table rockets and we open the parentheses on the first line then hit return. Then we enter the columns, with commas as separators – we could type these all on one line but this is easier to type and read back. Finally we close the parentheses and end the statement with a semicolon. Psql then echos back the type of command that has been run (or displays and error) and returns to the = prompt. 
We can then insert some data into our new table:The last number after the INSERT reflect the number of rows inserted (usually). Of course, we can select to get that information back. Here we'll break up the command over a couple of lines, because we can:If you want the full list of commands available, enter \h to list all the SQL commands and follow the \h with the name of a command to get further details on it. It's useful help, but remember the PostgreSQL documentation is also a useful companion available online or offline as A4 or US sized PDFs.But it's not only SQL commands that can be entered into psql. It has its own rich command set too. To list those commands type \? - all the psql commands are preceded with a backslash. One of the essential commands is \d which will tell you about the objects and their relations in your database. Without any parameters it'll tell you about tables like so:Give it a parameter like the name of a table and it will give you information about the columns and indexes of that table:There's a huge number of psql commands you may want to put to use. The \e command will call up vi or the editor set in the environment variable ""PSQL_EDITOR"" and let you edit the last SQL command. \i will read and execute commands from a file. \s will display your command history and yes you can cursor up and down through that history. \w lets you write the query buffer (where the last command is saved) to disk. One favorite is \watch which will repeat the last query every two seconds – follow it with a number of seconds to adjust that. We could spend an entire article looking at the applications and uses of the psql command set. Suffice to say for now it is extensive and very useful.We've shown you how to create users quickly on Compose PostgreSQL and how to use those new users with PostgreSQL's powerful command line. If you think that's powerful, just wait till you see what you can do with the database itself!",PostgreSQL is a powerhouse of SQL-driven database power and Compose's PostgreSQL is all that with the power of Compose's cloud deployments. But before you can harness that power you need to create users to access your database.,Compose PostgreSQL: Making users and more,Live,63 169,"* About * Services * Portfolio * Teaching * Blog * Contact PREDICTING GENTRIFICATION USING LONGITUDINAL CENSUS DATA By Ken SteifAuthors: Ken Steif, Alan Mallach, Michael Fichman, Simon Kassel Figure 1: A mockup of a web-based, community-oriented gentrification forecasting applicationRecently, the Urban Institute called for the creation of “neighborhood-level early warning and response systems that can help city leaders and community advocates get ahead of (neighborhood) changes.” Open data and open-source analytics allows community stakeholders to mine data for actionable intelligence like never before. The objective of this research is to take a first step in exploring the feasibility of forecasting neighborhood change using longitudinal census data in 29 Legacy Cities (Figure 2). The first section provides some motivation for the analysis. Section 2 discusses the feature engineering and machine learning process. Section 3 provides results and the final section concludes with a discussion of community-oriented neighborhood change forecasting systems. Figure 2: Legacy cities used in this analysisWhy forecast gentrification? Neighborhoods change because people and capital are mobile and when new neighborhood demand emerges, incumbent residents rightfully worry about displacement. 
Acknowledging these economic and social realities, policy makers have a responsibility balance economic development and equity. To that end, analytics can help us understand how the wave of reinvestment moves across space and time and how to pinpoint neighborhoods where active interventions are needed today in order to avoid negative outcomes in the future. While the open data movement and open source software like Carto and R lower costs associated with community analytics, time series parcel-level data is expensive to collect, store and analyze. Census data is ubiquitous however, and many non-profits are well-versed in technologies like the Census’ American FactFinder and The Reinvestment Fund’s PolicyMap . Thus, it seems reasonable to develop forecasts using these data before building comparable models using the more expensive, high resolution space/time home sale data. The goal here is to use 1990 and 2000 Census data on home prices to predict home prices in 2010. If those models prove robust, we can use the model to forecast for 2020. Endogenous gentrification The key to our forecasting methodology is the conversion of Census tract data into useful ‘features’ or variables that help predict price. Our empirical approach is inspired by the theory of ‘endogenous gentrification’ – a theory of neighborhood change which suggests that low-priced neighborhoods adjacent to wealthy ones have the highest probability of gentrifying in the face of new housing demand. Typically, urban residents trade off proximity to amenities with their willingness to pay for housing. Because areas in close proximity to the highest quality amenities are the least affordable, the theory suggests that gentrifiers will choose to live in an adjacent neighborhood within a reasonable distance of an amenity center but with lower housing costs. As more residents move to the adjacent neighborhood, new amenities are generated and prices increase which means that at some point, the newest residents are going to settle in the next adjacent neighborhood and so on. This space/time process resembles a wave of investment moving across the landscape. Our forecasting approach attempts to capture this wave by developing a series of spatially endogenous home price features. The models attempt to trade off these micro-economic patterns with macro-economic trends that face many of the Legacy Cities in our sample. Principal among these is the Great Recession of the late 2000s. Of equal importance is the fact that gentrification affects only a small fraction of neighborhoods. As our previous research has demonstrated, neighborhood decline is still the predominant force in U.S. Legacy cities . Featuring Engineering Our dataset consists of 3,991 Census tracts in 29 Legacy Cities from 1990, 2000 and 2010. The data originates from the Neighborhood Change Database (NCDB) which standardizes previous Decennial Census surveys into 2010 geographical boundaries allowing for repeated measurements for comparable neighborhoods over time. While standardizing tract geographies over time is certainly convenient, it does not account for the ecological fallacy nor deal with the fact that tracts rarely comprise actual real estate submarkets. Figure 3 plots the distribution of Median Owner-Occupied Housing Value for 1990, 2000 & 2010 for the 29 cities in our sample. There is no clear global price trend in our sample. Some cities see price increases, some see decreases and others don’t change at all. 
Figure 3: Median Owner-Occupied Housing Value by cityBetween census variables and those of our own creation, our dataset consists of nearly 200 features or variables that we use to predict price. We develop standard census demographic features as well endogenous features that explain price as a function of nearby prices and other economic indicators like income. There are three main statistical approaches we take to develop these features. The simplest of our endogenous price features is the ‘spatial lag’, which for any given census tract is the simply the average price of tracts that surround it (Figure 4). Figure 5 shows the correlation between the spatial lag and price for the cities in our sample. Figure 4: The spatial lag Figure 5: Price as a function of spatial lagOur second endogenous price feature is one which measures proximity to high-cost areas. Here we create an indicator for the highest priced and highest income tracts for each city in each time period and calculate the average distance in feet from each tract to its n nearest 5th quintile neighbors in the previous time period. The motivation here is to capture emerging demand in the ‘next adjacent’ neighborhood over time (Figure 6). Figure 7 shows the correlation with price. Figure 6: Distance to highest value tract in the previous time period Figure 7: Price as a function of its distance to highest value tract in the previous time periodOur third endogenous price predictor attempts to capture the local spatial pattern of prices for a tract and its adjacent neighbors. As previously mentioned, to be robust, our algorithm must trade-off global trends with local neighborhood conditions. There are three local spatial patterns of home prices that we are interested in: clustering of high prices; clustering of low prices; and spatial randomness of prices. Local clustering of high and low prices suggests that housing market agents agree on equilibrium prices in a given neighborhood. Local randomness we argue, is indicative of a changing neighborhood – one that is out of equilibrium. A similar approach was used for a previous project, predicting vacant land prices in Philadelphia. In a changing neighborhood, buyers and sellers are unable to predict the value of future amenities. Our theory argues that when this uncertainty is capitalized into prices, the result is a heterogeneous pattern of prices across space. Capturing this spatial trend is crucial for forecasting neighborhood change. To do so we develop a continuous variant of the one-sided Local Moran’s I statistic. Assume that the dots in Figure 8 below represent home sale prices for houses or tracts. The homogeneous prices in the left panel are indicative of an area in equilibrium where all housing market agents agree on future expectations. Our Local Moran’s I feature of this area would indicate relative clustering. Figure 8: Equilibrium and disequilibrium marketsConversely, the panel on the right with more heterogeneous prices, is more indicative of a neighborhood in flux – one where housing market agents are capitalizing an uncertain future into prices. In this case, the Local Moran’s I feature would indicate a spatial pattern closer to randomness. We find this correlation in many of the cities in our sample as illustrated in Figure 9. Figure 9: Price as a function of the Local Moran’s I p-valueResults A great deal of time was spent on feature engineering and feature selection. 
We employ four primary machine learning algorithms, Ordinary Least Squares (OLS), Gradient Boosting Machines (GBM), Random Forests, and an ensembling approach that combines all three. You can find more information on these models in our paper which is linked below. Our models are deeply dependent on cross-validation , ensuring that goodness of fit is based on data that the model has not seen. Although we estimate hundreds of models, Table 1 presents (out of sample) goodness of fit metrics for our four best – each an example of one of the four predictive algorithms. The “MAPE” or mean absolute percentage error, is the absolute value of the average error (the difference between observed and predicted prices by tract) represented as a percentage which allows for a more consistent way to describe model error across cities. Table 1: Goodness of fit metrics for four modelsThe Standard Deviation of R-Squared measures over-prediction. Using cross-validation, each time the model is estimated with another set of randomly drawn observations, we can record goodness of fit. If the model is truly generalizable to the variation in our Legacy City sample, then we should expect consistent goodness of fit across each permutation. If the model is inconsistent across each permutation, it may be that the goodness of fit is driven solely by individual observations drawn at random. This latter outcome might indicate overfitting. Thus, this metric collects R^2 statistics for each random permutation and then uses standard deviation to assess whether the variation in goodness of fit across each permutation is small (ie. generalizability) or a large (ie. overfitting). Figure 10 shows the predicted prices as a function of observed prices for all tracts in the sample. If predictions were perfect, we would expect the below scatterplots to look like straight lines. The obvious deviation from would-be straight lines is much greater for the OLS and GBM models then for the random forest and stacked ensemble models. These models loose predictive power for higher priced tracts. Figure 10: Predicted prices as a function of observed prices for all tractsFigures 11-14 display observed vs. predicted prices by city for each of the predictive algorithms. Again, OLS and Random Forests loose predictive power for high priced tracts. However, when predictions are displayed in this way, the GBM and Ensemble predictions appear quite robust for most of the cities. Figure 11: OLS predicted prices as a function of observed prices for tracts by city Figure 12: GBM predicted prices as a function of observed prices for tracts by city Figure 13: Random Forest predicted prices as a function of observed prices by city Figure 14: Ensemble predicted prices as a function of observed prices by cityFinally, Figures 15 and 16 display the MAPE (error on a percentage basis) by City in bar chart and map form respectively. The highest error rates that we observe at the city-level is around 13.5% and the smallest is around 4%. One important trend to note is that we achieve ~8% errors for many of the larger, post-industrial cities. In addition, it does not seem as though there is an observable city-by-city pattern in error. That is, the model is not biased toward smaller cities or larger ones or those with booming economies. This is evidence that our final model is generalizable to a variety of urban contexts. 
Figure 15: MAPE by City Figure 16: MAPE by City in map formFinally, Figure 17 illustrates for Chicago, the 2010 predictions generated for each of the four algorithms along with the observed 2010 median owner-occupied home prices. Despite the stacked ensemble predictions having the lowest amount of error, it still appears to underfit for the highest valued tracts (Panel 4, Figure 17). This occurs for three reasons. First, the census data is artificially capped at $1 million dollars which creates artificial outlying “spikes” in the data, that, despite our best efforts, we were unable to model in the feature engineering process. Second, as previously mentioned, the predominant pattern over time is decline not gentrification. Thus, it is difficult for the model, at least in Chicago, to separate a very local phenomenon like gentrification, from a more global phenomenon like decline. Because all cities are modeled simultaneously, these predictions are also weighted not only by the Chicago trend, but by the trend throughout the sample. Finally, and this is probably the most important issue, our time serious has just two preceding time periods to use as predictors while neighborhood trajectories are clearly more fluid. Figure 17: Predicted prices for four algorithms and observed 2010 prices, ChicagoThe implications of this under-prediction in cities like Chicago is that our forecasts in these cities will also under-predict. While we choose to illustrate Chicago, many cities do not in fact under-predict. This is evident in their 2020 forecasts as seen in Figure 18 below. The next step is to rerun our models using 2000 and 2010 to forecast for 2020. Figure 18 shows the results of these forecasts in barplot form by city, alongside observed 1990, 2000 and 2010 prices. Many cities including Baltimore, Chicago, Cleveland, Detroit, Minneapolis, Newark, Philadelphia, show a marginal increase in price forecasts for 2020. Others, such as Baltimore, Boston, Jersey City and Washington D.C. do not, despite the fact that anecdotally, we might expect them to. Figure 19 shows tract-level predictions for three cities. With this under-prediction notwithstanding, we still are quite pleased with how much predictive power we could mine from these data. As we discuss below, there is a real upside to replicating this model on sales-level data. Figure 18: Time series trend with predictions by City Figure 19: Tract-level forecasts in 3 citiesNext steps It appears that endogenous spatial features combined with modern machine learning algorithms can help predict home prices in American Legacy Cities using longitudinal census data – with caveats as mentioned above. As previous work has shown however, this approach is really powerful when using parcel-level time series sales data. This insight motivates what we think are some important next steps with respect to the development of neighborhood change early warning systems. First, check out this phenomenal paper recently published in HUD’s Cityscape journal entitled “Forewarned: The Use of Neighborhood Early Warning Systems for Gentrification & Displacement” by Karen Chapple and Mariam Zuk. The authors raise two important points. The first, is that existing early warning systems are not doing a great job on the predictive analytics side. We think that many of these deficiencies could be addressed as more Planners become versed in machine learning techniques including how to build useful features like the endogeneous gentrification variables described above. 
The second critical point that Chapple and Zuk raise is that “Little is understood, however, about precisely how stakeholders are using the systems and what impact those systems have on policy.” A UX/UI engineer might restate this question by asking, “What are the use cases? Why would someone use such a system?” The point is that no matter how well the model performs, if insights cannot be converted into equity and real policy, then predictive accuracy is meaningless. Here are some suggestions about how the next generation of forecasting tools could look: Figure 20: An example of a gentrification early warning system using event-based forecasts. First, instead of modeling data for 29 cities at once, consider a model built for one city, using consecutive years of parcel-level data. Second, alongside a continuous outcome like price, consider modeling a series of development-related events like new construction permits, rehab permits and evictions. Event-based forecasting could, for instance, predict the probability of (re)development for each parcel citywide. These probabilities could help the government and non-profit sectors better allocate their limited resources. This helps us get a better sense of the most appropriate use cases, like, “Where should we build our next affordable housing development?” A tool like the one shown in Figure 20 would allow equity-driven organizations to strategically plan future development and redevelopment opportunities as well as better manage the existing stock of affordable housing in the neighborhood. It would also help state finance agencies better target tax credits, and aid planning and zoning boards in understanding the effect that zoning variances might have on future development patterns. If one were to combine these predictive price and event-based algorithms into one information system, fueled predominantly by preexisting city-level open data, the potential value-added for community organizations, government and grant-making institutions would be immense. Conclusion: This report experimented with using longitudinal census data to predict home prices in 29 American Legacy Cities. Our motivation was that if we could develop a robust model, the results could help community stakeholders better allocate their limited resources. Our training models use 1990 and 2000 data to predict for 2010, yielding an average prediction error of just 14% across all tracts. When this error is considered on a city-by-city basis, the median error is around 8%. It is important to note that our endogenous feature approach does not overfit the model. We think these results are admirable given the limitations of our time series; the fact that our unit of analysis, census tracts, rarely if ever conforms to true real estate submarkets; and the fact that neighborhood decline is still the predominant dynamic in these Legacy Cities. The greatest weakness of our model is that the limited time series is the likely driver of under-prediction in some cities, which affects our 2020 forecasts. Our major methodological contribution is the adoption of endogenous gentrification theory in the development of spatial features that are effective for predicting prices in a machine learning context without overfitting. We believe that this approach can and should be extended to parcel data, using both continuous outcomes like prices and event-based outcomes like development.
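To make the event-based extension concrete, here is a minimal sketch, under assumed inputs rather than anything from the report, of forecasting a per-parcel probability of (re)development with a gradient-boosted classifier. The parcel table, feature columns, and the permit_next_year label are hypothetical.

```python
# Minimal sketch of the event-based idea; not code from the report. It
# assumes a parcel-level DataFrame with engineered features and a binary
# label such as "new construction permit issued in the following year".
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def redevelopment_probabilities(parcels, feature_cols, label_col='permit_next_year'):
    '''Fit a classifier and attach a per-parcel probability of (re)development.'''
    X, y = parcels[feature_cols], parcels[label_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    print('held-out accuracy:', clf.score(X_test, y_test))
    scored = parcels.copy()
    scored['p_redevelopment'] = clf.predict_proba(X)[:, 1]
    # highest-probability parcels first, e.g. for targeting limited resources
    return scored.sort_values('p_redevelopment', ascending=False)
```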
These algorithms and technological innovations such as the information system described above, can play a pivotal role in how community stakeholders allocate their limited resources across space. Ken Steif, PhD is the founder of Urban Spatial. He is also the director of the Master of Urban Spatial Analytics program at the University of Pennsylvania. You can follow him on Twitter @KenSteif . The full report can be downloaded here . This work was generously supported by Alan Mallach and the Center for Community Progress . This is the second of two neighborhood change research reports – here is the first . Urban Spatial 508 S. Melville St. Philadelphia, PA 19143",Open data and open-source analytics allows community stakeholders to mine data for actionable intelligence like never before. The objective of this research is to take a first step in exploring the feasibility of forecasting neighborhood change using longitudinal census data in 29 Legacy Cities.,Predicting gentrification using longitudinal census data,Live,64 170,"Homepage IBM Watson Data Lab Follow Sign in / Sign up * Home * Cognitive Computing * Data Science * Web Dev * Mike Broberg Blocked Unblock Follow Following Editor for the IBM Watson Data Platform developer advocacy team. OK person. Mar 17 -------------------------------------------------------------------------------- INTERCONNECT WITH US PRACTICAL INFO FOR THE BIG IBM CONFERENCE MARCH 19–23 If you’re an IBM customer or business partner, you’ve probably heard of the company’s InterConnect conference . If you’re unfamiliar, IBM InterConnect is a huge conference at the Mandalay Bay in Las Vegas. It features the “what’s next” of tech innovation for cloud services, Internet of Things, and IBM Watson. This event might sound overwhelming because IBM is such a big company. As developer advocates, however, we work to make things simple. Our team will be attending with a focus on delivering talks and presenting example code that’s open source, useful, and approachable — whether you do business with IBM or not. MACHINE LEARNING WITH APACHE SPARK™ Our robot pal Marvin took an early flight to Vegas—via UPS. You can find him in the InterConnect DevZone, where he plans to beat you at a high-stakes game of Rock, Paper, Scissors. With a little help from Apache Spark , Marvin uses machine learning algorithms to find patterns in human gameplay, then exploits them as he chooses his moves. “Oh, human. Surely you must be cheating somehow.” —MarvinMarvin dishes out the sass. I expect he’ll be playing this card a lot as he kills time in Vegas waiting for human opponents to arrive: “Nothing personal, human. But my brain is connected to Apache Spark.” —MarvinOFFLINE-FIRST DEMO APP Voice of InterConnect is a web app that uses Hoodie for its backend, where it combines with several IBM services to measure attendee sentiment about the conference (and possibly about dinosaurs). Hoodie is a complete backend for your apps, exposed as a JavaScript API and accessible from the browser. For the Voice of InterConnect app, it’s also hooked up to the following IBM services: Cloudant, Watson Speech to Text, and Watson Natural Language Understanding. Architecture for the Voice of InterConnect sentiment app. Code on GitHub .The app uses Offline First design principles by storing recordings locally, in the web browser, and then using Hoodie’s Apache CouchDB-style data replication to synchronize changes to the backend services, where the analysis happens. 
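The Voice of InterConnect app does its syncing with Hoodie in the browser; purely as a rough server-side illustration of CouchDB-style replication, the sketch below asks a Cloudant (or CouchDB) instance to replicate one database into another through the standard _replicate endpoint. The host, credentials, and database names are placeholders.

```python
# Rough illustration only: trigger CouchDB-style replication between two
# databases on a Cloudant/CouchDB host via the standard _replicate endpoint.
# Host, credentials, and database names are placeholders.
import requests

COUCH_URL = 'https://ACCOUNT.cloudant.com'
AUTH = ('API_KEY', 'API_PASSWORD')

def replicate(source_db, target_db, continuous=False):
    body = {'source': source_db, 'target': target_db,
            'continuous': continuous, 'create_target': True}
    resp = requests.post(COUCH_URL + '/_replicate', json=body, auth=AUTH, timeout=60)
    resp.raise_for_status()
    return resp.json()

# e.g. replicate('recordings', 'recordings_for_analysis')
```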
IBM’s developerWorks TV and The New Builders Podcast did a recent interview on Voice of InterConnect, with the partners building the app: Steve Trevathan of Make&Model and Gregor Martynus of Neighbourhoodie . Here’s the video: MEETUP Speaking of The New Builders folks, they’re hosting a meetup this Sunday at Rí Rá Irish Pub at Mandalay Place , 7:00 p.m. — 10:30 p.m. This event is specifically for developers and data science folks, and will offer lightning talks on chatbots, Offline First development, and the PixieDust helper library for interactive notebooks. The event is free, but you’ll need to register here: The New Builders: Ideas on Tap Event You're invited to join the developer community at RiRa Irish Pub to network, learn & share! Bring a friend!Join us for… www.eventbrite.comPRESENTATIONS Members of our team will be presenting talks and leading Ask Me Anything sessions (AMA) and drop-in labs at InterConnect. AMAs and labs run on a drop-in basis during the times below. Labs take about 20 minutes to complete, and we’ll be there to help. For AMAs, you’ll lead the conversation with your questions, or you can ask for demos. Here’s an overview: VISUALIZING BIG DATA WITH MAPS — AMA WITH RAJ SINGH Ask Raj how to use map-based visualizations to sanity-check your big-data analyses. Tuesday, 3 p.m. — 5 p.m., DevZone AMA # 3 USING NOTEBOOKS WITH PIXIEDUST FOR FASTER, EASIER DATA ANALYSIS — LAB WITH VA BARBOSA AND DAVID TAIEB Explore data sets with PixieDust, an awesome helper library for data science notebooks on Spark, with help from Va and David. Wednesday, 1:15 p.m. — 5 p.m., DevZone Hello World Lab # 4 MOBILE MAPPING WITH THE WATSON DATA PLATFORM — LAB WITH RAJ SINGH Learn to use location data and maps in your mobile apps, with help from Raj. Wednesday, 1:15 p.m. — 5 p.m., DevZone Hello World Lab # 1 CHATBOT ARCHITECTURE, DESIGN AND DEVELOPMENT — AMA WITH MARK WATSON Ask Mark about chatbot architecture. He’ll have some example apps to share too. Wednesday, 2:30 p.m — 5 p.m., DevZone Ask Me Anything # 1 FROM MOBILE FIRST TO OFFLINE FIRST — BREAKOUT SESSION WITH BRADLEY HOLT Bradley will show how you can build fast, responsive apps that will keep users happy, even without a reliable network connection. Thursday, 9:30 a.m. — 10:15 a.m., Islander H FIND US THERE Most of the Watson Data Platform dev advos will be in the DevZone at InterConnect. Where’s the DevZone? It’s in the back of the conference’s concourse (a.k.a. expo area). What does that look like? This!: Get in the zone—the IBM InterConnect DevZone. We don’t want to be starring in “Zone Alone,” after all. LOL, ok, ok.See you in Las Vegas! Thanks to Bradley Holt . * Ibm Watson * Cloudant * Offline First * Apache Spark Blocked Unblock Follow FollowingMIKE BROBERG Editor for the IBM Watson Data Platform developer advocacy team. OK person. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","If you’re an IBM customer or business partner, you’ve probably heard of the company’s InterConnect conference. 
If you’re unfamiliar, IBM InterConnect is a huge conference at the Mandalay Bay in Las…",InterConnect with us,Live,65 171,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectINTRODUCING CLOUDANT FOODTRACKER: AN OFFLINE-FIRST APPBradley Holt / November 10, 2015I love helping people understand the “why” and the “how” of buildingoffline-first apps. An offline-first app is an app that works, without error,when it has no network connection. An offline-first app then applies progressive enhancement to enable additional features and functionality, such as syncing with a clouddatabase, when and if it has a reliable network connection. I’m happy tointroduce to you a new sample app called Cloudant FoodTracker which demonstrates building an offline-first app using Cloudant Sync for iOS ( we just released Cloudant Sync for iOS v1.0 ).Apple provides a great tutorial on starting to develop iOS apps . The tutorial walks readers through creating a simple meal tracking app calledFoodTracker. From the tutorial:“This app shows a list of meals, including a meal name, rating, and photo. Auser can add a new meal, and remove or edit an existing meal. To add a new mealor edit an existing one, users navigate to a different screen where they canspecify a name, rating, and photo for a particular meal.”In a strict sense of the term, Apple’s FoodTracker can be considered anoffline-first app. As Apple’s FoodTracker has no network capabilities, it mightbe better to call it an offline-only app. All of your meal data is stored locally on the device–and it never leavesthe device. Soon we will publish a tutorial that walks you through transforming Apple’sFoodTracker into a true offline-first app that stores its data locally using Cloudant Sync for iOS , and then synchronizes this data with IBM Cloudant. For those of you who want an early preview, we’ve published the Cloudant FoodTracker code on GitHub .SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: cloudant / Cloudant Sync / FoodTracker / iOS / Offline First Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.",I'm happy to introduce to you a new sample app called Cloudant FoodTracker which demonstrates building an offline-first app using Cloudant Sync for iOS.,Introducing Cloudant FoodTracker: An Offline-First App,Live,66 172,"LORNAJANE BLOG 09 Jun 2016FIND MONGO DOCUMENT BY ID USING THE PHP LIBRARY My new job as a Developer Advocate with IBM means I get to play with databases for a living (this is the most awesome thing ever invented, seriously). On my travels, I spent some time with MongoDB which is a document database - but I ran into an issue with fetching a record by ID so here's the code I eventually arrived at, so I can refer to it later and if anyone else needs it hopefully they will find it too. MONGODB AND IDS When I inserted the data to the collection, I did not set the _id field; if this is empty, MongoDB will just generate an ID and use that which is fine by me. My issues arose when I wanted to then fetch that data using that generated identifier. If I inspect my data by using db.posts.find() (my collection is called posts), then the data looks like this: {""_id"":ObjectId(""575038831661d710f04111c1""),... So if I want to fetch by ID, I need to include that ObjectId function call around the ID. USING THE PHP LIBRARY When I came to do this with PHP, I couldn't find an example of using the new MongoDB PHP Library that used the ID in this way (but it's a good library, use it). Older versions of this library used a class called MongoID and I knew that wasn't what I wanted - but had I checked the docs for that, I'd have found that they have been updated to point to the new equivalent so this is also very useful to know if you can only find older code examples! To pass an ID to MongoDB using the PHP Library, you will need to construct a MongoDB\BSON\ObjectID . My example was blog posts and to fetch a record by its ID, I used: $post=$posts->findOne([""_id""=>newMongoDB\BSON\ObjectID($id)]); Later I updated the record - the blog post included nested comments in the record, so to add an array to the comments collection of a record whose _id I knew, I used this code: $result=$posts->updateOne([""_id""=>newMongoDB\BSON\ObjectID($id)],['$push'=>[""comments""=>$new_comment_data]]); Hopefully this gives you a pointer on using the generated IDs in MongoDB from the PHP library and saves you at least as much time as I lost trying to figure this out! FURTHER READING * Importing and Exporting MongoDB Databases * XHGui on VM, Storage on Host * MySQL 5.7 Introduces a JSON Data Type This entry was posted in php and tagged mongodb by lornajane . Bookmark the permalink .POST NAVIGATION ← Previous Next →ONE THOUGHT ON “ FIND MONGO DOCUMENT BY ID USING THE PHP LIBRARY ” 1. Pingback: Community News: Recent posts from PHP Quickfix (06.15.2016) – SourceCode 2. LEAVE A REPLY CANCEL REPLY Please use [code] and [/code] around any source code you wish to share.Comment Name * Email * Website CONTACT * Email: [email protected] * Twitter: @lornajane * Phone: +44 113 830 1739 LINKS * Go PHP7 (ext) * Joind.In * ZCE Links Bundle * ZCE Questions Pack BOOKS AND VIDEOS © 2006-2016 LornaJane.net Icons courtesy of The Noun Project","My new job as a Developer Advocate with IBM means I get to play with databases for a living (this is the most awesome thing ever invented, seriously). 
On my travels, I spent some time with MongoDB …",Find Mongo Document By ID Using The PHP Library,Live,67 173,"KDNUGGETS Data Mining, Analytics, Big Data, and Data Science Subscribe to KDnuggets News | Follow | Contact * SOFTWARE * NEWS * Top stories * Opinions * Tutorials * JOBS * Academic * Companies * Courses * Datasets * EDUCATION * Certificates * Meetings * Webinars KDnuggets Home » News » 2016 » Jun » Tutorials, Overviews » An Introduction to Scientific Python (and a Bit of the Maths Behind It) – NumPy ( 16:n20 )LATEST NEWS, STORIES * In Deep Learning, Architecture Engineering is the New ... What the Next Generation of IoT Sensors Have in Store MNIST Generative Adversarial Model in Keras Online Master of Science in Predictive Analytics Statistical Data Analysis in Python More News & Stories | Top Stories AN INTRODUCTION TO SCIENTIFIC PYTHON (AND A BIT OF THE MATHS BEHIND IT) – NUMPY Previous post Next post Tweet Tags: numpy , Python , Scientific Computing -------------------------------------------------------------------------------- An introductory overview of NumPy, one of the foundational aspects of Scientific Computing in Python, along with some explanation of the maths involved. By Jamal Moir, Oxford Brookes University . Oh the amazing things you can do with Numpy. NumPy is a blazing fast maths library for Python with a heavy emphasis on arrays. It allows you to do vector and matrix maths within Python and as a lot of the underlying functions are actually written in C, you get speeds that you would never reach in vanilla Python. Numpy is an absolutely key piece to the success of scientific Python and if you want to get into Data Science and or Machine Learning in Python, it's a must learn. NumPy is well built in my opinion and getting started with it is not difficult at all. This is the second post in a series of posts on scientific Python, don't forget to check out the others too. An up-to-date list of posts in this series is at the bottom of this post. ARRAY BASICS Creation NumPy revolves around these things called arrays. Actually nparrays, but we don't need to worry about that. With these arrays we can do all sorts of useful things like vector and matrix maths at lightning speeds. Get your linear algebra on! (Just kidding we won't be doing any heavy maths) # 1D Array a = np.array([0, 1, 2, 3, 4]) b = np.array((0, 1, 2, 3, 4)) c = np.arange(5) d = np.linspace(0, 2*np.pi, 5) print(a) # [0 1 2 3 4]print(b) # [0 1 2 3 4]print(c) # [0 1 2 3 4]print(d) # [ 0. 1.57079633 3.14159265 4.71238898 6.28318531]print(a[3]) # 3 The above code shows 4 different ways of creating an array. The most basic way is just passing a sequence to NumPy's array() function; you can pass it any sequence, not just lists like you usually see. Notice how when we print an array with numbers of different length, it automatically pads them out. This is useful for viewing matrices. Indexing on arrays works just like that of a list or any other of Python's sequences. You can also use slicing on them, I won't go into slicing a 1D array here, if you want more information on slicing, check out this post . The above array example is how you can represent a vector with NumPy, next we will take a look at how we can represent matrices and more with multidimensional arrays. # MD Array, a = np.array([[11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25], [26, 27, 28 ,29, 30], [31, 32, 33, 34, 35]]) print(a[2,4]) # 25 To create a 2D array we pass the array() function a list of lists (or a sequence of sequences). 
If we wanted a 3D array we would pass it a list of lists of lists, a 4D array would be a list of lists of lists of lists and so on. Notice how with a 2D array (with the help of our friend the space bar), is arranged in rows and columns. To index a 2D array we simply reference a row and a column. A Bit of the Maths Behind It To understand this properly, we should really take a look at what vectors and matrices are. A vector is a quantity that has both direction and magnitude. They are often used to represent things such as velocity, acceleration and momentum. Vectors can be written in a number of ways although the one which will be most useful to us is the form where they are written as an n-tuple such as (1, 4, 6, 9). This is how we represent them in NumPy. A matrix is similar to a vector, except it is made up of rows and columns; much like a grid. The values within the matrix can be referenced by giving the row and the column that it resides in. In NumPy we make arrays by passing a sequence of sequences as we did previously. Multidimensional Array Slicing Slicing a multidimensional array is a bit more complicated than a 1D one and it's something that you will do a lot while using NumPy. # MD slicingprint(a[0, 1:4]) # [12 13 14]print(a[1:4, 0]) # [16 21 26]print(a[::2,::2]) # [[11 13 15]# [21 23 25]# [31 33 35]]print(a[:, 1]) # [12 17 22 27 32] As you can see you slice a multidimensional array by doing a separate slice for each dimension separated with commas. So with a 2D array our first slice defines the slicing for rows and our second slice defines the slicing for columns. Notice that you can simply specify a row or a column by entering the number. The first example above selects the 0th column from the array. The diagram below illustrates what the given example slices do. Array Properties When working with NumPy you might want to know certain things about your arrays. Luckily there are lots of handy methods included within the package to give you the information that you need. # Array properties a = np.array([[11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25], [26, 27, 28 ,29, 30], [31, 32, 33, 34, 35]]) print(type(a)) # = ad_1.date - interval '29 days' AND ad_2.date � Using this method we can achieve the same results as described above with the window frame. If you're operating over large amounts of data, the window frame option is going to be more efficient, but this alternative exists if you want to use it. CALCULATING A CUMULATIVE MOVING AVERAGE Now that we've reviewed a couple methods for how to calculate a simple moving average, we'll switch up our window frame example to show how you can also do a cumulative moving average. The same principles apply, but rather than having a continually shifting window frame for an interval, the window frame simply extends. For example, instead of doing a 30 day rolling average, we're going to calculate a year-to-date moving average. For each new date, it's value is simply included in the average calculation from all the previous dates. Let's have a look at this example: SELECT ad.date, AVG(ad.downloads) OVER(ORDER BY ad.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_downloads_ytd FROM app_downloads_by_date ad ; Because our base table starts at January 1st for the current year, we're using UNBOUNDED PRECEDING to set our window frame. The results we get back for this cumulative calculation look like this: date | avg_downloads_ytd ----------------------------------- . . . . 
2016-05-26 | 20.2585034013605442 2016-05-27 | 20.3243243243243243 2016-05-28 | 20.2348993288590604 2016-05-29 | 20.1933333333333333 2016-05-30 | 20.2052980132450331 2016-05-31 | 20.2039473684210526 2016-06-01 | 20.2287581699346405 2016-06-02 | 20.2727272727272727 2016-06-03 | 20.2967741935483871 2016-06-04 | 20.3910256410256410 2016-06-05 | 20.3885350318471338 2016-06-06 | 20.3924050632911392 2016-06-07 | 20.4465408805031447 2016-06-08 | 20.4812500000000000 2016-06-09 | 20.4968944099378882 2016-06-10 | 20.4938271604938272 2016-06-11 | 20.4478527607361963 2016-06-12 | 20.3719512195121951 2016-06-13 | 20.3454545454545455 2016-06-14 | 20.3734939759036145 2016-06-15 | 20.3772455089820359 2016-06-16 | 20.4583333333333333 2016-06-17 | 20.4260355029585799 2016-06-18 | 20.3941176470588235 2016-06-19 | 20.3625730994152047 2016-06-20 | 20.3953488372093023 2016-06-21 | 20.4277456647398844 2016-06-22 | 20.4080459770114943 2016-06-23 | 20.4342857142857143 2016-06-24 | 20.4090909090909091 2016-06-25 | 20.3672316384180791 2016-06-26 | 20.3314606741573034 2016-06-27 | 20.3128491620111732 2016-06-28 | 20.3166666666666667 2016-06-29 | 20.3480662983425414 2016-06-30 | 20.4120879120879121 2016-07-01 | 20.4426229508196721 2016-07-02 | 20.4184782608695652 If we chart these results, you can see that the advantage of the cumulative moving average is a further smoothing out of the data so that only significant data changes show up as trends. We see now that there is a slight upward trend year-to-date: WRAPPING UP Now that you know a couple different kinds of moving averages you can use and a couple different methods for calculating them, you can perform more insightful analysis and create more effective reports. In our next Metrics Maven article, we'll look at some options for how to make data pretty so that instead of values like ""20.4184782608695652"", we'll see ""20.42"". See you next time! Image by: extrabrandt Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Figuring out how to calculate a moving average can be a bit daunting if you've never done it. Once you learn a method you like, though, (we'll cover two) it's easy to do and you'll find many uses for it in your tracking and reports.",Metrics Maven: Calculating a Moving Average in PostgreSQL,Live,69 176,"Skip navigation Upload Sign in SearchLoading...Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE.WATCH QUEUEQUEUEWatch Queue Queue * Remove all * Disconnect 1. Loading...Watch Queue Queue __count__/__total__ Find out why CloseOFFLINE-FIRST APPS WITH POUCHDBnode.js Subscribe Subscribed Unsubscribe 3,495 3KLoading...Loading...Working...Add toWANT TO WATCH THIS AGAIN LATER?Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics2,108 24LIKE THIS VIDEO?Sign in to make your opinion count. Sign in 25 0DON'T LIKE THIS VIDEO?Sign in to make your opinion count. 
Published on Dec 11, 2015. Bradley Holt, IBM Cloudant. Web and mobile apps shouldn't stop working when there's no network connection. Based on Apache CouchDB, PouchDB is an open source syncing JavaScript database that runs within a web browser. Offline-first apps that use PouchDB can provide a better, faster user experience—both offline and online. Learn how to build offline-enabled responsive mobile web apps using the HTML5 Offline Application Cache and PouchDB. We'll also discuss how to build cross-platform apps or high-fidelity prototypes using PouchDB, Cordova, and Ionic. PouchDB can also be run within Node.js and on devices for Internet of Things (IoT) applications. This talk includes code examples for creating a PouchDB database, creating a new document, updating a document, deleting a document, querying a database, synchronizing PouchDB with a remote database, and live updates to a user interface based on database changes. Category: People & Blogs. License: Standard YouTube License.
* About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Try something new! * Loading...Working...Sign in to add this to Watch LaterADD TOLoading playlists...","Bradley Holt, IBM Cloudant Web and mobile apps shouldn't stop working when there's no network connection. Based on Apache CouchDB, PouchDB is an open source ...",Offline-First Apps with PouchDB,Live,70 177,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectSIMPLE METRICS TUTORIAL PART 1: METRICS COLLECTIONRaj R Singh / August 25, 2015OVERVIEWThis tutorial explains how we created a lightweight web-tracking app to recorduser actions on our site’s search engine page. See how we use the open source Piwik® web analytics app to collect information and Node.js® to store that data in Cloudant . Then try it yourself by implementing tracking on a demo app we provide. Herein Part 1, we focus on data collection. When you’re done, you can try Part 2,where we show how to visualize the data you’ve gathered.WHY WE BUILT THIS APPWe had a problem here in the Cloud Data Services Developer Advocacy group. GlynnBird created a great faceted search engine that we use on our site’s How-To’s page ( read Glynn’s tutorial on creating your own faceted search engine ). Our How-To’s page is more sophisticated than a static web page. It uses AJAXto respond to user requests. Instead of refreshing the entire page, we updatesmall parts of the page to show results. This meant that traditional server-sidetracking tools that log events wouldn’t help us understand what users are doingin this dynamic context. We also ruled out available client-side trackingservices, because they don’t offer full control over what you track, or how datais stored and analyzed. How could we collect and see the user activity data wewanted? The answer was to create our own app to collect and analyze metrics. Weattached link-tracking to the UI elements dynamically generated by our site’sDOM, and we persisted that data to prepare for future analysis.Metrics appGET DEPLOYEDYou can preview the demo app to see how it works. But first things first. Here in Part 1, we’ll explain howthis app collects metrics. You can find all the code for Part 1 of this tutorialin the metrics-collector GitHub repo . The easiest way to explore the app is to deploy it to Bluemix (IBM’s open cloud platform for building, running, and managing applications).Open the repo’s README and click the Deploy to Bluemix button. When you click it, Bluemix creates and hosts a copy of the coderepository. Thanks, Deploy to Bluemix button!HOW IT WORKSHere’s an architectural overview of our metrics collector. Its middlewarecomponent serves tracker.js and piwik.js , which perform the metrics collection work and persist metrics data to thedatabase. We use Cloudant as our database, a NoSQL JSON document store based on Apache CouchDB™ .Metrics collector architectureTRACKING USER ACTIONS WITH PIWIKWe use the Piwik library to capture search events generated in our web page.Piwik’s JavaScript tracking client offers the ability to capture a host ofclient-side information, from basics like page views and outbound link clicks,down to the most detailed user events. 
For us, it captures the search activityby listening to events on the the user interface elements that create a request:the search text box, and the checkboxes for filtering search results.How-Tos search elementsTo connect Piwik to the web page you want to track, all you do is add one simpleline to that page. If you view the source of our How-Tos page , you’ll find the script tag include that reads:That’s all we do in the HTML page we’re tracking—load the tracker.js script andpass it a single variable, siteid , which is a unique identifier that’s saved to the database with every eventcoming from the How-Tos page. Tip: You can use this tracking app for any type of web page or app. But, you’re notlimited to just one at a time. The “application” identifier is the siteid, so ifyou use the same siteid on different web pages, their metrics are grouped andanalyzed together. (You can still identify the different web pages via the trackPageView Piwik event you’re tracking, and see it in the database as the url key). To see how event collection works, go to the metrics collector app’s repo , open the js folder and look at the tracker.js file. Two interesting functions are customDataFn , which captures metadata about a user’s browser, and enableLinkTrackingForNode , which facilitates link-tracking for a DOM node and lets us programmaticallyattach tracking to individual UI elements as they appear. You can find this line of code in the file cds.js in the search engine GitHub repo . The point of this client-side event tracking is that every user action on thesearch engine interface results in an event submission back to the tracker thatlooks something like this:TRACKING PAYLOAD URL SUBMISSIONhttps://metrics-collector.mybluemix.net/tracker? search=&search_cat=[{""key"":""topic"",""value"":""Data Warehousing""}, {""key"":""topic"",""value"":""Analytics""}]& search_count=7& idsite=cds.search.engine& rec=1&r=493261&h=17&m=46&s=48& url=https://developer.ibm.com/clouddataservices/how-tos/& _id=0e9dcf4b6b5b0dc7& _idts=1433860426& _idvc=2& _idn=0& _refts=0& _viewts=1433881201& _ref=https://google.com& send_image=0& pdf=1&qt=0&realp=0&wma=0&dir=0&fla=1&java=1&gears=0&ag=0& cookie=1&res=3360x2100>_ms=51& uap=MacIntel� rv:31.0) Gecko/20100101 Firefox/31.0& date=2015-5-4Pretty cool so far. We’ve implemented some custom event tracking on our searchengine web app. Next, we persist the data so we can do some usage analytics.PERSISTING USAGE DATA TO CLOUDANTWe’re going to use the Cloudant NoSQL database to store our event data. We do sofor a couple reasons: * Flexibility. Cloudant stores its data as JSON documents. That format provides schema flexibility that’s a nice fit for the event data. * Availability. Cloudant provides high availability read-write access, enabling high levels of concurrent connections, which ensures we never miss user interactions even under heavy load.To take that tracking payload and persist it to a Cloudant database, we wrote alittle Node.js Express app, server.js , which you’ll find in the metrics collector repo . This app accepts the data in an HTTP GET key-value-pair request, transformsit into JSON, and writes it to Cloudant. 
Here’s a sample JSON document showinghow a record is stored in Cloudant:STRUCTURE OF A TRACKING PAYLOAD DOCUMENT { ""type"": ""search"", //Type of event being captured (currently pageView, search and link) ""idsite"": ""cds.search.engine"", //app id (must be unique) ""ip"": ""75.126.70.43"", //ip of the client ""url"": ""https://developer.ibm.com/clouddataservices/how-tos/"", //source url _for_ the event ""geo"": { //geo coordinates of the client (if available) ""lat"": 42.3596328, ""long"": -71.0535177 } ""search"": """", //Search text if any (specific to search events) ""search_cat"": [ //Faceted search info (specific to search events) { ""key"": ""topic"", ""value"": ""Analytics"" }, { ""key"": ""topic"", ""value"": ""Data Warehousing"" } ], ""search_count"": 7, //search result count (specific to search events) ""action_name"": ""IBM Cloud Data Services - Developers Center - Products"", //Document title (specific to pageView events) ""link"": ""https://developer.ibm.com/bluemix/2015/04/29/connecting-pouchdb-cloudant-ibm-bluemix/"", //_target url_ (specific to link events) ""rec"": 1, //always 1 ""r"": 297222, //random string ""date"": ""2015-5-4"", //event date time -yyyy-mm-dd ""h"": 16, //event timestamp - hour ""m"": 20, //event timestamp - minute ""s"": 10, //event timestamp - seconds ""$_id"": ""0e9dcf4b6b5b0dc7"", //cookie visitor ""$_idts"": 1433860426, //cookie visitor count ""$_idvc"": 2, //Number of visits in the session ""$_idn"": 0, //Whether a new visitor or not ""$_refts"": 0, //Referral timestamp ""$_viewts"": 1433881201, //Last Visit timestamp ""$_ref"": 'google.com',//Referral url ""send_image"": 0, //used image to send payload ""uap"": ""MacIntel"", //client platform ""uab"": ""Netscape"", //client browser ""pdf"": 1, //browser feature: supports pdf ""qt"": 0, //browser feature: supports quickTime ""realp"": 0, //browser feature: supports real player ""wma"": 0, //browser feature: supports windows media player ""dir"": 0, //browser feature: supports director ""fla"": 1, //browser feature: supports shockwave ""java"": 1, //browser feature: supports java ""gears"": 0, //browser feature: supports google gear ""ag"": 0, //browser feature: supports silver light ""cookie"": 1, //browser feature: has cookies ""res"": ""3360x2100"", //browser feature: screen resolution ""gt_ms"": 51 //Config generation performance generation time }Let’s look at server.js . First, we load in required modules, including one called cloudant (loaded from the file storage.js ) that simplifies the process of connecting to a Cloudant database—much thesame way the excellent nano library simplifies connecting to an Apache CouchDB database. (Cloudant is, in manyways, an extension of CouchDB.) We set up our database connection in the trackerDb variable initialization and add some secondary indices to it at the same time.(In Cloudant and in CouchDB, secondary indices are defined by JavaScript Mapfunctions.) Then, we set up Express to serve the static JavaScript files. Thefollowing code around line 66 makes any file in the js directory web-accessible via the url http://metrics-collector.mybluemix.net/ :app.use(express.static(path.join(__dirname, 'js')));Last but not least, the app accepts event-tracking data on the /tracker endpoint. In app.get(""/tracker""... we take the data and use lodash to construct the JavaScript “tracking payload” object shown earlier. You may have noticed that our Node.js Express app is doing double duty. 
Notonly does it accept requests to save tracking information for persisting toCloudant, that same app serves out the JavaScript files, tracker.js and piwik.js .IMPLEMENT TRACKING ON A SAMPLE APPNow, try it for yourself. Test your deployment and implement tracking on a asample web app.CLONE THE SAMPLE APPLICATIONFor this test, we’ll use the guitars faceted search engine app written by Glynn Bird. 1. Copy the app to your local machine. git clone https://github.com/glynnbird/guitars 2. Add the following tracking script tag to index.html : 3. Edit guitar.js to add the tracking code for dynamically generated content. Locate the following code around line 140 and match what you see here: $('#searchtitle').html(html); //Reset the tracking for these elements if ( typeof _paq !== 'undefined' ){ _paq.push([ enableLinkTrackingForNode, $('#searchtitle')]); } Then around line 52: $.ajax(obj).done(function(data) { $('#loading').hide(); if (callback) { callback(null, data); } //Track the search results, do not log the initial page load as a search if ( searchText !== """" || (filter && $.isArray(filter) && filter.length � } } VERIFY THAT THE EVENTS ARE BEING RECORDED 1. Go to Bluemix and locate your metrics-collector application. 2. Click metrics-collector-cloudant-service . 3. Click the Launch button. 4. Click the tracker_db database and note the number of docs in the database. 5. In your favorite browser, launch the guitars index.html . 6. Search for some guitars and click on a few filters. 7. Go back to the Cloudant dashboard and reload the page. You’ll see that the number of docs has increased.You’ve now verified that the metrics collector application is correctly deployedon Bluemix and gathering data. In Part 2 of this tutorial, you’ll see how torepresent that data graphically in a report.SUMMARY OF METRICS COLLECTIONHere in Part 1 of this tutorial, you learned how to use Piwik to collect useractions and persist the data to a Cloudant database. Now you’re ready for Part2, Metrics Analytics , where you’ll learn how to display that data graphically in a report. Like Simple Metrics Collector?© “Apache”, “CouchDB”, “Apache CouchDB” and the CouchDB logo are trademarks orregistered trademarks of The Apache Software Foundation. All other brands andtrademarks are the property of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Tutorial for creating a web-tracking app that works with dynamically-generated UI elements. Uses Node.js, Cloudant, and IBM Bluemix.",Simple Metrics Tutorial Part 1: Metrics Collection -- Code a web analytics app with Node.js and IBM Cloudant,Live,71 179,"COULD POSTGRESQL 9.5 BE YOUR NEXT JSON DATABASE?Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Apr 13, 2016TL;DR: No, but that's not the right question.Just over a year ago we asked Is PostgreSQL Your Next JSON Database ... Now, with PostgreSQL 9.5 out, it's time to check if Betteridge's law still applies. 
So let's talk about JSONB support in PostgreSQL 9.5.For context, and for those of you who haven't been following, it's worth knowingthe history of JSON in PostgreSQL. If you're all up to speed already, just skip ahead to read about the new features. The JSON story begins with the arrival of JSONin PostgreSQL 9.2..JSON IN 9.2The original JSON data type that landed in PostgreSQL 9.2 was basically a textcolumn flagged as JSON data for processing through a parser. In 9.2 though, youcould turn rows and arrays in json and for everything else you have to dive intoone of the PL languages. Useful in some cases but ... more, lots more wasneeded. To illustrate, if we had JSON data like this:{ ""title"": ""The Shawshank Redemption"", ""num_votes"": 1566874, ""rating"": 9.3, ""year"": ""1994"", ""type"": ""feature"", ""can_rate"": true, ""tconst"": ""tt0111161"", ""image"": { ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388 }}We could create a table like so:CREATE TABLE filmsjson ( id BIGSERIAL PRIMARY KEY, data JSON );` And insert data into it like so:compose=> INSERT INTO filmsjson (data) VALUES ('{ ""title"": ""The Shawshank Redemption"", ""num_votes"": 1566874, ""rating"": 9.3, ""year"": ""1994"", ""type"": ""feature"", ""can_rate"": true, ""tconst"": ""tt0111161"", ""image"": { ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388 }}')INSERT 0 1 compose=�And apart from storing and retrieving the entire document, there was little wecould do with it. Notice that all the spaces and carriage returns have beenpreserved. That'll be important later...FAST FORWARD TO POSTGESQL 9.3.On the back of a new parser for JSON in PostgreSQL 9.3, operators appear toextract values from the JSON data type. Chief among them is -> which can, given an integer, extract a value from a JSON array or, given astring, member of an JSON data type and ->> which does the same but returns text. Building on this is #> and #>> which allow a path to be specified to the value to be extracted.With our previous example table, that meant we could now at least peer into theJSON and do a query like:compose=> select data-� ?column? ---------------------------- ""The Shawshank Redemption""(1 row)compose=> select data#� ?column? ---------- 933(1 row)Yes, the path is a list of keys working down through the JSON document. Don't becaught out thinking the curly braces represent JSON though - this is a textarray as a literal string which PostgreSQL interprets into a text[]. That meansthat query is equivelant to this:select data#� These were joined by a good set of functions but this was all still pretty limited. It didn't really allow for complexqueries, there was limited indexing on particular fields and only a few ways tocreate new JSON elements. But most importantly all that on the fly parsing of atext field wasn't efficient.CUT TO POSTGRESQL 9.4.PostgreSQL 9.4 is where JSONB arrived. JSONB is a binary encoded version of JSONwhich efficiently stores the keys and values of a JSON document. This means allthe space padding is gone and with it all the need to parse the JSON. The downside is that you can't have repeated keys at the same level and you generallylose all the formatted structure of the document. It's a sacrifice thats wellworth making because everything gets generally more efficient because there's noon the fly parsing. 
It does slow inserts down because it's there that theparsing actually gets done. To see the difference, let's create a JSONB tableand insert our example data into it:compose=� CREATE TABLE compose=�INSERT 0 1 compose=> select * from filmsjsonb id | data ----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1 | {""type"": ""feature"", ""year"": ""1994"", ""image"": {""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388}, ""title"": ""The Shawshank Redemption"", ""rating"": 9.3, ""tconst"": ""tt0111161"", ""can_rate"": true, ""num_votes"": 1566874}(1 row)Yes, that is rather wide. All the spaces and returns from the JSON data havegone leaving one compact key/value list.Although they share many features, here's a fun fact: JSONB has no creationfunctions. In 9.4, the JSON data type got a bundle of extra creation functions: json_build_object() , json_build_array() and json_object() . Use those, or other creation functions, and cast to JSONB ( ::jsonb ) to get the JSONB version. It reflects the logic the PostgreSQL developershave applied - JSON for document fidelity and storage, JSONB for fast, efficientoperations. So while JSON and JSONB both have the -> , ->> , #> and #>> operators, only JSONB has the ""contains"" and ""exists"" operators @> , <@ , ? , ?| and ?& .Exists is a check for strings that match top-level keys in the JSONB data so wecan check there's a rating field in our example data like so:compose=> select data-� ?column? ---------------------------- ""The Shawshank Redemption""(1 row)But if we queried for the url key that's inside the image value, we'd failcompose=> select data-� ?column? ----------(0 rows)But we could test the image value, like so:compose=> select data->'title' from filmsjsonb where data-� ?column? ---------------------------- ""The Shawshank Redemption""(1 row)The ?| operator does the same thing but ""or"" matches the keys against an array ofstrings rather than just one string. The ?& operator does a similar thing but ""and"" matches so all the strings in the arraymust be matched.But exists operators just check for presence. With the '@ ' contains operator you can match keys, paths and values. But let's quicklyimport some more movies into the database first. Ok, now say we want all themovies from 1972, we can look for the records that contain ""year"":""1972"".compose=> select data->'title' from filmsjsonb where data @� ?column? ----------------- ""The Godfather"" ""Solaris""(2 rows)And we can look for particular values within objects:compose=> select data->'title' from filmsjsonb where data @� ?column? -------------------------------------- ""The Green Mile"" ""My Neighbor Totoro"" ""Nausicaä of the Valley of the Wind""(3 rows)9.4 also brought creating GIN indexes which cover all the fields in the JSONBdocuments for all JSON operations. 
It's also possible to create GIN indexes with json_path_ops set which gives smaller, faster indexes but only for use of the @> contains operator which is actually remarkably useful as many JSON operationson nested documents are about finding documents which contain particular values.That said, there's still plenty of scope for more comprehensive and capableindexing.So, 9.4 brought PostgreSQL up to the point where you could create, extract andindex JSON/JSONB. What was missing though was the ability to modify theJSON/JSONB data types. You still had to look at passing the JSON data to a PLv8or PLPerl script where it could be natively manipulated. So, things were closeto being a full service JSON document handling environment, but not quite.ENTER POSTGRESQL 9.5PostgreSQL 9.5's new JSON capabilities are all about modifying and manipulatingJSONB data. Apart from one, that is. The jsonb_pretty() function takes JSONB and makes it more readable so you go from:compose=� data ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {""type"": ""feature"", ""year"": ""1994"", ""image"": {""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388}, ""title"": ""The Shawshank Redemption"", ""rating"": 9.3, ""tconst"": ""tt0111161"", ""can_rate"": true, ""num_votes"": 1566874}(1 row)To a much more digestable form...compose=� jsonb_pretty --------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"",+ ""width"": 933, + ""height"": 1388 + }, + ""title"": ""The Shawshank Redemption"", + ""rating"": 9.3, + ""tconst"": ""tt0111161"", + ""can_rate"": true, + ""num_votes"": 1566874 + }(1 row)Which is much more readable and going to pop up in any JSON related PostgreSQL9.5 examples. On to the operators....LET US DELETEThe simplest modifier is deletion. Just say what you want gone and make it goaway. For that, 9.5 introduces the - and #- operators. The - operator works like the -> operator except instead of returning a value from an array (if given an integeras a parameter) or object (if given a string), it deletes the value or key/valuepair. So, with our movie database, if we want to remove the rating field thenthis does the trick:compose=� UPDATE 250 The #- operator goes further, taking a path as a parameter. 
So say we wanted to removethe image's dimension properties:compose=� UPDATE 250 compose=� UPDATE 250 compose=� jsonb_pretty -------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg""+ }, + ""title"": ""The Shawshank Redemption"", + ""tconst"": ""tt0111161"", + ""can_rate"": true, + ""num_votes"": 1566874 + }(1 row)We do two updates because the path specifier doesn't allow for optional keys butwe can get it down to one update by remembering that the set expression can beas complex as we need it.compose=� UPDATE 250 Although you can delete data from the database, remember that you can also justremove it from your output too:compose=� jsonb_pretty -------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg""+ }, + .... CONCATENATIONThe operator for manipulations is the concatenation operator || . This tries to combine two JSONB objects into one. It works with the top levelkeys of both values only and when the same key is present on both sides, itresolves it by taking the right-hand operand's value. This means you can use itas an update mechanism too. Say, using out example data, we need to set the can_rate field to false, clear the num_votes field and add a new revote field set to true...compose=� UPDATE 250 compose=� jsonb_pretty --------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"",+ ""width"": 933, + ""height"": 1388 + }, + ""title"": ""The Shawshank Redemption"", + ""rating"": 9.3, + ""revote"": true, + ""tconst"": ""tt0111161"", + ""can_rate"": false, + ""num_votes"": 0 + }(1 row)This is a generally useful way to merge JSONB data types, for example in postprocessing. As an update method it leaves something to be desired. Updating asingle top level-field, it's a bit overkill. Updating a nested single field in adocument, then you have to dig your way down to the containing object and mergefrom there. If only there was a simple way to set a particular field...JSONB_SET FOR SUCCESSThe jsonb_set() function is designed for updating single fields wherever they are in the JSONdocument. Let's jump straight to an example:compose=� This will change the value of the image.width property to 1024. The argumentsfor jsonb_set() are simple; the first argument is a JSONB data type you want to modify, thesecond is a text array path and the third is a JSONB value to replace the valueat the end of that path. If the key/value pair at the end of the path doesn'texist, by default, jsonb_set() creates and sets it. To stop that behavior, add a fourth optional parameter(""create_missing"") and set it to false. If ""create_missing"" is true but othercomponents of the path don't exist then jsonb_set() won't try to create the entire path and will just fail. 
Say we wanted to add anew object to our image data about picture rights, we can simply add in the JSONdata for that new object:compose=� compose=� jsonb_pretty --------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1972"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_.jpg"",+ ""width"": 1024, + ""height"": 500, + ""quality"": { + ""copyright"": ""company X"", + ""registered"": true + } + }, + ""title"": ""The Godfather"", + ""rating"": 9.2, + ""tconst"": ""tt0068646"", + ""can_rate"": true, + ""num_votes"": 1072605 + }(1 row)jsonb_set() is probably the most important addition in PostgreSQL 9.5's JSON functions. Itoffers the chance to change data in-place within JSONB data types. Do rememberthat where we've used simple values to set parameters is only for examples; youcould have PostgreSQL subqueries creating new values and co-ercing them intoJSONB subdocuments or arrays to create richer JSON documents.CONSIDER THISWhat this all leads to is an interesting position for PostgreSQL. PostgreSQL9.5's JSON enhancements mean that you could use PostgreSQL as a JSON database;it's fast and functional. Whether you'd want to is a different consideration.For example, the relatively accessible APIs or client libraries of many JSONdatabases are not there. In their place is a PostgreSQL specific dialect of SQLfor manipulating JSON which is used in tandem with the rest of the database'sSQL to exploit the full power of it. This means you still have to learn SQL, arequirement which, unfortunately, too many people use as their reason for usinga ""NoSQL"" database.You can use PostgreSQL to create rich, complex JSON/JSONB documents within thedatabase. But then if you are doing that, you may want to consider whether youare using PostgreSQL well. If the richness and complexity of those documentscomes from relating the documents to each other then the relational model isoften the better choice for data models that have intertwined data. Therelational model also has the advantage that it handles that requirement withoutlarge scale duplication within the actual data. It also has literally decades ofengineering expertise backing up design decisions and optimizations.What JSON support in PostgreSQL is about is removing the barriers to processingJSON data within an SQL based relational environment. The new 9.5 features takedown another barrier, adding just enough accessible, built-in and efficientfunctions and operators to manipulate JSONB documents.PostgreSQL 9.5 isn't your next JSON database, but it is a great relationaldatabase with a fully fledged JSON story. The JSON enhancements arrive alongsidenumerous other improvements in the relational side of the database, ""upsert"",skip locking and better table sampling to name a few.It may not be your next JSON database, but PostgreSQL could well be the nextdatabase you use to work with relational and JSON data side by side.Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writersince Apples came in II flavors and Commodores had Pets. Love this article? Headover to Dj Walker-Morgan’s author page and keep reading. 
","Just over a year ago we asked Is PostgreSQL Your Next JSON Database... Now, with PostgreSQL 9.5 out, it's time to check if Betteridge's law still applies. So let's talk about JSONB support in PostgreSQL 9.5.",Could PostgreSQL 9.5 be your next JSON database?,Live,72 180,"THIS WEEK IN DATA SCIENCE (SEPTEMBER 27, 2016) Posted on September 27, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * How Open Data Is Making Our Cities More Efficient – A new collaboration between the EU and Japan is looking to support the development of smart cities with a cloud-based shared platform * Self-Driving Cars Gain Powerful Ally: The Government – Uber, the ride-hailing giant, began trials in Pittsburgh last week using driverless technology. The government’s new guidelines for autonomous driving will speed up the rollout of self-driving cars, experts said. * MIT aims to make sense of Twitter’s presidential debate firehose – Using machine learning, Electome researchers analyze public’s debate conversations * Like a gym membership, data has no value unless you use it – Data is like that gym. How much you use it, how well you exercise and apply it, and how far it reaches into your work life determine the value return from having it * Meet Trace Genomics, The “23andMe” Of Soil – For $199, farmers can understand their soil better which is key to keeping their crops healthy. * Researchers Use Wireless Signals to Recognize Emotions – System that uses reflected radio signals has potential applications for smart homes, offices and hospitals * Is Artificial Intelligence Permanently Inscrutable? – Despite new biology-like tools, some insist interpretation is impossible. * Watch: IBM Watson creates the first AI-made movie trailer – and it’s really eerie – Now IBM Watson has added yet another skill to its arsenal as it just learned how to make movie trailers. * Airbnb Shows How Private Sector Can Use Data to Fight Discrimination – Airbnb has acknowledged the bias present on its platform, noting that “minorities struggle more than others to book a listing,” and has created a plan to tackle discrimination on its platform. * What Math Looks Like in the Mind – In a surprise to scientists, it appears blind people process numbers by tapping into a part of their brains that’s reserved for images in sighted individuals.
* Top Algorithms and Methods Used by Data Scientists – Latest KDnuggets poll identifies the list of top algorithms actually used by Data Scientists, finds surprises including the most academic and most industry-oriented algorithms. * Why The Cars of the Future Will Rely on the IoT – The future of vehicles is exciting, and engineers are working toward safer, simpler, and faster modes of transportation all the time. * IBM Watson and The Weather Company Are Ready to Launch Their First Cognitive Ads – The Weather Company is getting ready to roll out its first ad campaign since being acquired by IBM earlier this year. But for the first brand, Campbell Soup Company, it’s featuring the supercomputer Watson as the chef. * 14 Traits Of The Best Data Scientists – Actual data scientists are in high demand, and there’s not enough of them to go around. If you want to identify the right talent, consider these tips. * How Big Data Changes the Economics of Renewable Energy – Big data can boost the transition to renewable energy sources much faster, says WSJ Energy Expert Jason Bordoff UPCOMING DATA SCIENCE EVENTS * Deriving value from the data lake – Join Nik Rouda, Senior Analyst for Enterprise Strategy, on October 6th, to learn more about data lakes. * Machine Intelligence Summit New York – Come hear from amazing speakers, discover emerging trends, and expand your network at the Machine Intelligence Summit on November 2nd-3rd. * IBM Webinar: Driving Innovation and Growth with Big Data – Join Noel Yuhanna, Principal Analyst at Forrester Research, on October 6th, to hear how an emerging collection of technologies that Forrester calls big data fabric is driving innovation and growth. NEW IN BIG DATA UNIVERSITY * Text Analytics – This course introduces the field of Information Extraction and how to use a specific system, SystemT, to solve your Information Extraction problem. * Advanced Text Analytics – This course goes into details about the SystemT optimizer and how it addresses the limitations of previous IE technologies.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (September 27, 2016)",Live,73 181,"IBM ML Hub, Apr 26 THE 3 KINDS OF CONTEXT: MACHINE LEARNING AND THE ART OF THE FRAME “What do you do for a living?” That question used to have a pretty clear answer: “I’m a data scientist.” But lately, it’s gotten more complicated… The two of us — Jorge Castañón and Óscar D. Lara Yejas — do data science at IBM’s Machine Learning Hub , where clients from around the world bring us their mission-critical goals for turning data into knowledge.
The clients range across industries from insurance to retail to energy to finance. On Monday, a manufacturer needs to use data from the quality control process to fast-forward the testing of components — with no forfeit of excellence. On Wednesday, a healthcare provider needs to isolate the real factors that tip patients from low risk to high risk. On Friday, a credit union needs to improve its loyalty offerings for retirees. Most client visits last just two days. In those 48 hours, our job is to go from zero to insight. We set priorities, access and clean the data, fire up Jupyter notebooks, set up a collaboration environment with Data Science Experience (yes, this is a plug; it’s fantastic.), choose algorithms, build models, run and tweak the models, and generate visualizations and recommendations. But none of that is the complicated part… THE 3 KINDS OF CONTEXT The complicated part is about context. We’ve learned that hopping from project to project doesn’t just mean hopping from one context to another — it means hopping across multiple kinds of context: * Industry context * Data context * Transfer context The first two are fairly intuitive. The third is less so. Let’s take each of them in turn. INDUSTRY CONTEXT Well before we dive into data and models, we ask clients to convey their domain expertise. These are people with a seemingly limitless understanding of the industry issues at play — and of the dynamics that are shaping the demands of those they serve. The more we listen to these clients, the clearer it becomes that each industry (aka sector, aka vertical) represents a problem space unto itself, each with its own goals for data: Healthcare clients tend to want to solve classification problems. Finance and energy clients tend to want to solve certain kinds of prediction problems. And manufacturing, transportation, and insurance clients tend to want to solve optimization problems. Are those tendencies cut-and-dry? Absolutely not. But combined with careful listening, they give us a place to start. But then there are the limitations. Sometimes the limitation is about the client’s familiarity with machine learning itself. Healthcare and retail have been deep into machine learning for years, while other industries are just ramping up. (Interestingly, sometimes the less familiar the better, since some clients turn out to be sitting on troves of accumulated data — typically proprietary data behind the firewall that’s just waiting to be mined .) And sometimes the limitation is about the need for interpretability. The algorithms and models we choose vary from client to client based on whether the models need to “show their work”. Our healthcare client needed more than a numerical prediction of risk migration for a given patient; they needed to know the factors at play and the weight for each factor. By the same token, banks, insurers, and government bureaus need to be able to assure watchdogs and regulators that their ML-driven automations are bias-free. To preserve interpretability for those industries, we might try to favor methods like logistic regression and decision trees . Where interpretability is less important — for example, in retail — we can jump into deep learning and other black-box approaches. It’s only after we have our heads around that industry context that we start to puzzle through the actual data. DATA CONTEXT After cleaning and formatting the data we get from clients, we’re looking for what kinds of ML models the data is capable of driving. 
And let’s be frank: some clients approach us with real problems that just can’t be addressed with machine learning and the data at hand, so first we talk through what’s possible. Once we have something tractable in mind, we can start to ask more questions: What are the inputs and outputs? What’s the plan for feature extraction? Should we use supervised or unsupervised learning? (So far, it hasn’t made sense to use reinforcement learning , but maybe someday soon.) Is the response variable continuous or a class that you want to predict? If you need a classification model, which variables help to represent the classes we’ll use? And so on. That work gets us a list of potential models. But on top of all that, we also want some context about how data comes to the system in the real world. How much data? How often? As a stream or in batches? Not to mention questions about provenance, governance, and security. We seldom have enough time to go as deep as we want to with clients, but without some of that context, we might end up creating models that can’t actually be deployed, accessed, or retrained. So, the industry information and the data take us a long way toward framing our efforts, but there’s one more angle we didn’t anticipate. TRANSFER CONTEXT The more time we spend at the Machine Learning Hub, the more we’re struck by what we’re learning about learning. Naturally, we want to come fresh to every encounter — but we still want to benefit from all the work we’ve done before. As we think about those trade-offs, we’re realizing that our daily work as flesh-and-blood data scientists maps onto a key aspect of the search for artificial general intelligence (AGI): transfer learning. As the name suggests, transfer learning means trying to improve performance on a task by leveraging knowledge acquired from some related task. That’s something we do every day. How well we do over time will depend on how successfully we can discern the knowledge that we should — and shouldn’t — transfer from one engagement to another. In that sense, the third context is really about our roles as data scientists and being aware of that context means being aware of our opportunities for improving our methods across a wide range of problem spaces — while also thinking of ourselves as learning machines that are prone to cognitive biases . Who knows, maybe the processes we develop at the Machine Learning Hub will offer clues to achieving AGI . ART OF THE FRAME As much as thinking about these three contexts has helped us, it’s also reinforced the fact that machine learning is often more art than science. For us, it’s an art of emphasis and de-emphasis. It’s the art of finding frames to put around the world — whether that’s a frame around an industry, a frame around the data, or a frame around our own learning. Whatever the frame, our hope is to enlarge and energize the features that matter — and to see them with fresh eyes. For more about our work at the Machine Learning Hub or to schedule a session, reach out to us. We’d love to continue the conversation. * Machine Learning * Data Science * Industry * Data * Transfer Learning 5 Blocked Unblock Follow FollowingIBM ML HUB IBM Machine Learning Hub. To be the best, learn from the best. Latest on Machine Learning, AI & more. Info: MLHub@us.ibm.com http://ibm-ml-hub.com/ #IBMML #ML FollowINSIDE MACHINE LEARNING Deep-dive articles about machine learning and data. Curated by IBM Analytics. 
","“What do you do for a living?” That question used to have a pretty clear answer: “I’m a data scientist.” But lately, it’s gotten more complicated…",The 3 Kinds of Context: Machine Learning and the Art of the Frame,Live,74 182,"DATA SCIENCE EXPERIENCE: TOUR THE COMMUNITY SECTION developerWorks TV Published on Oct 3, 2017 Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
","This video provides a tour of the Community section in IBM Data Science Experience. ",Tour the Community in DSX,Live,75 183,"THIS WEEK IN DATA SCIENCE (MAY 2, 2017) Posted on May 2, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * Video Roundup: New from IBM Watson – A brief run down of some new things IBM Watson is tackling. * Five Missteps to Avoid on your First Big Data Journey. – Steps to take in order to avoid common Big Data pitfalls. * How Machine Learning Is Changing The Future Of Digital Businesses – How Machine Learning impacts automation and digital transformation. * Hacking maps with ggplot2 – Short look at mapping with the R package ggplot2. * AI & Machine Learning Black Boxes: The Need for Transparency and Accountability – The importance of comprehending the inner workings Machine Learning Algorithms. * How just 30 machines beat a warehouse-sized supercomputer to set a new world record – IBM partners with Nvidia to showcase the ability of massively parallel processing on GPUs. * Data Analytics Is The Key Skill For The Modern Engineer – How engineers can embrace Data Analytics to streamline business operations and task integration. * Building and Exploring a Map of Reddit with Python – A tutorial on how to explore a map of the most popular subreddits with python. * Data scientists really love their jobs, survey finds – The results of a survey showing how satisfied Data Scientists are with their jobs. * Reproducible Data Science with R – A presentation on the application of a Reproducible Workflow to Data Science in R. * IBM uses deep learning to better detect a leading cause of blindness – IBM has made another application of cognitive computing to the medical field. * Awesome Deep Learning: Most Cited Deep Learning Papers – A list of fairly recent must read publications on Deep Learning.
* Emotion Detection Using Machine Learning – An example of the use of Deep Learning to perform feature extractions. * Plotting Data Online via Plotly and Python – Introductory steps to creating plots with Plotly. * Machine Learning Classification Using Naive Bayes – A classification exercise using the Naive-Bayes algorithm in R. * The Art of Data – How Watson, fed with data about different subjects, helped to create art. FEATURED COURSES FROM BDU * SQL and Relational Databases 101 – Learn the basics of the database querying language, SQL. * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. * Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems. UPCOMING DATA SCIENCE EVENTS * UofT Data Science Workshop: Intro to Clustering with R – May 2, 2017 @ 6:00 pm – 9:00 pm * UofT Data Science Workshop: Intro to Classification with R – May 4, 2017 @ 6:00 pm – 7:00 pm * IBM Webinar: Charting Your Analytical Future Webinar: Get the best of Self-service Analytics and Managed reporting together – May 4, 2017 @ 12:00 pm – 1:00 pm COOL DATA SCIENCE VIDEOS * Machine Learning With Python – Collaborative Filtering & Its Challenges – An Exploration of Collaborative Filtering Techniques. * Machine Learning With Python – Course Summary – A review of the BDU course Machine Learning 101.","Here’s this week’s news in Data Science and Big Data. ","This Week in Data Science (May 2, 2017)",Live,76 184,"
","While the sum of Facebook's offerings covers a broad spectrum of the analytics space, we continually interact with the open source community in order to share our experiences and also learn from others.",Apache Spark @Scale: A 60 TB+ production use case,Live,77 186,"THIS WEEK IN DATA SCIENCE (MAY 16, 2017) Posted on May 16, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * General Tips for Web Scraping with Python – Tips on scrapping and saving data from the web. * Top 10 Skills in Data Science – The results of a study on the skills possessed by Data Science. * Data Mining for Social Intelligence – Opinion Data as a Monetizable Resource – A look at how opinion data is quickly becoming a monetary resource. * Sparking change: How analytics is helping global communities improve water security – How Water Mission turned to IBM to use analytics to improve access to safe water. * How to go about interpreting regression coefficients – A brief look at coefficients and how to interpret them. * Three Mistakes that set Data Scientists up for Failure – Mistakes that Data Scientists may make in their line of work and how to avoid them. * Analytics and the cloud: The rise of open source – Open Source and IBM’s involvement in Open Source software. * Top 15 Python Libraries for Data Science in 2017 – A look at 15 of the most popular Python Data Science libraries. * IBM updates PowerAI to make deep learning more accessible – How IBM updates to PowerAI will make it easier for Data Scientists and developers to integrate and deploy models. * Big Data for Humans: The Importance of Data Visualization – The importance of the most crucial and oft overlooked step in Analytics: Data Visualization. * Top 3 ways to measure the success of your analytics investment – Three factors to consider when evaluating technologies that aid in business decisions. * Pretty histograms with ggplot2 – Learn to create visually stimulating histograms by example with ggplot2 for R. * IBM pushes for NVMe adoption to boost storage speeds – Why the adoption of NVMe is necessary for today’s vast amounts of data. * In case you missed it: April 2017 roundup – A look back at all the stories from Revolutions R blog. * Machine Learning Pipelines for R – How the R package pipeliner helps to streamline the process of building machine learning and statistical models.
* Machine Learning. Linear Regression Full Example (Boston Housing). – Short tutorial on performing linear regression on a data set. FEATURED COURSES FROM BDU * SQL and Relational Databases 101 – Learn the basics of the database querying language, SQL. * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. * Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems.","Here’s this week’s news in Data Science and Big Data.","This Week in Data Science (May 16, 2017)",Live,78 187,"This video shows you how a Cloudant Geospatial index is used in a real world application. Watch the other videos in this series titled ""Introducing Cloudant Geospatial"" and ""Build and Query a Cloudant Geospatial Index"". Find more videos in the Cloudant Learning Center at http://www.cloudant.com/learning-center.",See how a Cloudant Geospatial index is used in a real world application. ,Tutorial: Cloudant Geospatial in Action,Live,79 190,"Akhil Tandon, Jun 9 LEVERAGE SCIKIT-LEARN MODELS WITH CORE ML OVERVIEW This post discusses how to implement Apple's new Core ML platform within DSX, which was announced a few days ago at WWDC 2017. Core ML is a platform that allows integration of powerful pre-trained models into iOS and macOS applications. Core ML comes with two main benefits: efficiency and privacy. Core ML has been specifically engineered for on-device performance. Having a pre-trained model accessible on your device removes a network connection requirement and ensures privacy for users. But the best thing about Core ML is that you can continue to use your favorite machine learning libraries in Python, and easily convert your pre-trained models to Core ML objects for use in your iOS and macOS application development. The conversion to Core ML objects from libraries such as Keras , sklearn , LibSVM , and others is supported out-of-the-box in Data Science Experience. INSTALLATION You can install coremltools via pip, which can be called from within a notebook in DSx. It's important to note that Core ML supports Python 2.7 only.
!pip install -U coremltools

CREATE A LINEAR MODEL WITH SCIKIT-LEARN

First, create some data using numpy , a library for computing with Python. We'll create a very simple model because the focus of this short guide is converting a scikit-learn object to a Core ML model.

import numpy as np
x_values = np.linspace(-2.25,2.25,300)
y_values = np.array([np.sin(x) + np.random.randn()*.25 for x in x_values])

Now that we've got our data, we'll perform a linear regression.

from sklearn.linear_model import LinearRegression
lm = LinearRegression().fit(x_values.reshape(-1,1), y_values)

CREATE A CORE ML MODEL

Core ML supports many kinds of machine learning models in addition to linear models, including neural networks, tree-based models, and more. The Core ML model format supports the .mlmodel file extension. We'll show how to instantiate an MLModel using this kind of file. The aim is to painlessly transition from an sklearn object to a Core ML model.

from coremltools.converters import sklearn
coreml_model = sklearn.convert(lm)
print(type(coreml_model))

Now coreml_model is our Core ML object. The MLModel class has a few attributes and methods. Metadata contains information about the origin, author, inputs and outputs, among other things. Let's see how this works.

coreml_model.author = ""DSX""
print(coreml_model.author)
DSX

We can add other metadata as we please. The list of attributes includes: * author : The author of the model. * input_description : The descriptions of the inputs. This can include information about the data types, number of features, and more. In our example, we have a single input, a real valued number. * output_description : A description of the output. * short_description : A comment on the purpose of the model. * user_defined_metadata : Anything you like!

coreml_model.short_description = ""I approximate a sine curve with a linear model!""
coreml_model.input_description[""input""] = ""a real number""
coreml_model.output_description[""prediction""] = ""a real number""
print(coreml_model.short_description)
I approximate a sine curve with a linear model!

At this point you have a tuned and labeled CoreML object. The goal is to seamlessly integrate this into the existing workflow of an iOS/macOS application developer who needs your machine learning models. Saving the model to local storage is very easy using coremltools :

coreml_model.save('linear_model.mlmodel')

We can also create an MLModel object using a .mlmodel file.

from coremltools.models import MLModel
loaded_model = MLModel('linear_model.mlmodel')
print(loaded_model.short_description)
I approximate a sine curve with a linear model!

SAVE YOUR MODEL

An application developer can access your trained model with Object Storage using IBM Bluemix . You will need your Bluemix credentials to link to Object Storage, which can be generated from the data assets tab in your notebook: You need to have some files in your data assets for this screen to be visible!
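As a quick check of what the converter produced, you can also inspect the generated model specification and, on macOS (where the Core ML framework itself is available), run predictions directly from Python. This snippet is a sketch rather than part of the original notebook; the 'input' and 'prediction' names are the defaults shown above.

spec = coreml_model.get_spec()
print(spec.description)   # lists the model's input and output features

# On macOS only -- here Core ML, not scikit-learn, does the scoring:
# print(coreml_model.predict({'input': 1.0}))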
The cell below shows the credentials code generated from the data assets tab:

credentials_1 = {
    'auth_url':'https://identity.open.softlayer.com',
    'project':'object_storage_9-----3',
    'project_id':'7babac2********e0',
    'region':'dallas',
    'user_id':'9603b8************70f',
    'domain_id':'2c66d***********b9d26',
    'domain_name':'1026***',
    'username':'member_******************',
    'password':""""""***************"""""",
    'container':'TemplateNotebooks',
    'tenantId':'undefined',
    'filename':'2001.csv'
}

Don't worry about the filename in this credentials dictionary, as we will define a function put_file that will use the important security credentials generated above along with the local mlmodel file to send it to Object Storage.

from io import BytesIO
import requests
import json

def put_file(credentials, local_file_name):
    """"""This functions returns a StringIO object containing the file content from Bluemix Object Storage V3.""""""
    f = open(local_file_name,'r')
    my_data = f.read()
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                if(e2['interface']=='public' and e2['region']=='dallas'):
                    url2 = ''.join([e2['url'],'/', credentials['container'], '/', local_file_name])
                    s_subject_token = resp1.headers['x-subject-token']
                    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
                    resp2 = requests.put(url=url2, headers=headers2, data = my_data)
                    print resp2

Calling put_file with your credentials and linear_model.mlmodel as the local filename will send your Core ML model into Object Storage. It is now available for the iOS/macOS application developer to access through Bluemix. You can find documentation on retrieving assets from Object Storage here . Now you can convert pre-trained machine learning models that you made in DSX and provide them to a software developer for use in iOS and macOS applications. Here is a link to the notebook in DSx where we ran this code. Please don't hesitate to contact me or Adam Massachi if you have any questions!

Originally published at datascience.ibm.com on June 9, 2017.","This post discusses how to implement Apple's new Core ML platform within DSX, which was announced a few days ago at WWDC 2017. 
Core ML is a platform that allows integration of powerful pre-trained…",Leverage Scikit-Learn Models with Core ML,Live,80 199,"Homepage Follow Sign in Get started Homepage * Home * About Insight * Data Science * Data Engineering * Health Data * AI * Javed Qadrud-Din Blocked Unblock Follow Following Nov 28 -------------------------------------------------------------------------------- TRANSFORM ANYTHING INTO A VECTOR ENTITY2VEC: USING COOPERATIVE LEARNING APPROACHES TO GENERATE ENTITY VECTORS Javed Qadrud-Din previously worked as a business architect at IBM Watson. At Insight, he developed a new method that allows businesses to efficiently represent users, customers, and other entities in order to better understand, predict, and serve them. Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program. -------------------------------------------------------------------------------- Businesses commonly need to understand, organize, and make predictions about their users and partners. For example, trying to predict which users will leave the platform (churn prediction), or identifying different types of advertising partners (clustering). The challenge comes from trying to represent these entities in a meaningful and compact way, to feed them into a machine learning classifier for example. I will be presenting the way I tackled this challenge below, all of the code is available on GitHub here . DRAWING INSPIRATION FROM NLP One of the most significant recent advances in Natural Language Processing (NLP) came from a team of researchers at Google ( Tomas Mikolov , Ilya Sutskever , Kai Chen , Greg Corrado , Jeffrey Dean ) created word2vec , which is a technique to represent words as continuous vectors called embeddings . The embeddings they trained on 100 billion words (and then open sourced) managed to capture much of the semantic meaning of the words they represent. For example, you can take the embedding for ‘king’, subtract the embedding for ‘man’, add the embedding for ‘woman’, and the result of those operations will be very close to the embedding for ‘queen’ — an almost spooky result that shows the extent to which the Google team managed to encode the meanings of human words. Mikolov et. al.Ever since, word2vec has been a staple of Natural Language Processing, providing an easy and efficient building block for many text based applications such as classification, clustering, and translation. The question I asked myself while at Insight was how techniques similar to word embeddings might be employed for other types of data, such as people or businesses. ABOUT EMBEDDINGS Let’s first think about what an embedding is. Physically, an embedding is just a list of numbers (a vector) that represent some entity. For word2vec, the entities were English words. Each word had its own list of numbers. These lists of numbers are optimized to be useful representations of the entities they stand for by adjusting them through gradient descent on a training task. If the training task requires remembering general information about the entities of interest, then the embeddings will end up absorbing that general information. 
-------------------------------------------------------------------------------- EMBEDDINGS FOR WORDS In the word2vec case, the training task involved taking a word (call it Word A) and predicting the probability that another word (Word B) appeared in a 10-word window around Word A somewhere in a massive corpus of text (100 billion words from Google News). Each word would have this done tens of thousands of times during training, with words that commonly appear around it, and words that never appear in the same context (a technique called negative sampling). This task forces the embedding for each word to encode information about the other words that co-occur with the embedded word. Words that co-occurred with similar sets of words would end up having similar embeddings. For example, the word ‘smart’ and the word ‘intelligent’ are often used interchangeably, so the set of words typically found around them in a large corpus will be a very similar set. As a result, the embeddings for ‘smart’ and ‘intelligent’ will be very similar to each other. Embeddings created with this task are forced to encode so much general information about the word, that they can be used to stand for the word in unrelated tasks. The Google word2vec embeddings are used in a wide range of natural language processing applications, such as sentiment analysis and text classification. There are also alternative word embeddings designed by other teams using different training strategies. Among the most popular are GloVe and CoVe . -------------------------------------------------------------------------------- EMBEDDINGS FOR ANYTHING Word vectors are essential tools for a wide variety of NLP tasks. But pre-trained word vectors don’t exist for the types of entities businesses often care the most about. Where there are pre-trained word2vec embeddings for words like ‘red’ and ‘banana’, there are no pre-trained word2vec embeddings for users of a social network, local businesses, or any other entity that isn’t frequently mentioned in the Google News corpus from which the word2vec embeddings were derived. Businesses care about their customers, their employees, their suppliers, and other entities for which there are no pre-trained embeddings. Once trained, vectorized representations of entities can be used as inputs to a wide range of machine learning models. For example, they could be used in models predicting which ads users are likely to click on, which university applicants are likely to graduate with honors, or which politician is likely to win an election. Entity embeddings allow us to accomplish these types of tasks by leveraging the bodies of natural language text associated with these entities that businesses frequently have. For example, we can create entity embeddings from the posts a user has written, the personal statement a university applicant wrote, or the tweets and blog posts people write about a politician. Any business that has entities paired with text could make use of entity embeddings, and when you think about it, most businesses have this one way or another: Facebook has users and the text they post or are tagged in, LinkedIn has users and the text of their profiles, Yelp has users and the reviews they write, along with businesses and the reviews written about them, Airbnb has places to stay along with descriptions and reviews, universities have applicants and the admission essays they write, and the list goes on. In fact, Facebook recently published a paper detailing an entity embedding technique. 
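To make that general recipe concrete before turning to the specifics of entity2vec, here is a rough sketch of the kind of model such a setup implies: a snippet of text (already reduced to a fixed-length vector) is scored against a handful of candidate entity embeddings, and the training signal is simply which candidate the snippet is really about. This is an illustration only, not the author's implementation (that lives in the GitHub repository linked earlier); the library (Keras), the layer sizes and the variable names are all assumptions.

import numpy as np
from keras.layers import Input, Dense, Embedding, Dot, Activation
from keras.models import Model

NUM_ENTITIES = 1000     # hypothetical number of entities being embedded
EMBEDDING_DIM = 64      # hypothetical embedding size
SNIPPET_DIM = 300       # hypothetical size of the text-snippet vector
NUM_CANDIDATES = 4      # one correct entity plus a few negative samples

# The text snippet, already summarised as a fixed-length vector.
snippet = Input(shape=(SNIPPET_DIM,))
# The ids of the candidate entities offered alongside this snippet.
candidates = Input(shape=(NUM_CANDIDATES,), dtype='int32')

snippet_vec = Dense(EMBEDDING_DIM)(snippet)                       # project the text into embedding space
entity_vecs = Embedding(NUM_ENTITIES, EMBEDDING_DIM)(candidates)  # trainable entity embeddings

# Score each candidate by its dot product with the snippet, then softmax over the candidates.
scores = Dot(axes=(1, 2))([snippet_vec, entity_vecs])
probs = Activation('softmax')(scores)

model = Model(inputs=[snippet, candidates], outputs=probs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Targets are the position (0..NUM_CANDIDATES-1) of the correct entity in each candidate list,
# so both the correct and the negatively sampled embeddings get gradient on every step.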
The aim with my entity2vec project was to find a way to use text associated with entities to create general-use embeddings that represent those entities. To do this, I used a technique somewhat similar to word2vec’s negative sampling to squeeze the information from a large body of text known to be associated with a certain entity into entity embeddings. EXAMPLE 1: FAMOUS PEOPLE To develop and test the technique, I tried training embeddings to represent prominent people (e.g. Barack Obama, Lady Gaga, Angelina Jolie, Bill Gates). Prominent people were a good starting point because, for these very famous peoples’ names, pre-trained Google word2vec embeddings exist and are freely available, so I’d be able to compare my embeddings’ performance against the word2vecs for those peoples’ names. Like with word2vec, I needed a training task that would force the entity embeddings to learn general information about the entities they stand for. I decided to train a classifier that would take a snippet of text from a person’s Wikipedia article and learn to guess who that snippet is about. The training task would take several entity embeddings as input and would output the position of the entity embedding that the text snippet is about. In the following example, the classifier would see as input a text snippet about Obama, as well as the embeddings for Obama, and three other randomly chosen people. The classifier would output a number representing which of its inputs is the Obama embedding. All of the embeddings would be trainable in each step, so, not only would the correct person embedding learn information about what that person is , but the other incorrect embeddings would also learn something about what their people are not . This technique seemed sensible intuitively, but, in order to validate my results, I needed to try the resulting embeddings out on some other tasks to see if they’d actually learned general information about their entities. To do this, I trained simple classifiers on several other tasks that took entity embeddings as inputs and outputted classifications like the gender or occupation of the entity. Here is the architecture of these classifiers: And here are the results obtained, compared against guessing and against doing the same thing with word2vec embeddings. My embeddings performed pretty much on-par with the word2vec embeddings even though mine were trained on much less text — about 30 million words vs 100 billion. That is four orders of magnitude less text required! -------------------------------------------------------------------------------- EXAMPLE 2: YELP BUSINESSES Next, I wanted to see if this technique was generalizable. Did it just work on people from Wikipedia, or does the technique work more generally? I tested it by trying exactly the same technique to train embeddings that represent businesses using the Yelp dataset. Yelp makes a slice of its dataset available online that contains businesses along with all the tips and reviews written about those businesses. I trained embeddings using precisely the same technique as I used with the Wikipedia people, except this time the text consisted of Yelp reviews about businesses and the entities were the businesses themselves. The task looked like this: Once trained, I tested the embeddings on a new task — figuring out which type of business a certain business was, e.g. CVS Pharmacy is in the ‘health’ category whereas McDonalds is in the ‘restaurants’ category. 
There were ten possible categories a business could fall into, and a single business could fall into multiple categories — so it was a challenging multi-label classification task with ten labels. The results, as compared with educated guessing, were as follows: This is a great result considering the difficulty of such a task! -------------------------------------------------------------------------------- Altogether, it was a successful experiment. I trained embeddings to capture the information in natural language text, and then I was able to get useful information back out of them by validating them on other tasks. Any business that has entities paired with text could use this technique, to be able to run predictive tasks on their data. NEXT STEPS AND CODE While these results are promising, the idea can be taken further by incorporating structured data into the embeddings along with text, which I will be looking to explore in the future. Anyone can now use this technique on their own data using a Python package I created and just a few lines of code. You can find the package on GitHub here . -------------------------------------------------------------------------------- Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program. Are you a company working in AI and would like to get involved in the Insight AI Fellows Program? Feel free to get in touch . Thanks to Emmanuel Ameisen .","Using cooperative learning approaches to generate entity vectors.",Transform anything into a vector,Live,81 202,"IBM WATSON MACHINE LEARNING: BUILD A LOGISTIC REGRESSION MODEL developerWorks TV
Published on Oct 3, 2017 This video shows how to create, train, save, and deploy a logistic regression model that assesses the likelihood that a customer of an outdoor equipment company will buy a tent based on age, sex, marital status and job profession.
","This video shows how to create, train, save, and deploy a logistic regression model using IBM Watson Machine Learning and IBM Data Science Experience that assesses the likelihood that a customer of an outdoor equipment company will buy a tent based on age, sex, marital status and job profession.",Build a logistic regression model with WML & DSX,Live,82 203,"COMPOSE'S FIRST GRAPH DATABASE: JANUSGRAPH Published Jun 15, 2017 graph janusgraph compose

At Compose we've always looked to ensure you can get the databases you need. Today, we are proud to announce that JanusGraph is coming to Compose and will bring with it the power of fully open source graph databases. JanusGraph is a new player in databases with a deep heritage. It builds on a fork of the Titan graph database, a previous leader in open source graph databases. That code is capable of being plugged into a number of different database backends. It's all then integrated with the database-agnostic Apache TinkerPop graph framework. The JanusGraph project itself is organized under the Linux Foundation and led by developers from Expero, Google, GRAKN.AI and IBM. And it's all open source, with new companies joining the community to enhance JanusGraph. At Compose, we've worked with IBM's JanusGraph developers to combine Compose's one-click deployment, high-availability, managed database platform with JanusGraph. A great graph database demands a great backend, and we've teamed it with Scylla, the high-performance Cassandra-compatible database, for best reliability. Then we added our automated backup system, private VLAN configuration and HAProxy-managed access to give peace of mind. That means that from today, Compose users can deploy the industry-leading graph database from their Compose account.

WHY A GRAPH DATABASE?

Graph databases model the world as nodes and directed connections - vertices and edges, as graph theory calls them. Both can have properties associated with them, and both are equal elements in how the database is managed and queried. A query on a graph database can start at a point and explore the connections around it, so you can say ""I'm looking for any person who likes brand X who has friends or friends of friends who buy brand Y and Z"". Relational databases typically treat relations as a simple connection between one row and another, or demand that you add another table to associate data with the relationship. That means that when you want to query across relationships and examine the network that exists, you have to do a lot of expensive queries. A graph database as part of your data layer allows you to understand and explore relationships and networks within your data without compromising the performance of your production relational and document stores.

JANUSGRAPH ON COMPOSE

We're launching JanusGraph on Compose as a beta as we build the functionality around it. You'll find it ready to deploy in the beta section of the Create Deployment view of Compose. If you haven't discovered Compose yet, you can sign up for a free 30-day trial below. If you want to learn more about JanusGraph on Compose, check out our JanusGraph documentation .
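To give a flavor of what the kind of traversal described above looks like in practice, here is a small sketch using the Apache TinkerPop gremlinpython client. It is illustrative only: the connection URL, graph alias, labels and property names are all hypothetical, and the JanusGraph documentation mentioned above is the place to look for the actual connection details for a Compose deployment.

from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical endpoint and alias -- take the real values from your deployment.
g = Graph().traversal().withRemote(
    DriverRemoteConnection('wss://example.composedb.com:16916/gremlin', 'g'))

# People who like brand X and have a friend, or a friend of a friend, who buys brand Y.
people = (g.V().has('brand', 'name', 'X').in_('likes')
           .where(__.repeat(__.out('friend')).times(2).emit()
                    .out('buys').has('name', 'Y'))
           .values('name').toList())
print(people)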
Try Compose free for 30 days

If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Nayuki

Josh Mintz is an Offering Manager at IBM Watson Data Platform. He has an enthusiasm for homemade hummus, foreign policy, and the English Premier League.","Today, we are proud to announce that JanusGraph is coming to Compose and will bring with it the power of fully open source graph databases.",Compose's first graph database: JanusGraph,Live,83 204,"
Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata for Analytics to dashDB * From Neteeza to dashDB: It’s That Easy! * Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API LOAD TWITTER DATA INTO DASHDBJess Mantaro / July 17, 2015This video shows how easy it is to consume Twitter data with IBM dashDB forfurther analytics.You can also read a transcript of this videoTry the tutorialRELATED LINKS * Get the codePlease enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM",Watch how easy it is to consume Twitter data with IBM dashDB for further analytics.,Tutorial: How to load Twitter data in IBM dashDB ,Live,84 207,"RStudio Blog * Home * Subscribe to feed TESTTHAT 1.0.0 April 29, 2016 in Packages testthat 1.0.0 is now available on CRAN. Testthat makes it easy to turn your existing informal tests into formal automated tests that you can rerun quickly and easily. Learn more at http://r-pkgs.had.co.nz/tests.html . Install the latest version with: install.packages(""testthat"") This version of testthat saw a major behind the scenes overhaul. This is the reason for the 1.0.0 release, and it will make it easier to add new expectations and reporters in the future. As well as the internal changes, there are improvements in four main areas: * New expectations. * Support for the pipe. * More consistent tests for side-effects. * Support for testing C++ code. These are described in detail below. For a complete set of changes, please see the release notes . IMPROVED EXPECTATIONS There are five new expectations: * expect_type() checks the base type of an object (with typeof() ), expect_s3_class() tests that an object is S3 with given class, and expect_s4_class() tests that an object is S4 with given class. I recommend using these more specific expectations instead of the generic expect_is() , because they more clearly convey intent. * expect_length() checks that an object has expected length. * expect_output_file() compares output of a function with a text file, optionally update the file. This is useful for regression tests for print() methods. A number of older expectations have been deprecated: * expect_more_than() and expect_less_than() have been deprecated. Please use expect_gt() and expect_lt() instead. * takes_less_than() has been deprecated. * not() has been deprecated. Please use the explicit individual forms expect_error(..., NA) , expect_warning(.., NA) , etc. 
We also did a thorough review of the documentation, ensuring that related expectations are documented together. PIPING Most expectations now invisibly return the input object . This makes it possible to chain together expectations with magrittr: factor(""a"") %>% expect_type(""integer"") %>% expect_s3_class(""factor"") %>% expect_length(1) To make this style even easier, testthat now imports and re-exports the pipe so you don’t need to explicitly attach magrittr. SIDE-EFFECTS Expectations that test for side-effects (i.e. expect_message() , expect_warning() , expect_error() , and expect_output() ) are now more consistent: * expect_message(f(), NA) will fail if a message is produced (i.e. it’s not missing), and similarly for expect_output() , expect_warning() , and expect_error() .quiet <- function() {} noisy <- function() message(""Hi!"") expect_message(quiet(), NA) expect_message(noisy(), NA) #> Error: noisy() showed 1 message. #> * Hi! * expect_message(f(), NULL) will fail if a message isn’t produced, and similarly for expect_output() , expect_warning() , and expect_error() .expect_message(quiet(), NULL) #> Error: quiet() showed 0 messages expect_message(noisy(), NULL) There were three other changes made in the interest of consistency: * Previously testing for one side-effect (e.g. messages) tended to muffle other side effects (e.g. warnings). This is no longer the case. * Warnings that are not captured explicitly by expect_warning() are tracked and reported. These do not currently cause a test suite to fail, but may do in the future. * If you want to test a print method, expect_output() now requires you to explicitly print the object: expect_output(""a"", ""a"") will fail, expect_output(print(""a""), ""a"") will succeed. This makes it more consistent with the other side-effect functions. C++ Thanks to the work of Kevin Ushey , testthat now includes a simple interface to unit test C++ code using the Catch library. Using Catch in your packages is easy – just call testthat::use_catch() and the necessary infrastructure, alongside a few sample test files, will be generated for your package. By convention, you can place your unit tests in src/test-.cpp . Here’s a simple example of a test file you might write when using testthat + Catch: #include

This is a test page for some interesting content

Click here!

Next, let’s put together some JavaScript that will listen to mouse events. For now, we'll log out the event directly and see what it gives us: document.addEventListener('click', function(event) { console.log(event); }); Simple enough - a single event listener that listens for clicks throughout our entire document. Now, when we click on the screen, we should see a MouseEvent in the logs that looks something like this: MouseEvent altKey:false bubbles:true button:0 ... returnValue:true screenX:189 screenY:590 ... pageX: 189 pageY: 830 ... toElement:div type:""click"" view:Window which:1 x:180 y:97 We’ll ignore almost all of these fields, but there are a few that are interesting to us with our click tracker. There are a few x and y coordinates, but the ones we’re the most interested in are the pageX and pageY fields, which represent the location on the website that the click occurred, irrespective of scrolling and viewport size. This means that a click on the site at the pageX location will always occur at that exact pixel location no matter how the browser is sized or how far down the user is scrolled when they click. We’ll also want to track the timestamp and toElement methods so we can search for which element was under the mouse when it was clicked. We'll want to associate all of the click events that occurred on each load of the site. We can do this by generating a random ID each time the user loads the page. The snippet of code can give us a workable random session ID: var generateRandomSessionId = function() { return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) { var r = Math.random() * 16 | 0, v = c == 'x' ? r : r } We’ll also want to create a timestamp using the JavaScript Date object: var timestamp = new Date.now(); The resulting object that we’ll store to represent each click looks like the following: { ""x"": 100, ""y"": 100, ""timestamp"": 150000020302, ""sessionId"": 'a53cdbe2-acd1-4231-a331-fc3280d42ef1' } Let’s put these all together into a snippet we can install on our site. (function() { var generateRandomSessionId = function() { return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) { var r = Math.random() * 16 | 0, v = c == 'x' ? r : r SETTING UP ELASTICSEARCH Now that we know what we want to store, let’s start pushing that data over to Elasticsearch. Elasticsearch uses a RESTFul API, but we don’t want to push our clicks directly from the browser since our URL includes our Elasticsearch credentials. To fix both of these, we’ll create a simple Node.JS application, following along from an earlier article on using Elasticsearch from Node.JS and use Express to handle our own API. Let’s start by creating a new Compose Elasticsearch deployment that we can push our click data out to. Then, create your Elasticsearch user and find your connection string on the Deployment Overview page. Once you have a valid connection string, we can start sending our RESTful API calls over to Elasticsearch. You'll need npm and the following Node modules to get this working: * elasticsearch * get-json * express * body-parser Install the modules using npm: npm install express body-parser elasticsearch get-json We’ll use the technique presented in the Getting Started guide to create a client.js and info.js , which we can use to make connections to our Elasticsearch deployment and get info about the deployment. 
They should look like the following: // client.js var elasticsearch=require('elasticsearch'); var client = new elasticsearch.Client( { hosts: [ 'https://[username]:[password]@[server]:[port]/', 'https://[username]:[password]@[server]:[port]/' ] }); module.exports = client; // info.js var client = require('./client.js'); client.cluster.health({},function(err,resp,status) { console.log(""-- Client Health --"",resp); }); Let’s use our new info.js file to do a quick check before we move forward. Type the following into the terminal: node info.js You should see a response that looks like this: -- Client Health -- { cluster_name: 'el-petitions', status: 'green', timed_out: false, number_of_nodes: 3, number_of_data_nodes: 3, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0 } If you get an error message (usually in HTML format) then double-check your connection credentials and make sure you’ve added a user / password for your deployment. CREATING AN INDEX An Index in Elasticsearch is different than you might be expecting - it’s more analogous to a Table in relational databases or a Collection in MongoDB. We can create the index a number of different ways, but here we’ll follow the Getting Started guide and do this in NodeJS. Create a new file called “init.js” and add the following: // init.js var client = require('./client'); client.indices.create({ index: 'clicks' },function(err,resp,status) { if(err) { console.log(err); } else { console.log(""create"",resp); } }); Run your new init.js file: node init.js And you should get the following response: create { acknowledged: true } Finally, let’s create the expressjs app that our frontend will call to save clicks to our database. We’ll use the Elasticsearch index call to add clicks to our index. // app.js var express = require('express'), app = express(), bodyParser = require('body-parser'), client = require('./client'), path = require('path'); app.use(bodyParser.json()); app.post('/registerClick', function(req, res) { client.index({ index: 'clicks', id: '1', type: 'click', body: req.body },function(err,resp,status) { res.send(resp); }); }); app.get('/', function(req, res) { res.sendFile(path.join(__dirname, 'index.html')); }); app.listen(process.env.PORT || 8080); Our app creates two routes, a /registerClick route where we’ll send our clicks to, and a / route which renders our HTML. You can access the site by running the following: node app.js And then opening http://localhost:8080 in your web browser. Right now anyone can send click events to our app, so when we’re ready to take this live we’ll probably want to add some security measures to make sure that only requests from the same server are allowed (ie: so someone can’t send bad click data into our app), but we won’t cover that for now. CONNECTING THE CLICK TRACKER TO NODE Now that you have your backend set up, let’s send our clicks back to our server so it can relay them on to Elasticsearch. For this article, we’ll include the JQuery library and it’s .ajax method to make our RESTful a little more readable. Add JQuery to the of your HTML file: ... ... Then, let’s update our snippet so that an ajax call is made every time a click is detected: (function($) { ... 
var clickApp = { trackClick: function(evt) { var click = { ""x"": evt.pageX, ""y"": evt.pageY, ""sessionId"": generateRandomSessionId(), ""timestamp"": Date.now() } $.post(""/registerClick"", click).then(function(response) { console.log(response); }); } } document.addEventListener('click', function(event) { clickApp.trackClick(event); }); ... })(jQuery); This snippet generates an AJAX POST request and sends the click data directly over to our NodeJS web application. We’re also logging out the response we get back, so we should be able to determine whether our click tracker is working. Finally, run your application again using node app.js , navigate to http://localhost:8080 in your browser and start clicking around. In the developer console of your browser, you should see something like the following: created:true _id:""AV3NW08fEajW3QBwsZU2"" _index:""clicks"" _shards:Object _type:""click"" _version:1 __proto__:Object The created: true is what you’re looking for - this means that your click was created successfully. You can head back to the Elasticsearch browser and click on your index to confirm: WRAPPING UP Now that you have your website clicks being tracked, you can add the Kibana plugin and start looking at which regions of your website are being clicked on the most often. In our next article, we’ll look at how to use this click data to generate a heat map of clicks and overlay them onto an image of our website. John O'Connor is a code junky, educator, and amateur dad that loves letting the smoke out of gadgets, turning caffeine into code, and writing about it all. Love this article? Head over to John O'Connor ’s author page to keep reading.CONQUER THE DATA LAYER Spend your time developing apps, not managing databases. Try Compose for Free for 30 DaysRELATED ARTICLES Feb 3, 2017NEWSBITS: SCYLLADB 1.6, GITLAB DB TROUBLES, ELASTICSEARCH 5.2, NODE 7.5.0, AND MORE NewsBits for week ending February 3rd: The release of ScyllaDB 1.6 RC1, Gitlab shuts down temporarily due to data troubles, R… John O'Connor Feb 1, 2017BUILDING SECURE DISTRIBUTED JAVASCRIPT MICROSERVICES WITH RABBITMQ AND SENECAJS To take Microservices into production, you need to make sure they are communicating securely and reliably. We explore using R… John O'Connor Oct 28, 2016NEWSBITS: ELASTICSEARCH 5.0, NODE 7.0, LAMBDA GO, SWIFT AND NO BATTERY TRANSISTORS Compose NewsBits for the week ending October 28th - Elasticsearch 5.0.0 released, Node 7.0.0 released, an AWS Lambda framewor… Hays Hutton Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","Website Engagement Tracking is a technique that allows businesses to see which parts of their website users are visiting, clicking on, and viewing. In this article, we'll take a look at tracking user engagement using Elasticsearch on Compose.",Website Engagement Tracking with Elasticsearch,Live,91 233,"* Free 7-Day Crash Course * Blog * Masterclass 9 MISTAKES TO AVOID WHEN STARTING YOUR CAREER IN DATA SCIENCE EliteDataScience 0 Comments June 23, 2017 Share Google Linkedin TweetIf you wish to begin a career in data science, you can save yourself days, weeks, or even months of frustration by avoiding these 9 costly beginner mistakes. If you’re not careful, these mistakes will eat away at your most valuable resources: your time, energy, and motivation. 
We’ve broken them into three categories: * Mistakes while learning data science * Mistakes when applying for a job * Mistakes during job interviews WHILE LEARNING DATA SCIENCE The first set of mistakes are ""undercover"" and they're hard to spot. They slowly but surely drain your time and energy without giving you warning, and they spawn from the misconceptions surrounding this field. 1. SPENDING TOO MUCH TIME ON THEORY. Many beginners fall into the trap of spending too much time on theory, whether it be math related (linear algebra, statistics, etc.) or machine learning related (algorithms, derivations, etc.). This approach is inefficient for 3 main reasons: * First, it's slow and daunting. If you've ever felt overwhelmed by all there is to learn, you've likely sunk into this trap. * Second, you won't retain the concepts as well. Data science is an applied field, and the best way to solidify skills is by practicing. * Finally, there's a greater risk that you'll become demotivated and give up if you don't see how what you're learning connects to the real world. This theory-heavy approach is traditionally taught in academia, but most practitioners can benefit from a more results-oriented mindset. To avoid this mistake: * Balance your studies with projects that provide you hands-on practice. * Learn to be comfortable with partial knowledge. You'll naturally fill in the gaps as you progress. * Learn how each piece fits into the big picture (covered in our free 7-day crash course) . 2. CODING TOO MANY ALGORITHMS FROM SCRATCH. This next mistake also causes students to miss the forest for the trees. At the start, you really don't need to code every algorithm from scratch. While it's nice to implement a few just for learning purposes, the reality is that algorithms are becoming commodities. Thanks to mature machine learning libraries and cloud-based solutions, most practitioners actually never code algorithms from scratch. Today, it's more important to understand how to the apply the right algorithms in the right settings (and in the right way). To avoid this mistake: * Pick up general-purpose machine learning libraries, such as Scikit-Learn (Python) or Caret (R) . * If you do code an algorithm from scratch, do so with the intention of learning instead of perfecting your implementation. * Understand the landscape of modern machine learning algorithms and their strengths and weaknesses. 3. JUMPING INTO THE DEEP END. Some people enter this field because they want to build the technology of the future: Self-Driving Cars, Advanced Robotics, Computer Vision, and so on. These are powered by techniques such as deep learning and natural language processing. However, it's important to master the fundamentals. Every olympic diver needed to learn how to swim first, and so should you. To avoid this mistake: * First, master the techniques and algorithms of ""classical"" machine learning, which serve as building blocks for advanced topics. * Know that classical machine learning still has incredible untapped potential. While the algorithms are already mature, we are still in the early stages of discovering fruitful ways to use them. * Learn a systematic approach to solving problems with any form of machine learning (covered in our free 7-day crash course) . Don't try this at home (until you have plenty of practice) WHEN APPLYING FOR A JOB This next set of mistakes can cause you to miss some great opportunities during the job search process. 
Even if you're well qualified, you can maximize your results by avoiding these hiccups. 4. HAVING TOO MUCH TECHNICAL JARGON IN A RESUME. The biggest mistake many applicants make when writing their resume is suffocating it with technical jargon. Instead, your resume should paint a picture and your bullet points should tell a story. Your resume should advocate the impact you could bring to an organization, especially if you're applying for entry-level positions. To avoid this mistake: * Do not simply list the programming languages or libraries you've used. Describe how you used them and explain the results. * Less is more. Think about the most important skills to emphasize and give them the space to shine by removing other distractions. * Make a resume master template so you can spin off different versions that are tailored to different roles. This keeps each version clean. 5. OVERESTIMATING THE VALUE OF ACADEMIC DEGREES. Sometimes, graduates can overestimate the value of their education. While a strong degree in a related field can definitely boost your chances, it's neither sufficient nor is it usually the most important factor. To be clear, we're not saying graduates are arrogant... In most cases, what's taught in an academic setting is simply too different from the machine learning applied in businesses. Working with deadlines, clients, and technical roadblocks necessitate practical tradeoffs that are not as urgent in academia. To avoid this mistake: * Supplement coursework with plenty of projects using real-world datasets . * Learn a systematic approach to solving problems with machine learning (covered in our free 7-day crash course ). * Take relevant internships, even if they are part-time. * Reach out to local data scientists on LinkedIn for coffee chats. 6. SEARCHING TOO NARROWLY. Data science is a relatively new field, and organizations are still evolving to accommodate the growing impact of data. You'd be limiting yourself if you only search for ""Data Scientist"" openings. Many positions are not labeled as ""data science,"" but they'll allow you to develop similar skills and function in a similar role. To avoid this mistake: * Search by required skills (Machine Learning, Data Visualization, SQL, etc.). * Search by job responsibilities (Predictive Modeling, A/B Testing, Data Analytics, etc.). * Search by technologies used in the role (Python, R, Scikit-Learn, Keras, etc.). * Expand your searches by job title (Data Analyst, Quantitative Analyst, Machine Learning Engineer, etc.). Source: Cyanide and Happiness DURING THE INTERVIEW The last set of mistakes are stumbling blocks during the interview. You've already done the hard work to get to this step, so now it's time to finish strong. 7. BEING UNPREPARED TO DISCUSS PROJECTS. Having projects in your portfolio serves as a major safety net for ""how would you"" type interview questions. Instead of speaking in hypotheticals, you'll be able to point to concrete examples of how you handled certain situations. In addition, many hiring managers will specifically look for your ability to be self-sufficient because data science roles naturally include elements of project management. That means you should understand the entire data science workflow and know how to piece everything together. To avoid this mistake: * Complete end-to-end projects that allow you to practice every major step (i.e. Data Cleaning, Model Training, etc.). * Organize your methodology. Data science should be deliberate, not haphazard. 
* Review and practice describing past projects from any internships, jobs, or classes you've taken. 8. UNDERESTIMATING THE VALUE OF DOMAIN KNOWLEDGE. Developing technical skills and machine learning knowledge are the basic prerequisites for landing a data science position. However, to truly stand out above the competition, you should learn more about the specific industry you'll be applying your skills to. Remember, data science never exists in a vacuum. To avoid this mistake: * If you're interviewing for a position at a bank, brush up on some basic finance concepts. * If you're interviewing for a strategy position at a Fortune 500, practice a few case interviews and learn about drivers of profitability. * If you're interviewing for a startup, learn about its market and try to discern how it will gain a competitive edge. * In short, taking a little bit of extra initiative here can pay big dividends! 9. NEGLECTING COMMUNICATION SKILLS. Currently, in most organizations, data science teams are still very small compared to developer teams or analyst teams. So while an entry-level software engineer will often be managed a senior engineer, data scientists tend to work in more cross-functional settings. Interviewers will look for your ability to communicate with colleagues of various technical and mathematical backgrounds. To avoid this mistake: * Practice explaining technical concepts to non-technical audiences. For example, try explaining your favorite algorithm to a friend. * Prepare bullet point responses to common interview questions and practice delivering your answers. * Practice analyzing various datasets, extracting key insights, and presenting your findings. CONCLUSION In this guide, you learned practical tips for avoiding the 9 costliest mistakes by data science beginners: 1. Spending too much time on theory. 2. Coding too many algorithms from scratch. 3. Jumping into advanced topics, e.g. deep learning, too quickly. 4. Having too much technical jargon in a resume. 5. Overestimating the value of academic degrees. 6. Searching too narrowly for jobs. 7. Being unprepared to discuss projects during interviews. 8. Underestimating the value of domain knowledge. 9. Neglecting communication skills. To jumpstart your journey ahead, we invite you to sign up for our free 7-day email crash course on applied machine learning . You'll get exclusive lessons that aren't covered on our blog. For more over-the-shoulder guidance, we also offer a comprehensive Machine Learning Masterclass that will teach you data science while allowing you to build an impressive portfolio along the way. Share Google Linkedin TweetLEAVE A RESPONSE CANCEL REPLY Name* Email* Website* Denotes Required Field RECOMMENDED READING * 9 Mistakes to Avoid When Starting Your Career in Data Science * WTF is the Bias-Variance Tradeoff? 
(Infographic) * Free Data Science Resources for Beginners * Dimensionality Reduction Algorithms: Strengths and Weaknesses * Modern Machine Learning Algorithms: Strengths and Weaknesses * The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All * The 5 Levels of Machine Learning Iteration Copyright © 2017 · EliteDataScience.com · All Rights Reserved * Home * Terms of Service * Privacy Policy","If you wish to begin a career in data science, you can save yourself days, weeks, or even months of frustration by avoiding these 9 costly beginner mistakes.",9 Mistakes to Avoid When Starting Your Career in Data Science,Live,92 237,"Homepage IBM Watson Data Lab Follow Sign in / Sign up * Share * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates Lorna Mitchell Blocked Unblock Follow Following Developer Advocate at IBM. Technology addict, open source fanatic and incurable blogger (see http://lornajane.net) 13 mins ago -------------------------------------------------------------------------------- DEPLOY YOUR PHP APPLICATION TO BLUEMIX Deploying to the cloud can save so much time and hassle with commissioning and setting up servers, it’s no surprise that we’re seeing many more organisations take this approach for some or all of their web properties. In this post we’ll take a standard PHP application and deploy it to Bluemix. You’ll learn how to: * prepare your project for the cloud * put the databases or other storage elements in place * safely convey your work to its new home The example application here is a simple web-based Guestbook which every self-respecting website had once upon a time (about 20 years ago!). All the code for the project is on GitHub to make it easy to look at the examples here in the context of a real project. The Bluemix platform offers a 30-day free trial so you have a chance to try deploying your own application before handing over your credit card details. PREPARE FOR CLOUD When we deploy a PHP application, Bluemix automatically realises it’s a PHP project and uses the PHP buildpack to run it. The “buildpacks” are a Cloud Foundry term, meaning a collection of tools needed to run a particular tech stack. The PHP buildpack provides PHP with a selection of extensions and Composer already included. You can also specify other, non Bluemix, buildpacks available for Cloud Foundry if you want, and they’re used (specify this in the manifest.yml which we'll cover later on. Keep reading!). In contrast to a server where we choose which version of PHP to install, the buildpacks typically support multiple versions of PHP so we need to indicate which one this project should use by specifying this in composer.json if it's not already there. For my project, I needed to add this line to the require block in composer.json : ""php"" : "">=7.0"", This lets the buildpack know what my PHP dependency is, along with my other project dependencies since Composer runs when we deploy the application. The other change we need to make, is to specify where the webroot should be. We add this information to a file called .bp-config/options.json , which the buildpack automatically reads. There's a number of options that you can configure in this file, but we'll stick with the webroot for now. I set my webroot to the directory public/ which is where index.php is: Now that the PHP is ready, we move on to putting in the dependencies that the rest of the project needs. 
GET READY TO BLUEMIX At this point, it’s time to install the tools needed to work with Bluemix: * Cloud Foundry ( cf )provides all the commands you need to build and manage your app. * Bluemix CLI is useful for platform-specific tasks like managing regions, spaces, Bluemix virtual machines, and your account. These tools have a lot of overlap; everything we do today we can achieve with the cf command alone, however the bluemix tool has some bluemix-specific additions which you may find useful as your own projects grow. Once the commands are installed, we can create the services that our application depends on. Mine needs the Cloudant NoSQL database and RabbitMQ; if yours uses PostgreSQL or MySQL instead, or indeed any of the other services, the process will look pretty similar. You’ll need your Bluemix account details handy at this point — sign up for the free trial if you don’t already have an account.Before we can deploy an app to Bluemix, we need to create a space for it to deploy into. There are a few steps to this since Bluemix has multiple regions, organisations, and users, as well as spaces — which makes it really easy to be very organised with large applications but for our first PHP deployment might feel a little bit heavy I admit! 1. Pick a region and set the API endpoint, e.g. for US South use bluemix api https://api.ng.bluemix.net 2. Now log in using the command bluemix login . You'll be prompted for your username and password. 3. Check the list of organisations in your account with cf orgs and switch to the one you want to use by doing cf target -o [org] , replacing [org] with the organisation name you want. 4. Create a new space to deploy to: cf create-space Dev and target it cf target -s Dev 5. Verify that this all made some sense by doing cf target and make sure that your user, organisation and space settings are as you expect. We made it! The tools are ready, so we’ll go ahead and set up the services that my application needs. COMMISSION THE SERVICES FOR YOUR APP My application needs two services: a Cloudant NoSQL Database (the Bluemix name for CouchDB the awesome document database) and RabbitMQ (no fancy names there). Before I deploy, I will use the cf tool to create these services so that my application can use them. First up, I’ll create the Cloudant service. The cf help create-service command tells me that I need to specify the service, the plan and then name my service. To find out the exact service and plan names, use the command cf marketplace (beware it can take quite a long time to return as it has a lot of information to find!). For me, the command is: cf create-service cloudantNoSQLDB Lite guestbook-db Now when I run cf services I can see the database listed there. You can also see it by going to your Bluemix dashboard in the correct region/organisation/space combination. For most services, you can access their administrative interfaces from here. To set up RabbitMQ, I enter the following command: cf create-service compose-for-rabbitmq Standard guestbook-messages Again, cf services shows my new addition and at this point I have the pieces my application depends on. Be careful with your create-service commands. Both the services and the plans are case sensitive which can trip you up very easily! 
If you see errors, double-check you have everything spelled correctly, including case.There are two next steps: create a manifest file to describe how to deploy the application, and change the PHP code to know how to access these services we just configured using the Bluemix environment variables. We’ll do the manifest file next but if you were thinking there is a missing link, you’re definitely keeping up! On we go … CREATE THE MANIFEST FILE AND DEPLOY You usually describe how to deploy an application with a manifest file called manifest.yml . If you looked at the GitHub project, you can see that this application has other applications (and a dev platform setup) in addition to this PHP application. I'm putting my manifest file in the directory that contains just this PHP application, which is a couple of levels down from the root of the project, so here we're working in src/web . My manifest file simply names my application, allocates it some RAM, and states which services it needs available to it. Here it is: This contains all the information that Bluemix needs to run my application. (The application name must be unique across the whole of the bluemix region, so you need to change the name in your version of this file.) Now we’re ready to deploy. The moment of truth: run cf push to deploy your application to the cloud ... Hopefully that all worked well, and you see your application working through its deployment steps, installing any dependencies. Eventually you see an information block including a urls field. Go to that URL and you should see your project. In fact, you'll probably see an error message from your project because we didn't explain to PHP how to connect to the services we made earlier on, so we'll deal with that next. HANDY COMMANDS This seems like a good time to share a couple of commands that I use a lot when setting up an application for the first time: * cf logs --recent guestbook-web shows the last few lines of the logs from your application, including its deployment. If there's a syntax error in your PHP application, you'll see it here. You can also use just cf logs guestbook-web to see the logs as they happen (replace with your own application name as appropriate). * cf env guestbook-web shows the environment variables available to your application, and since we're about to connect our PHP to the services we created, this is very handy indeed! In particular we'll be looking at the VCAP_SERVICES environment variable as it contains the information about the services that this application has access to. I made my own quick reference card for keeping track of the commands, so feel free to check that out as you go along too! CONNECT PHP TO SERVICES From using the cf env command in the previous section, you can hopefully already see the data you need to plug your PHP application in to its services by accessing the VCAP_SERVICES variable. It's JSON-encoded so in PHP, I use this line to get the configuration into an array I can use: Once you have this, feel free to use var_dump or something to look at the structure, but essentially there's an array element per service type, with an array element inside for each actual service of that type. We need to amend our existing application so that when we detect we're running on Bluemix, we use the Bluemix variables, and otherwise we fall back to whatever our usual configuration process is. 
For instance, my example app connects to CouchDB (on the local development platform) or Cloudant (on Bluemix) with a block of code like this (from config.php ): The RabbitMQ setup is a bit more complicated since it uses a certificate and the RabbitMQ library I’m using (php-amqplib) expects the parameters all separate whereas Bluemix sends a complete URL with them already assembled. So the code for connecting to RabbitMQ looks like this (also from config.php ): Looking at this RabbitMQ configuration code, you can see that we first grab the contents of the VCAP_SERVICES environment variable, but then we need to grab the URL for the RabbitMQ service. Once we have it, we break it apart using parse_url() to get the arguments we'll need later to construct the RabbitMQ connection. The path piece has a leading slash on it which we don't need, so the substr() function sorts that out for us. We also need to grab the certificate needed for the SSL connection. (If you are deploying to Bluemix and using RabbitMQ, take a look in the dependencies.php file in the GitHub project for where the AMQP objects are actually instantiated. There are a couple of extra options to set that might help you succeed.) RUN PHP IN THE CLOUD The joy of the cloud is that you can just create new applications (and clean up those you don’t need any more) without ordering hardware, installing servers, or keeping up with patching. I’ve found that PHP applications are pretty happy on these cloud platforms and having this sort of setup lets the apps cost as little as necessary but scale as much as demand dictates — capacity planning is a hard problem and it means not having spare servers “just in case”. I hope that by sharing my steps above, I’ve shown you what you need to get your own PHP application into the cloud, and perhaps convinced you to give it a try. Bluemix Deployment PHP Cloud Computing Tutorial Blocked Unblock Follow FollowingLORNA MITCHELL Developer Advocate at IBM. Technology addict, open source fanatic and incurable blogger (see http://lornajane.net ) FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform.",Read how to take a standard PHP application and deploy to Bluemix.,Deploy Your PHP Application to Bluemix,Live,93 239,"Skip navigation Upload Sign in SearchLoading... Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. WATCH QUEUE QUEUE Watch Queue Queue * Remove all * Disconnect 1. Loading... Watch Queue Queue __count__/__total__ Find out why CloseARMAND RUIZ GABERNET, IBM - BIGDATANYC #BIGDATANYC 2016 #THECUBE SiliconANGLE Subscribe Subscribed Unsubscribe 6,734 6KLoading... Loading... Working... Add toWANT TO WATCH THIS AGAIN LATER? Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics 166 views 2LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 3 0DON'T LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 1Loading... Loading... TRANSCRIPT The interactive transcript could not be loaded.Loading... Loading... Rating is available when the video has been rented. This feature is not available right now. Please try again later. 
Published on Sep 28, 2016Armand Ruiz Gabernet (@armand_ruiz), Lead Product Manager - IBM Data Science Experience, IBM, sits down with Dave Vellante & Jeff Frick on the #theCUBE at #BigDataNYC 2016, New York, NY * CATEGORY * Science & Technology * LICENSE * Creative Commons Attribution license (reuse allowed) Show more Show lessLoading... Autoplay When autoplay is enabled, a suggested video will automatically play next.UP NEXT * IBM DataFirst Launch Event Keynote - #DataFirst - #theCUBE - Duration: 1:28:16. SiliconANGLE 65 views * New 1:28:16 -------------------------------------------------------------------------------- * Nik Green, Delhaize America & Kevin McIntyre, IBM - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 17:19. SiliconANGLE 45 views * New 17:19 * Ritika Gunnar , IBM - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 14:35. SiliconANGLE 14 views * New 14:35 * Cory Minton, DellEMC & Simeon Yep, Splunk - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 14:32. SiliconANGLE 63 views * New 14:32 * Robert Herjavec & Atif Ghauri, Herjavec Group - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 24:16. SiliconANGLE 20 views * New 24:16 * Chris Kammerman, Shazam - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 15:46. SiliconANGLE 47 views * New 15:46 * Matt Kraft, Dunkin' Brands - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 18:07. SiliconANGLE 24 views * New 18:07 * Snehal Antani, Splunk - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 16:21. SiliconANGLE 23 views * New 16:21 * Shay Mowlem, Splunk - Splunk .conf2016 - #splunkconf2016 - #theCUBE - Duration: 16:48. SiliconANGLE 31 views * New 16:48 * Haiyan Song & Monzy Merza, Splunk - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 20:52. SiliconANGLE 7 views * New 20:52 * Chuck Yarbrough, Pentaho A Hitachi Group Company - Big Data NYC - #BigDataNYC - #theCUBE - Duration: 16:20. SiliconANGLE 4 views * New 16:20 * Day 2 Kickoff - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 12:17. SiliconANGLE 3 views * New 12:17 * Tendu Yogurtcu, Synsort - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 19:18. SiliconANGLE 18 views * New 19:18 * Josh Rogers, Syncsort - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 12:58. SiliconANGLE 35 views * New 12:58 * Ram Varadarajan - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 15:15. SiliconANGLE 13 views * New 15:15 * Steve Hatch, Cox Automotive - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 10:41. SiliconANGLE 2 views * New 10:41 * Wei Wang & Matt Morgan, Hortonworks - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 16:49. SiliconANGLE 8 views * New 16:49 * Michael Dell, Dell Technologies - #VMworld 2016 #theCUBE - Duration: 15:37. SiliconANGLE 3,263 views 15:37 * Tom Gerhard, Priceline - Splunk .conf2016 - #splunkconf2016 - #theCUBE - Duration: 12:25. SiliconANGLE 24 views * New 12:25 * Greg Sands & Jim Wilson - Oracle OpenWorld - #oow16 - #theCUBE - Duration: 26:17. SiliconANGLE 31 views * New 26:17 * Loading more suggestions... * Show more * Language: English * Content location: United States * Restricted Mode: Off History HelpLoading... Loading... Loading... * About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Try something new! * Loading... Working... 
Sign in to add this to Watch LaterADD TO Loading playlists...","Armand Ruiz Gabernet (@armand_ruiz), Lead Product Manager - IBM Data Science Experience, IBM, sits down with Dave Vellante & Jeff Frick on the #theCUBE at #B...","Armand Ruiz Gabernet, IBM - BigDataNYC #BigDataNYC 2016 #theCUBE",Live,94 242,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * BLOG Welcome to the Big Data University Blog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (December 20, 2016) * This Week in Data Science (December 13, 2016) * New York Data Science Bootcamp And Validated Badges * This Week in Data Science (December 06, 2016) * This Week in Data Science (November 29, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (DECEMBER 20, 2016) Posted on December 20, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * 5 Tips for Leveraging Big Data to Increase Holiday Sales – Running an e-commerce store? A small business? Big data could help you nab more sales this holiday season and get the revenue flowing in. * IBM and BMW want Watson to help drive your car – IBM Watson could soon be helping you to drive your car, as IBM’s cognitive computing unit is set to work with BMW Group to explore how the technology could aid cars of the future. * IBM’s Watson Turns Its Computer Brain to NASA Research – IBM’s Watson computer system, hosted in the cloud, is taking on NASA’s big research data. * Obama administration proposes that all new cars must be able to talk to each other – The Obama Administration on Tuesday proposed a rule that would require all new cars to be able to communicate with other cars wirelessly, a move that advocates said could save lives, but that also raises privacy and hacking concerns among opponents. * How Will the Softer Side of Robots Affect our Lives – Despite the advancements in Robotics and Artificial Intelligence, Robots have not learnt how to show emotion… just yet…but when we think of robots, more often than not images of clunky humanoid contraptions, metal with hinged joints and bulky movement spring to mind (excuse the pun). * How Artificial Intelligence Will Usher in the Next Stage of E-Government – Since the earliest days of the Internet, most government agencies have eagerly explored how to use technology to better deliver services to citizens, businesses and other public-sector organizations. * Data Science, Predictive Analytics Main Developments in 2016 and Key Trends for 2017 – Key themes included the polling failures in 2016 US Elections, Deep Learning, IoT, greater focus on value and ROI, and increasing adoption of predictive analytics by the “masses” of industry. * Data science skills: Is NoSQL better than SQL? – Big data is one of the hottest sectors in tech right now, but how do you stay on top of the changing technologies? David Pardoe of Hays Recruitment talks about the differences between SQL and NoSQL in data. * Amazon makes its first Prime Air drone delivery to a customer – Amazon has completed its first customer delivery by drone. * Big Data Science: Expectation vs. 
Reality – The path to success and happiness of the data science team working with big data project is not always clear from the beginning. It depends on maturity of underlying platform, their cross skills and devops process around their day-to-day operations. * Data Tools Offer Hints at How Judges Might Rule – Services offer lawyers statistics on how likely a given case is to be dismissed. * A Supercomputer Knows What Flavors You Like Better Than You Do – You probably feel like you have a good idea of what food you like and what you don’t. Turns out, you might actually enjoy flavor combinations (strawberries and jalapeno!?!) that you would never have thought to try. Enter, supercomputers. * Deep-Learning Machine Listens to Bach, Then Writes Its Own Music in the Same Style – Can you tell the difference between music composed by Bach and by a neural network? * The Countries With The Fastest Internet – According to Akamai, South Korea is well ahead of the pack when it comes to fast internet. * Tourists Vs Locals: 20 Cities Based On Where People Take Photos – Tourists and locals experience cities in strikingly different ways. * Deep Learning Reinvents the Hearing Aid – Finally, wearers of hearing aids can pick out a voice in a crowded room. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Our forty fifth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (December 20, 2016)",Live,95 253,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe ×BLOGS IMPROVING QUALITY OF LIFE WITH SPARK-EMPOWERED MACHINE LEARNING Post Comment June 2, 2016 by Michal Malohlava Software Engineer, H2O.ai by Desmond Chan Senior Director, Marketing, H2O.aiWe are in an age in which machine learning has increasing importance in our daily lives. Machine learning is put into action whenever your mobile map application automatically reminds you to leave for your next appointment because of unusual traffic situations. Besides personal assistants on your cell phones, wearable sport devices use machine-learning algorithms to propose personal training plans, and banks depend on accurate machine-learning models to detect malicious transactions. Healthcare, for instance, has also started to find helpful patterns in medical data using machine learning. Modern technologies allow for close monitoring of a patient’s condition through a large volume of data provided by a number of sensors. Machine learning is applied to this data to find patterns and predict how a patient will react to a treatment plan, for example. Accuracy is particularly important in this field because each miss can have significant implications. 
This case presents several challenges for machine learning technologists: * Complicated model preparation because of the huge data volume and various forms of data including highly imbalanced data sets * Constant model retraining and reevaluation because of the ever-changing nature of patient data and structure, as well as the need to improve the accuracy of model prediction * Fast deployment of newly trained models to monitor a patient’s condition MACHINE LEARNING WITH THE POWER OF SPARK Sparkling Water brings the H2O open source, machine-learning platform to Apache Spark environments. H2O runs directly in a Spark Java virtual machine (JVM), which eliminates any data transfer overhead that other solutions typically incur: H2O allows users to combine the data processing power of Spark with powerful machine-learning algorithms provided by the H2O platform. This combination solves the aforementioned challenges for machine learning technologists in a variety of ways. PARALLELIZED DATA PROCESSING H2O is designed to process huge amounts of data in a distributed and fully parallelized fashion. This approach means a hospital can fully leverage all the data available for their analyses, explore and test more models in quick iterations and benefit from the results. OPERATIONALIZED MODEL TRAINING, EVALUATION AND COMPARISON, AND SCORING Finding the optimum model for a given patient condition is a tedious process that has many moving parts. Hospitals need to try out different strategies to explore the space of possible models and various setups and compare the results best suited for their environments. H2O operationalizes this tedious training process in several ways: * Providing a library of machine-learning algorithms supporting advanced, algorithm-specific features; moreover, H2O allows combining models into ensembles—super learners * Performing fast exploration of hyperspace of parameters (aka grid search) * Offering the facility to specify various criteria that identify and select the best model—for example, accuracy, building time, scoring time and so on * Adding the ability to continue model preparation with modified parameters and additional relevant training data; this specific feature of H2O helps simplify the lives of data scientists and speeds up model preparation turnaround * Creating visualizations of various model characteristics on the fly and the final model during training; moreover, users can explore the performance of the model on training as well as validation—that is, unseen—data. H2O also allows users to stop the model training process manually, if the visual feedback reports unexpected results; modify parameters; and continue the training. OPTIMIZED MODEL DEPLOYMENT Model deployment is one of the most critical elements of the machine-learning process in healthcare—the model, or even multiple models, are instantiated and fed by real-time data from sensors monitoring a patient’s body, and the models need to provide predictions as quickly as possible. To meet these strict requirements, H2O allows for the export of trained models as an optimized code for deployment into target systems—that is, web services, applications and so on. The optimized code delivers the best possible response time, which is crucial for applications that need to react quickly to changing conditions. USE CASES WITH STREAMLINED IMPLEMENTATION Sparkling Water improves and streamlines the way machine learning is applied to healthcare. 
Besides healthcare, Sparkling Water can also elevate the use of machine learning in a variety of other use cases: * Detecting fraud in the finance industry, where high accuracy and speed are key factors * Proposing interest rates for insurance applications or predicting drivers’ risk factors * Planning truck maintenance based on tracking trucks’ telemetry. Next time when you think about improving the quality of your life, remember Sparkling Water. At the Apache Spark Maker Community Event, 6 June 2016, IBM is sharing important announcements for helping customers to use Spark, R and open data science to drive business innovations. Register for this in-person event . If you can’t attend, then register to watch a livestream presentation of the event . Follow @IBMBigData Topics: Analytics , Big Data Technology , Data Scientists Tags: machine learning , algorithm , machine-learning algorithm , Apache Spark , Spark , analytics , predictive analytics , big data , R , PythonRELATED CONTENT WHITE PAPERS & REPORTS INTRODUCING NOTEBOOKS: A POWER TOOL FOR DATA SCIENTISTS Check out the details on a tool that can change the game for data scientists—open source analytics notebooks. Learn what notebooks are, what value they provide and how to get started using them today. View White papers & Reports Blog The power of machine learning in Spark Blog How can data scientists collaborate to build better business applications? Blog InsightOut: The role of Apache Atlas in the open metadata ecosystem Blog Top analytics tools in 2016 Blog End-to-end analytics in the cloud Blog Highlights from the Apache Spark Maker Community Event Blog Experiencing deeper productivity in open data science White papers & Reports Using a predictive analytics model to foresee flight delays Blog Learning to fly: How to predict flight delays using Spark MLlib Blog Innovative business applications: The disruptive potential of open data science Blog Lean data science with Apache Spark Blog Boosting the productivity of the next-generation data scientist View the discussion thread. 
IBM * Site Map * Privacy * Terms of Use * 2014 IBM FOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes More * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes SearchEXPLORE BY TOPIC: Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Content Analytics Customer Analytics Entity Analytics Financial Performance Management Insight Services Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Infographic Why manually analyzing video data is not an option Interactive Fighting the bad guys with advanced Cyber Threat Analysis Blog The 3 Cs of big data Blog The intersection of body camera video with CJIS guidelines and privacyMORE Infographic Why manually analyzing video data is not an option Interactive Fighting the bad guys with advanced Cyber Threat Analysis Blog The 3 Cs of big data Blog The intersection of body camera video with CJIS guidelines and privacy Blog IBM Analytics Day at the 2016 U.S. Open golf tournament Blog 4 ways intelligent video analytics enhance body-worn cameras White papers & Reports Capture more value from from body-worn camera video Blog The 3 Cs of big data Blog Keep your head above water with information lifecycle governance Infographic Adrift in a sea of data? Rise above the tide with IBM Information Lifecycle Governance solutions Blog Cloud-based ingestion: The future is hereMORE Blog The 3 Cs of big data Blog Keep your head above water with information lifecycle governance Infographic Adrift in a sea of data? 
",Discover an open source machine learning platform that combines the data processing power of Spark with powerful machine learning algorithms.,Improving quality of life with Spark-empowered machine learning,Live,96
256,"Welcome to the Big Data University Blog.
THIS WEEK IN DATA SCIENCE (NOVEMBER 08, 2016) Posted on November 9, 2016 by cora. Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful!

INTERESTING DATA SCIENCE ARTICLES AND NEWS
* NASA Is Harnessing Graph Databases To Organize Lessons Learned From Past Projects – The space agency has a new tool to discover unexpected patterns in past projects.
* Tracking How the World is Feeling – Spur Projects, an Australian organization focusing on suicide prevention, has published the data from its “How Is the World Feeling?” mental health survey.
* Machines Can Now Recognize Something After Seeing It Once – Algorithms usually need thousands of examples to learn something. Researchers at Google DeepMind found a way around that.
* Uber Self-Driving Truck Packed With Budweiser Makes First Delivery in Colorado – The ride-hailing giant teamed up with AB InBev to transport beer in an autonomous vehicle, which they say is the world's first such commercial delivery.
* These Cows Will Text You When They're in Heat – Dairy farmers are using sensors in cows' stomachs to track the health of the herd.
* How To Boost An Organization's Competitive Advantage By Using Cognitive Computing – Between AI-powered chatbots and gadget-based voice assistants, cognitive computing capabilities have captured the public imagination. But consumer products are just the tip of the iceberg.
* Taking the hard work out of Apache Hadoop – Why has IBM created its own distribution of Apache Hadoop and Apache Spark, and what makes it stand out from the competition?
* MIT CSAIL brings reasoning to machine learning – More and more companies are taking advantage of artificial intelligence to train machines on their data and make predictions. Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) want to take it a step further by revealing how a machine makes those insights.
* The White House Releases Paper on What It Wants to Do With Artificial Intelligence – The White House released a document earlier in the year via the Office of the President and the National Science and Technology Council Committee on Technology (NSTC).
* How Artificial Intelligence can enhance educational efforts – By taking advantage of artificial intelligence (AI) technologies, schools are giving teachers more tools to help their students while removing unnecessary obstacles.
* How Data Mining Reveals the World's Healthiest Cuisines – Algorithms are teasing apart the link between food and health to provide the first evidence that we really are what we eat.
* IBM Teams Up With Slack to Build Smarter Data-Crunching Chatbots – IBM is teaming up with Slack Technologies Inc. to make it easier for companies to build custom chatbots into the startup's workplace-messaging systems.
* The app developer's guide to creating your first Watson bot – You're building your first chat bot and the pressure's on. Never fear – the Watson team is here!
* Google's neural networks invent their own encryption – A team from Google Brain, Google's deep learning project, has shown that machines can learn how to protect their messages from prying eyes.

UPCOMING DATA SCIENCE EVENTS
* IBM Webinar: Self service analytics in a flash with dashDB – On November 17th, learn how the dashDB family of warehousing solutions can help with self-service analytics.
* Introduction to Python for Data Science – Learn how to use Python for data science on November 10th.
* IBM Event: Analytics Strategies in the Cloud – Join IBM and 2-time Canadian Olympic gold-medalist Alexandre Bilodeau on November 7th for a complimentary event in Montreal where you'll network, eat, drink and engage in an inspiring discussion on making business analytics easier and more available for all departments throughout your company.
","Our thirty ninth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (November 08, 2016)",Live,97
257,"HOW TO MAP GEOSPATIAL DATA: USA RIVERS February 7, 2017

R CODE Here's the R code to produce the map:

#===============
# LOAD PACKAGES
#===============
library(tidyverse)
library(maptools)

#===============
# GET RIVER DATA
#===============

#==========
# LOAD DATA
#==========

# DEFINE URL
# - this is the location of the file
url.river_data <- url(""http://sharpsightlabs.com/wp-content/datasets/usa_rivers.RData"")

# LOAD DATA
# - this will retrieve the data from the URL
load(url.river_data)

# INSPECT
summary(lines.rivers)
lines.rivers@data %>% glimpse()
levels(lines.rivers$FEATURE)
table(lines.rivers$FEATURE)

#==============================================
# REMOVE MISC FEATURES
# - there are some features in the data that we
#   want to remove
#==============================================
lines.rivers <- subset(lines.rivers, !(FEATURE %in% c(""Shoreline""
                                                      ,""Shoreline Intermittent""
                                                      ,""Null""
                                                      ,""Closure Line""
                                                      ,""Apparent Limit""
                                                      )))

# RE-INSPECT
table(lines.rivers$FEATURE)

#==============
# REMOVE STATES
#==============

#-------------------------------
# IDENTIFY STATES
# - we need to find out
#   which states are in the data
#-------------------------------
table(lines.rivers$STATE)

#---------------------------------------------------------
# REMOVE STATES
# - remove Alaska, Hawaii, Puerto Rico, and Virgin Islands
# - these are hard to plot in a confined window, so
#   we'll remove them for convenience
#---------------------------------------------------------
lines.rivers <- subset(lines.rivers, !(STATE %in% c('AK','HI','PR','VI')))

# RE-INSPECT
table(lines.rivers$STATE)

#============================================
# FORTIFY
# - fortify will convert the
#   'SpatialLinesDataFrame' to a proper
#   data frame that we can use with ggplot2
#============================================
df.usa_rivers <- fortify(lines.rivers)

#============
# GET USA MAP
#============
map.usa_country <- map_data(""usa"")
map.usa_states <- map_data(""state"")

#=======
# PLOT
#=======
ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group), color = ""#8ca7c0"", size = .08) +
  coord_map(projection = ""albers"", lat0 = 30, lat1 = 40, xlim = c(-121,-73), ylim = c(25,51)) +
  labs(title = ""Rivers and waterways of the United States"") +
  annotate(""text"", label = ""sharpsightlabs.com"", family = ""Gill Sans"", color = ""#A1A1A1""
           , x = -89, y = 26.5, size = 5) +
  theme(panel.background = element_rect(fill = ""#292929"")
        ,plot.background = element_rect(fill = ""#292929"")
        ,panel.grid = element_blank()
        ,axis.title = element_blank()
        ,axis.text = element_blank()
        ,axis.ticks = element_blank()
        ,text = element_text(family = ""Gill Sans"", color = ""#A1A1A1"")
        ,plot.title = element_text(size = 34)
        )

USE THIS AS PRACTICE If you've learned the basics of data visualization in R (namely, ggplot2) and you're interested in geospatial visualization, use this as a small, narrowly-defined exercise to practice some intermediate skills. There are at least three things that you can learn and practice with this visualization:
1. Learn about color: Part of what makes this visualization compelling are the colors. Notice that in the area surrounding the US, we're not using pure black, but a dark grey. For the title, we're not using white, but a medium grey. Also, notice that for the rivers, we're not using “blue” but a very specific hexadecimal color. These are all deliberate choices. As an exercise, I highly recommend modifying the colors. Play around a bit and see how changing the colors changes the “feel” of the visualization.
2. Learn to build visualizations in layers: I've emphasized this several times recently, but layering is an important principle of data visualization. Notice that we're layering the river data over the USA country map. As an exercise, you could also layer in the state boundaries between the country map and the rivers. To do this, you can use map_data() (a short sketch of this appears at the end of this post).
3. Learn about 'Spatial' data: R has several classes for dealing with 'geospatial' data, such as 'SpatialLines', 'SpatialPoints', and others. Spatial data is a whole different animal, so you'll have to learn its structure. This example will give you a little experience dealing with it.

ITERATE TO GET THE DETAILS RIGHT What really makes this visualization work is the fine little details. In particular, the size of the lines and the colors. The reality is that creating good-looking visualizations requires attention to the little details. To get the details right for a plot like this, I recommend that you build the visualization iteratively. Start with a simple version of just the map of the US.

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"")

Next, layer on the rivers:

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group))

Make no mistake: this doesn't look good. But, in the early stages, that's not the goal. You just want to make sure that the data are structurally right. You want something simple that you can build on. Ok, next, play with the river colors.
Start with a simple 'blue':

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group), color = ""blue"")

Let's be honest. This still does not look good. But it's closer. From here, you can play with the colors some more. Select a new color (I recommend using a color picker), and modify the color = aesthetic for geom_path().

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group), color = ""#99ccff"")

Not perfect, but better still. From here, you can continue to iterate, add more details, and get them all “perfect”:
* The exact color (this takes lots of trial-and-error, and a bit of good taste)
* The line size for geom_path()
* The title and text annotations
* Modify the projection, and change it to the “albers” projection with coord_map()
* The other theme() details like background color, removing extraneous elements (like the axis labels), etc

Once again: getting this just right takes lots of iteration. Try it yourself and build this visualization from the bottom up.

LEARN GGPLOT2 (BECAUSE GGPLOT2 MAKES THIS EASY) In this post, we've used ggplot2 to create this particular visualization. While I would classify this visualization at an “intermediate” level, ggplot2 still makes it relatively easy. That said, if you're interested in data science and data visualization, learn ggplot2. Longtime readers at Sharp Sight will know my thoughts on this, but if you're a new reader this is important. ggplot2 is almost without question the best data visualization tool available. Of course, different people will have different needs, but speaking generally, ggplot2 is flexible, powerful, and it allows you to create beautiful data visualizations with relative ease. Not interested in visualization per se? Do you want to focus on machine learning instead? Fair enough. If you want to learn machine learning, you still need to be able to analyze and explore your data. Once again, the best tool for exploring and analyzing your data is ggplot2. This is particularly true when you combine it with dplyr, tidyr, stringr, and other tools from the tidyverse.
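As a follow-up to practice point 2 above, here is a minimal sketch of one way to layer the state boundaries between the country map and the rivers. It assumes the map.usa_country, map.usa_states and df.usa_rivers objects created in the script above are already in your workspace; the boundary color and size values are just placeholders to experiment with.

ggplot() +
  # base layer: the country silhouette
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  # middle layer: state boundaries, drawn as thin outlines only
  geom_polygon(data = map.usa_states, aes(x = long, y = lat, group = group),
               fill = NA, color = ""#5c5c5c"", size = .1) +
  # top layer: the rivers
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group),
            color = ""#8ca7c0"", size = .08)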
First Name E-Mail Address © 2017 · Powered by data",Sign up now to learn about data visualization in R,How to map USA rivers using ggplot2,Live,98 266,"SEVEN DATABASES IN SEVEN DAYS – DAY 2: MONGODB Lorna Mitchell and Matt Collins / August 5, 2016This post is part of a series of posts created by the two newest members of our Developer Advocate team here at IBM Cloud Data Services. In honour of the book Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, we challenged Lorna and Matt to take a new database from our portfolio every day, get it set up and working, and write a blog post about their experiences. Each post reflects the story of their day with a new database. We’ll update our seven-days GitHub repo with example code as the series progresses. —The Editors * Database type: schemaless JSON-like storage with search and data aggregation * Best tool for: creating highly scalable apps that need to query large datasets fast MongoDB . It’s kind of a thing. OVERVIEW MongoDB is a NoSQL database that allows you to store your data in JSON-like documents rather than the more traditional RDBMS approach. With a focus on scalability (sharding and replication are available out of the box) and flexibility (data stores are schemaless and easily searchable via secondary indexes — even geospatial!). MongoDB intends to provide a database that maps to your application and keeps up through iterations. There are also a number of other features, such as a powerful Data Aggregation Pipeline and MapReduce, or for more in depth analysis you can connect MongoDB directly to Hadoop or Spark. MongoDB is open source, so you can get up and running pretty quickly although we are going to make use of MongoDB from Compose to get up and running for the purposes of this article. We will cover how to get started with MongoDB and put together a simple example showing how you can utilise this database to store blog posts with threaded comments. GETTING SET UP Start by setting up a MongoDB instance on Compose — this may take a few minutes to deploy. Make sure to check the SSL option when configuring your deployment! Once deployed, you will be able to add a database and a user. Compose presents you with a cool “command line” style interface to do this, and it’s simple enough. When creating your user, make sure you make a note of the password as Compose will hide this from you going forwards. Compose will automatically give your user the permissions it requires. Also, for this tutorial, make sure you create your user in the special admin database. For more on user permissions in Compose’s MongoDB service, see Connecting to the new MongoDB at Compose . Creating a database is just as simple as giving it a name. Feel free to try it out, but our example code will handle database creation for you. We are going to use PHP to create our examples, and you’ll need to install the MongoDB PHP extension . Since I’m on Linux, I’ll use pecl : pecl install mongodb According to the PHP documentation , Mac users should use brew : brew install php55-mongodb New to PHP on a Mac? Without installing MAMP , here’s how to get PHP running on your local Apache web server . And here’s how to access php.ini if you need to append extension=mongodb.so . And finally, you will need to save a copy of the SSL Certificate from the Overview page of the Compose control panel — we have saved ours into a file simply called cert . Make sure to include the -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- in your saved file. 
CONNECTING FROM PHP MongoDB is well supported, with libraries available for all of your favourite languages, including PHP. These libraries make it very easy to gain access to all of MongoDB's features and let you focus on building your app. To connect from PHP you will need the following:
* Your username
* Your password
* Your SSL Certificate

Combine the username and password with the connection string provided by Compose on the Overview screen to get something like this:

mongodb://your-user-here:your-password-here@aws-us-east-1-portal.11.dblayer.com:28086/admin?ssl=true

This connection string will be different for each deployment, so be careful when copying and pasting! Here are two general notes, however:
* Use the connection string for drivers that cannot handle failover between mongos nodes.
* Connect to the admin database, since our code keeps it simple.

In addition to the MongoDB extension, we'll use the PHP library that MongoDB provide to give a nice, easy wrapper for accessing MongoDB. This can be installed via Composer:

composer require ""mongodb/mongodb=^1.0.0""

This adds the requirement into your composer.json file (creating it if it didn't exist already). You'll need to run the composer install command to bring the files in; these can be found in the vendor directory. Now use the connection string from before to connect to your MongoDB deployment as so:

connect.php

<?php
// connect.php
// (reconstructed: parts of this script were lost when the original page was scraped)
require 'vendor/autoload.php';

// connect, passing the path to the saved SSL certificate as a driver option
$client = new MongoDB\Client(
    ""mongodb://sevendbs:a22733d78a33d34c20da2d84ee9db5e4@aws-us-east-1-portal.11.dblayer.com:28086/admin?ssl=true"",
    [],
    [""cafile"" => ""./cert""]
);

// select the posts collection from the posts database
$posts = $client->selectDatabase(""posts"")->selectCollection(""posts"");

Notice that as well as passing in the connection string, we are also providing the path to the SSL Certificate. We can then select the posts collection from the posts database that we create. This file is saved as connect.php and our other scripts also use it to connect.

What is a collection in MongoDB? From the MongoDB reference manual: “A [collection is a] grouping of MongoDB documents. A collection is the equivalent of an RDBMS table. A collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection have a similar or related purpose.”

INSERTING DATA The PHP library is a fairly lightweight wrapper around MongoDB's command-line interface, which really helps to make this database feel like a consistent interface across platforms (all the other language drivers also follow this pattern). In this case, we're using the insertOne method to add a new blog post to our posts collection. This example shows a very basic PHP script which will display an HTML form, allowing the user to enter some data, which then gets saved in the database. Here's the form itself, followed by the code. If you're not seeing “Post saved” upon form submission, remember to enable debugging!

add_post.php

<?php
// add_post.php
// (reconstructed: the original HTML form markup was stripped out when the page was
//  scraped, so the markup below is a simplified stand-in; the original used purecss.io)
require 'connect.php';

if (!empty($_POST)) {
    // build up the document to store
    $data = [
        ""title""       => filter_input(INPUT_POST, ""title"", FILTER_SANITIZE_STRING),
        ""description"" => filter_input(INPUT_POST, ""post"", FILTER_SANITIZE_STRING),
    ];
    $posts->insertOne($data);
    echo ""Post saved"";
} else {
    // show the form
    ?>
    <h1>MongoDB in action</h1>
    <h2>Add A Post</h2>
    <form method=""post"">
        <input type=""text"" name=""title"" placeholder=""Title"" />
        <textarea name=""post"" placeholder=""Write your post here""></textarea>
        <button type=""submit"">Save</button>
    </form>
    <?php
}
You can see that if there's no data supplied, a simple form is shown here (with a little http://purecss.io to make it nicer to look at) so we can quickly start adding data. If data does arrive as a POST request, then we build up an array with the data we want, and then save it to MongoDB. Remember that MongoDB does not have a schema: you can build up whatever data structure you like before inserting, and the shape of the data can be different each time, which makes it ideal for sparse properties, for example. MongoDB will give our record a unique ID when it saves it. You may also want to supply this yourself, which you can do by including a _id key and the desired value when creating the data to insert. Either way, this is useful when we come to fetch a list of records and want to be able to identify just one of them.

FETCHING DATA Mongo has some great query functionality, and its “aggregation framework” is excellent for gaining insights into potentially large and nested data sets. We just want a list of posts however, and for that we simply use the find() method, then output each of our posts along with a count of comments (more on comments in the next section):

index.php

<?php
// index.php
// (reconstructed: the original HTML markup was stripped out when the page was scraped)
require 'connect.php';

// fetch every post in the collection
$allPosts = $posts->find();
?>
<h1>MongoDB in action</h1>
<h2>Blog Posts</h2>
<ul>
<?php foreach ($allPosts as $p): ?>
    <li>
        <a href=""/post.php?id=<?php echo $p->_id; ?>""><?php echo $p->title; ?></a>
        (<?php echo isset($p->comments) ? count($p->comments) : 0; ?> comments)
    </li>
<?php endforeach; ?>
</ul>
MongoDB returns each document as an object, with properties set for each of the fields that were stored. This makes it very easy to access using object notation, e.g. the $p->title in the example above. In the list, we're also adding hyperlinks and using the ID so that we can fetch individual records on another page.

ADDING NESTED DATA MongoDB doesn't really do joins, so for the most part, database design involves storing data together that will be used together. So if you're storing content, you'll probably have a bunch of content elements and anything they rely on, all inside one document. In this example, we're storing blog posts and we'll add the comments as part of the post record. Here's the individual post page, which displays the post, allows a user to add a comment, and lists the comments that have already been added:

post.php

<?php
// post.php
// (reconstructed: the original HTML markup was stripped out when the page was scraped,
//  so the markup below is a simplified stand-in)
require 'connect.php';

if (!empty($_POST)) {
    // a new comment was submitted: push it onto the post's comments array
    $id = filter_input(INPUT_POST, ""id"");
    $data = [
        ""username"" => filter_input(INPUT_POST, ""name"", FILTER_SANITIZE_STRING),
        ""comment""  => filter_input(INPUT_POST, ""comment"", FILTER_SANITIZE_STRING),
    ];
    $result = $posts->updateOne(
        [""_id"" => new MongoDB\BSON\ObjectID($id)],
        ['$push' => [""comments"" => $data]]
    );
    header(""Location: /post.php?id="" . $id);
    exit;
} else {
    $id = filter_input(INPUT_GET, ""id"");
}

if ($id):
    $post = $posts->findOne([""_id"" => new MongoDB\BSON\ObjectID($id)]);
?>
<h1>MongoDB in action</h1>
<h2><?php echo $post->title; ?></h2>
<p><?php echo $post->description; ?></p>

<h3>Add Comments</h3>
<form method=""post"">
    <input type=""hidden"" name=""id"" value=""<?php echo $id; ?>"" />
    <input type=""text"" name=""name"" placeholder=""Your name"" />
    <textarea name=""comment""></textarea>
    <button type=""submit"">Save</button>
</form>

<?php if (isset($post->comments)): ?>
<ul>
    <?php foreach ($post->comments as $comment): ?>
        <li><?php echo $comment->comment; ?> - by <?php echo $comment->username; ?></li>
    <?php endforeach; ?>
</ul>
<?php endif; ?>
<?php endif; // if the post actually existed ?>
The interesting bit here is really where we save the comments, the call to $posts->updateOne. We use the same filter criteria as we do when we fetch the post, but then we go on to push the $data array onto the end of the comments collection. If this collection doesn't exist, MongoDB will simply create it. Look out for using the mongo identifiers such as $push — in PHP we need to carefully wrap them in single quotes so that PHP doesn't try to interpret the $! Now our comments are inside our existing MongoDB document:

{
    ""_id"" : ObjectId(""575038cc1661d711090e9911""),
    ""title"" : ""Databases are excellent"",
    ""description"" : ""We could talk about them for hours"",
    ""comments"" : [
        { ""username"" : ""lorna"", ""comment"" : ""I think so too"" },
        { ""username"" : ""lorna"", ""comment"" : ""I think so too"" },
        { ""username"" : ""fred"", ""comment"" : ""Thanks for this post, it helped me!"" },
        { ""username"" : ""george"", ""comment"" : ""I totally disagree, they are a hazard"" }
    ]
}

With this in place, we can add some comments to our database and then revisit the index page to see how things are looking.

CONCLUSION MongoDB is quite a key player in the NoSQL arena, and this shows through with the amount of developer support that is available on their website in the shape of libraries and docs; however, there were some instances where we were looking for examples that didn't seem to exist! On the plus side, MongoDB does have a solid user base and there is a rich ecosystem of content from forums and other people's blog posts that will help you — beware that the PHP libraries changed relatively recently, though, so you may find some content is outdated. One feature that can set it apart from some of its rivals is that you don't need to write the whole document back again when updating — you can simply push updates to the fields that you require. This can help avoid conflicts in a write-heavy application. The big selling point, however, is the schema-less and scalable nature of the database, meaning that you really can build apps with the future in mind without worrying about how your infrastructure will adapt. The inclusion of secondary indexes allows quick searching on huge amounts of data, and that can only be a positive. With MongoDB being open source you can get started on any platform and deploy to more or less anywhere, or if you want to avoid that entirely there are a number of cloud-based providers available.
","Looking to learn the basics of cloud databases? In this series, we show them running on Compose and intro programmatic access. Enter: MongoDB + PHP.",Seven Databases in Seven Days – Day 2: MongoDB,Live,99
268,"IBM DATA CATALOG: USE DATA ASSETS IN A PROJECT, developerWorks TV. Published on Oct 31, 2017. This video shows you how to add a data asset to an existing project and then load that data for analytics in a Python notebook. Find more videos in the IBM Data Catalog Learning Center at http://ibm.biz/data-catalog-learning
",This video shows you how to add a data asset to an existing project and then load that data for analytics in a Python notebook. ,Use data assets in a project using IBM Data Catalog,Live,100
269,"HOW TO CHOOSE A PROJECT TO PRACTICE DATA SCIENCE March 14, 2017

Here at Sharp Sight, I've derided the “jump in and build something” method of learning data science for quite some time. Learning data science by “jumping in” and starting a big project is highly inefficient. However, projects can be extremely useful for practicing data science and refining your skillset, if you know how to select the right project. Before I give you some pointers on how to select a good project, let's first talk about why “jump in and build” is not the best method of learning data science.

JUMP IN AND BUILD IS BAD FOR LEARNING As I mentioned above, Jump in and Build Something™ is the method of learning where you jump in and just build something. It's based on the idea that the best way to learn a new skill is to select a large project and just build, even if you don't know most of the requisite skills. You see this quite a bit in programming. A few years ago, you used to hear guys say “I'm going to learn PHP by building an online social network” (essentially, building a Facebook copy).

JUMP IN AND BUILD IS EXTREMELY INEFFICIENT While I will admit that it is possible to learn a new skill by jumping into a new project, you have to understand that it's extremely inefficient. I also tend to think that for beginners, the “knowledge gained” decreases dramatically as the size and complexity of a project increases. That's another way of saying that if a beginner selects a project that's too big, they're likely to learn very little (although, large projects can be very useful for advanced practitioners). The reason for this is that if you choose a project that's too big, and you don't know most of the skills, you get bogged down just trying to learn everything before you can move on to getting things done. If you “jump in” to a very complicated project, but you don't know the requisite skills, you're going to spend 99% of your time just looking things up.
If you’re a beginner and you don’t know much, you might even have trouble figuring out where to start. Essentially, if you try to work on a project that’s too large or too complicated, you’ll spend all of your time trying to learn dozens of small things that you should have learned before starting the project. To help clarify why this point, I’ll give you a few analogies. EXAMPLES: WHEN “JUMPING IN” IS A BAD IDEA I can think of dozens of examples in other arenas where “jumping in” can get you in over your head, but here are two that some of you might be familiar with: learning an instrument and lifting weights. TRYING TO LEARN GUITAR WITH SOMETHING WAY TOO HARD At some point in their life, most people have a desire to learn to play an instrument. For many people (and guys in particular) learning guitar is a goal. If you want to learn to play guitar, are you going to jump in and try to learn to do this right away? There are some people who are foolish enough to try. The fact is, learning to play guitar like this would take most people years. More importantly, it would take years of preparation by learning thousands of little skills before you’d be at a level to perform like this. You’d have to learn a thousand little things: how to position your fingers on the fretboard. How to pick. How to play little “phrases” and also how to play fast. Etcetera. Moreover, it’s not just the nuts-and-bolts techniques that makes it hard. It’s also a matter of style. To play guitar like this, you need to learn how to be expressive with the guitar. That’s a completely separate skill that also takes years. So if you want to learn to play guitar, could you do it by jumping in and learning the guitar solo in the video? Is it possible to learn guitar by trying to learn this complicated guitar solo, one note at a time? Would you be able to do this without knowing any foundational guitar skills beforehand? Maybe. But it would be a long, frustrating effort. My guess is that such a task would induce most people to quit. For beginning guitarists, it’s much, much more effective and efficient to start with the absolute foundational guitar skills, master the foundational skills, and progressively move on to skills of increasing difficulty. It’s much more effective to put together a systematic plan with a skilled teacher that puts you on the path to your goal in structured way. Data science is exactly like this. The most efficient and effective way to learn data science is to be highly systematic. You need to have a plan. You need to learn the right things in the right order. The optimal strategy for learning data science is almost the opposite of “jump in and build something.” TRYING TO GET STRONG BY LIFTING TOO MUCH WEIGHT Here’s another example. If you want to get fit and strong, it’s a terrible idea to jump in and try to lift very heavy weight. If you “jump in” and try to lift an amount of weight that’s far beyond your strength level, you’re likely to fail. Like this guy. Wow. Too much man. Take some weight off the bar. In weightlifting, if you try to lift too much, you’re likely to fail and you might even get hurt. In data science, you won’t have a risk of injuring yourself physically, but you might incur a different sort of damage: you might injure your ego . You might attempt a project that’s too hard and subsequently fail. Your failure might cause you to believe that you’re “not smart enough” to learn data science, and you might give up altogether. I hear it all the time. 
People try something that’s too hard, fail, and then give up. It’s a very real risk. There’s actually a much better way to become a strong data scientist and it’s a lot like trying to get strong in the gym. In the gym, the best way to get strong is to start with light weights, and learn the basic motions safely with those low weights. Then, add a little weight each week. Five pounds. Maybe ten. That doesn’t sound like a lot, but over the course of only a few months, if you continue to add weight to the bar each week, you will get stronger. Similarly, in data science, instead of jumping into a project with a high difficulty level, you should start with something small and do-able with your current skill level, then increase the size and complexity of your projects as you learn more over time. It’s remarkably similar to weightlifting. Start small, then increase complexity. Over time, you will become a strong, highly skilled data scientist. WHEN TO USE PROJECTS TO PRACTICE DATA SCIENCE At this point, I want to clarify something, to make sure that you don’t get the wrong idea. Projects are great, but not for learning . At a high level, projects are not very good for learning skills. However, projects are excellent for 2 things: 1. Integrating skills that you’ve already learned 2. Identifying skill gaps PROJECTS HELP YOU INTEGRATE SKILLS YOU’VE ALREADY LEARNED As you develop as a data scientist, projects are best for integrating the things that you already know. Here’s what I mean: Many of the skills that you need to learn in order to become a data scientist are highly modular . This is particularly true if you’re using the tidyverse in R. For the most part, the tidyverse was designed such that each function does one thing, and does it well. Each of these small tools (I.e., each function) is a small unit that you should learn and practice on a very small scale before starting a project. You should find very, very simple examples and practice those examples repeatedly over time. This is just like a guitar player: a guitar player might practice a guitar scale every single day for a few weeks (or years). He might have a set of 3 chords and practice simple transitions between those guitar chords. Similarly, you should have small, learnable units that you practice regularly. As a beginner, you should practice just making a bar chart. You should practice how to use dplyr::mutate() to add a new variable to a dataset . You should learn these skills on very simple examples, and practice them repeatedly until you can write that code “with your eyes closed.” Then, when you start working on a small project, the project will help you integrate those skills. For example, you’ll often need to use dplyr::filter() in combination with ggplot2 to subset your data and create a new plot. Working on a project gives you an opportunity to put these two tools together. It allows you to take ggplot() and filter() – which you should have practiced separately – and integrate them in a way that produces something new and more complex. This is what projects are great for: they help you put the pieces together. Projects help you integrate skills that you’ve already learned into a more cohesive whole. PROJECTS HELP YOU IDENTIFY SKILL GAPS The second use for projects is to help you identify skill gaps. When you start a new project, I recommend that you know most of the tools and techniques that you need to complete the project. 
So if the project requires bar charts, histograms, data sorting, adding new variables, etc, you should already know those skills. You should have learned them with small, simple examples, and practiced them for a while so that you’re “smooth” at executing them. However, even if you’ve learned and practiced the required tools, when you dive into your project, you’ll begin to find little gaps. You’ll find things that you don’t know quite as well as you thought you did. You’ll discover that maybe you don’t know a particular function that well. Or you’ve forgotten a critical piece of syntax. This is gold. When you work on a project, these “missing pieces” tell you what you need to work on in order to get you to the next level. Let me give you an example: when you’re starting out with ggplot2 , I recommend that you learn 5 critical data visualizations : the bar, the line, the scatter, the histogram, and the small multiple. These comprise what I sometimes call “the big 5” data visualizations. These are the essentials. After learning these, let’s say that you decide to work on a project. You decide to analyze a small dataset that you obtained online, and you plan to use the “essential visualizations.” But after creating the basic visualizations to analyze the data, you decide that you want to make them look a little more polished by modifying the plot themes. If, at that point, you haven’t learned ggplot2::theme() and all of the element functions (like element_line() element_rect() , etc) then you’ll have a hard time formatting your plots and making them look more professional. In this case, you will have identified a “skill gap.” These are next skills to work on. You’d know that to get to the next level, you need to learn (and practice!) the theme() function and the accompanying functions & parameters of the ggplot2 theme system. Projects are excellent for identifying your skill weaknesses. That will help you refine your learning plan as you move forward. HOW TO CHOOSE A GOOD DATA SCIENCE PROJECT TO PRACTICE DATA SCIENCE To get the benefits from project work, the critical factor is selecting a project that’s at the right skill level: not too hard, but not too easy. If you’ve selected well, then you’ll have a small and manageable list of “things to learn right now” in order to finish the project. If you’ve selected a project at an appropriate skill level, then your “skill gap” will be small, you’ll be able to learn those new skills on the fly, and you’ll be able to complete the project. Afterwards, you’ll be able to add these “new skills” to your practice routine so you can remember them over time . Choosing such a project is more of an art than a science, but here are a few pointers: CHOOSE SOMETHING THAT YOU THAT’S MOSTLY WITHIN YOUR CURRENT SKILL LEVEL Ultimately, you want something that’s within your skill level, but will push you just a little bit. Having said this, when you consider a new project, you should just ask a few simple questions: 1. What skills do I think I’ll need? 2. Do I know those skills? Here’s an example: about a year ago, I did an analysis of a car dataset that I obtained online. Before starting this project, I had a good idea of the tools that I’d need: * Bar chart * Line chart * Histogram * Small multiple * Joining datasets * Adding variables to a dataset There were a few other tools and techniques, but that’s the short list. Before I even started the project, I had a rough idea that those were the skills that I needed to know. 
If you wanted to execute a similar project, you should make a similar list, and ask yourself, do I know most of these skills already? YOU SHOULD KNOW HOW TO DO ABOUT 90% OF THE WORK After identifying the tools and techniques you’ll need for a project, here’s a good rule of thumb: you should already know about 90 percent of the tools and techniques. For example, if you’re working on a project that requires about 20 primary tools or techniques, you should be able to execute roughly 18 of those techniques. That means that there would be about 2 – 4 techniques that you didn’t know. Such a project would be a decent stretch. For the 18 techniques that you do know, it will be good practice. You’ll get to repeat those techniques (repetition is essential for long-term memory) and perhaps combine them into new or interesting ways. What about the techniques that you don’t know? You’ll have to learn them on the fly and integrate them into the project. This is actually hard to do, because learning a new technique will slow you down. Learning a new technique while you’re working on your project will dramatically reduce your effectiveness and slow down the project’s progress. That’s why I recommend that you mostly learn and practice techniques outside the context of project work. To rapidly learn and master your tools , should be learning and practicing your toolkit regularly and separate from your projects. But again, if you begin a project and realize that there are a few necessary techniques that you don’t know, that’s fine. In fact, it’s good. It tells you what your next steps are for your learning plan. This invites a question though: What counts as a technique? I actually think that the tidyverse’s modular structure gives us a good way of breaking things down. Individual tidyverse functions are a good way to dissect the project into different tools. In this scheme, I’d consider dplyr::mutate() to be one tool. dplyr::arrange() would be another. Among the ggplot2 techniques, you could consider geom_line() to be a single technique. Some of the intermediate tools like scale_fill_gradient() could also be considered separate techniques. Again, the tidyverse is highly modular, in that each function is a little functional module that does one thing. That being the case, you can treat these little, modular functions as units that you either know or don’t know when you evaluate a potential project. So to restate, here’s a good rule of thumb: when you start a project, you should already know about 90% of the techniques (and the remaining 10% will force you to stretch your skill). IF IT FEELS TOO EASY, CHOOSE SOMETHING HARDER Having said that, if you evaluate a project, and it seems too easy, then try to find something harder. You want to push yourself just a little. For example, if you’ve been a data scientist for a year or two, and you’ve made a few hundred bar charts and line charts, then choosing a project that uses only the basic tools might be little too easy for you. If that’s the case, try to find something that is just a little out of our comfort zone. Again, it’s like weight lifting: you need to add a little weight to the bar every week in order to get strong. If the weight on the bar is so easy that you can do a couple dozen repetitions, it’s too light. You need something more difficult. If you look at a potential project, and you know you’ve done something very similar many times before, choose something more difficult. 
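Before wrapping up, here is a minimal practice sketch of the kind of small, modular drills described above: one tidyverse function at a time, each on a tiny, familiar example, followed by one simple integration of two tools. The mtcars dataset is only a stand-in here (it is not from this post); any small dataset you know well works just as well.

library(tidyverse)

# mtcars is a stand-in dataset (an assumption), not data from the original post

# drill 1: add a new variable with mutate()
mtcars_2 <- mtcars %>% mutate(weight_kg = wt * 453.6)

# drill 2: sort with arrange()
mtcars_2 %>% arrange(desc(weight_kg))

# drill 3: a basic bar chart of car counts by cylinder
ggplot(data = mtcars_2, aes(x = factor(cyl))) +
  geom_bar()

# integration: filter() piped straight into a plot
mtcars_2 %>%
  filter(weight_kg > 1500) %>%
  ggplot(aes(x = weight_kg, y = mpg)) +
  geom_point()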
PROJECTS ARE PART OF A LARGER PROCESS OF SYSTEMATIC LEARNING If you use projects the right way, then they are a critical part of a much larger scheme of highly systematic learning. In this post, I dropped some hints, but here I'll be more explicit: to rapidly learn and master data science, you need to be systematic. You need to be systematic in what you learn, when you learn it, and how you practice. High performers of all stripes know that relentless, systematic practice is the most effective way to learn a new skill. Having said that, as I mentioned above, projects are an important part of a systematic learning plan because they help you integrate what you've already learned, they help you identify skill gaps, and they can push you beyond your comfort zone. But whatever you do, don't fall into the “jump in and build something” trap by trying to learn data science without a plan.
","Projects can be great for mastering data science, but you have to choose your projects carefully. This article will give you tips on how to choose a project that's appropriate for your skill level (and tell you some pitfalls to watch out for). For more data science tutorials, sign up for our email list.",How to choose a project to practice data science,Live,101
271,"HOW TO EASE THE STRAIN AS YOUR DATA VOLUMES RISE September 14, 2017 | Written by: Manish Bhide

Ever had to make a decision when you didn't have the time, means or patience to look up all the data that could help you choose the best option? Yes, well, you're not alone on that score. Usually, this doesn't have significant or long-lasting consequences — does it really matter if you choose where to go for dinner because you like the look of a place, rather than combing through recent reviews? But some decisions carry a lot more weight.
For example, executives at Kodak decided not to pursue the digital camera technology that their employees invented, giving arch-rivals Fuji and Sony a golden opportunity to seize market share that they were never able to claw back. For some time now, the party line has been that big data could have saved these organizations and countless others from bad decisions. But that isn't the whole story. As my colleague Jay Limburn shared in a previous blog post, having lots of information — particularly when it is poorly organized, difficult to find or not fully trusted — can hold you back just as much as not having enough data.

SOLVING THE SCALABILITY CONUNDRUM We all know how important scalability is when building an infrastructure that can cope with big data — the clue is in the name ‘big data’! But how do you actually achieve scalability that delivers service continuity as your data grows? First, you need to take some key considerations into account. Scalability isn't just about coping with gigabytes of data that grows to terabytes, petabytes, exabytes, zettabytes, yottabytes and beyond. It's also about dealing with increasing numbers of data sets, formats and types. For that, you'll need to make sure that the tools that help your knowledge workers make sense of data and manage governance policies can scale up too, or you'll soon be in trouble. You can scale data infrastructure vertically, by adding resources to existing systems, or horizontally, by adding more systems and connecting them so you can load balance across them as a single logical unit. Vertical scaling is limited, because you will eventually reach the maximum capacity of your machine. In contrast, horizontal scaling may take more planning but presents far fewer restrictions.

SO, WHAT'S THE ANSWER? The best approach is multi-faceted: give knowledge workers access to lots of data, along with the tools they need to quickly find the most relevant assets without violating governance policies along the way. Of course, this is easier said than done. But with data management tools that include built-in cataloging — such as IBM's new IBM Data Catalog solution — you will be able to quickly search for data both within and across extremely large sets. As an example, if one of your data scientists discovers a relevant data set when researching a topic, they will be able to add tags and descriptions to make it easier for other data workers to find it when working on similar problems or questions. As more people add to the metadata, it will become increasingly easy for data scientists to gather the information they need through keyword searches. In addition to its cataloging capabilities, IBM Data Catalog will also feature a business glossary, to help users tackle the challenges of continually evolving terminology. Different people refer to different things in different ways, which can prevent knowledge workers from finding relevant data sets, a problem that only gets worse as organizations and their data get larger. A business glossary will enable you to establish a consistent set of terms to describe your data, so that knowledge workers can quickly understand which assets are useful and which are irrelevant to their analyses. Users will also be able to take advantage of an auto-discovery service.
It will trawl through their systems to find available data sources, work out the types and formats of data in each, and present them to the data user, who can then choose which to publish in the catalog. It doesn’t stop there — through auto-profiling, the solution will be able to automatically classify data, figuring out whether it contains social security numbers, names, addresses, zip codes, or other common types of data. As discussed in more detail in another previous blog post , IBM Data Catalog will also offer automated, real-time classification and enforcement of governance policies. This is currently a unique proposition, and resolves one of the major obstacles to scaling up the size and use of data management systems. Automated governance will remove the need for the Chief Data Officer (CDO)’s team to manually enforce governance policies, avoiding scalability issues as the number of data assets grows. Moreover, the governance dashboard will offer CDOs an aggregated view of enforcement across an organization, including requests for access and usage of assets. The scale and complexity of governance efforts usually grow alongside companies and their data, so these tools will represent a real game-changer in the building and use of data management systems. AND WHAT WILL HAPPEN BEHIND THE SCENES? Delivered via the cloud, IBM Data Catalog will give users the chance to no longer worry about scaling infrastructure. But let’s take a look behind the curtain to understand a few of the ways IBM will ensure seamless services, even when demand suddenly spikes. The IBM cloud provides load-balancers that can automatically distribute workload between the available application nodes, avoiding bottlenecks when one node gets busy. The cloud platform can also automatically scale horizontally, spinning up new nodes to deal with more data when demand exceeds a certain threshold. The result is that the user can enjoy stable response times, with little to no degradation of performance even during busy periods. For added resilience, nodes can also be deployed across data centers in multiple availability zones, protecting service continuity in the event of an outage at one location. The same scalability and resilience are provided at the storage layer, too. All Data Catalog metadata is stored in IBM Cloudant, where it is auto-replicated across nodes. This replication avoids the risk of having a single point of failure, helping to keep the Catalog available even in the event of a node failing. And for customers who choose to use Data Catalog not only as a metadata store, but also as a repository for the data itself, the solution harnesses IBM Cloud Object Storage to provide massive scalability for any volume of data. Finally, behind the scenes, IBM will have a team of specialists monitoring your infrastructure to catch and address any potential scalability issues. Utilizing the best of IBM’s technology to analyze and track key performance indicators such as CPU and memory usage, they will be notified of any emerging problems so they can take action before users feel the impact. LOOKING AHEAD In summary, Data Catalog has been engineered from both a functional and non-functional perspective to solve the real problems posed by scaling big data architectures. 
Instead of focusing purely on the storage of the data itself, Data Catalog addresses practical issues such as findability, usability and governance — helping you not only preserve and organize your data, but also allow users and data stewards to work with it more effectively. Learn more about IBM Data Catalog today -------------------------------------------------------------------------------- Originally published at www.ibm.com on September 14, 2017. * Data Management * Data Catalog * Cloud Services * Big Data A single golf clap? Or a long standing ovation?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingSUSANNA TAI Offering Manager, Watson Data Platform | Data Catalog FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Ever had to make a decision when you didn’t have the time, means or patience to look up all the data that could help you choose the best option? Yes, well, you’re not alone on that score. ",How to ease the strain as your data volumes rise,Live,102 272,"* R Views * About this Blog * Contributors * Some Resources * * R Views * About this Blog * Contributors * Some Resources * R FOR ENTERPRISE: HOW TO SCALE YOUR ANALYTICS USING R by Sean Lopp At RStudio, we work with many companies interested in scaling R. They typically want to know: * How can R scale for big data or big computation? * How can R scale for a growing team of data scientists? This post provides a framework for answering both questions. SCALING R FOR BIG DATA OR BIG COMPUTATION The first step to scaling R is understanding what class of problems your organization faces. At RStudio, we think of three use cases: data extraction, embarrassingly parallel problems, and analysis on the whole. Garrett Grolemund hosted an excellent webinar on Big Data in R , in which he outlined the differences in these three cases. DISCLAIMER: These three cases are not exhaustive, nor are most problems easily categorized into one of the three classes. But, when scoping a scaled R environment, it is imperative to understand which class needs to be enabled. Your organization might have all three cases, or it might have only one or two. CASE 1: COMPUTE ON THE DATA EXTRACT Example: I want to build a predictive model. I only need a few dozen features and a three-month window to build a good model. I can also aggregate my data from the transaction level to the user level. The result is a much smaller data set that I can use to train my model in R. Computing on data extracts is arguably the most common use case; an analyst will run a query to pull a subset of data from an external source into R. If your data extracts are large, you can run R on a server. At RStudio, we recommend using the server version of the IDE (either open-source or professional), but there are many ways to use R interactively on a server. CASE 2: COMPUTE ON THE PARTS Example: When I worked at a national lab (NREL), we validated fuel economy models against real-world datasets. Each dataset had hundreds of recorded trips from individual vehicles. While the total dataset was TBs, each individual trip was a few hundred MBs. We ran independent models in parallel against each trip. Each of these jobs added a single line to a results file. 
Then we aggregated the results with a reduction step (taking a weighted mean). By using an HPC system, a task that would take weeks to run sequentially was completed in a few hours. Compute on the parts happens when the analyst needs to run the same analysis over many subsets of data, or needs to run the same analysis many times, and each model is independent of the others. Examples include cross validation, sensitivity analysis, and model scoring. These problems are called: “embarrassingly parallel” (often a misnomer, since scaling for embarrassingly parallel problems is rarely embarrassingly simple). COMPUTE ON THE PARTS WITH A SINGLE MACHINE By default, R is single threaded; however, you can also use R packages to do parallel processing on a multicore server or a multicore desktop. Local parallelization is facilitated by packages like parallel, snow, foreach, etc. These packages parallelize your R commands by running them on independent threads in multicore processors. Alternatively, low-level parallelization can be facilitated with packages like Rcpp and RcppParallel. These packages facilitate the interaction of R with C++. COMPUTE ON THE PARTS WITH A HIGH PERFORMANCE CLUSTER (HPC) In some cases, R users have access to High Performance Computing environments. These environments are becoming more readily available with technologies like Docker Swarm. An R user will test R code interactively (on an edge node or their local machine), and then submit the R code to the cluster as a series of batch jobs. Each batch job will call R on a slave node. Note that RStudio, as an interactive IDE, may run on an edge node of the cluster or on a local machine. RStudio does not run on the slave nodes. Only R is run on the slave nodes and is executed in batch (not interactively). One challenge faced by R users is knowing how to submit batch jobs to the cluster, tracking their progress, and re-running jobs that fail. One solution is the batchtools package. This package abstracts the details of job submission and tracking into a series of R function calls. The R functions, in turn, use generic templates provided by system administrators. Parallel R with Batch Jobs provides a nice overview. Some analysts have created Shiny applications that leverage these functions to provide an interactive Job Management interface from within RStudio! One challenge faced by system administrators is ensuring the dependencies for the batch R script are available on all the slave nodes. Dependencies include: data access, the correct version of R, and correct versions of R packages. One solution is to store the R binaries and package libraries on shared storage (accessible by every slave node), alongside shared data and the project’s read/write scratch space. Case 2: Compute on the parts. Technologies: parallel, snow, RcppParallel, LSF, SLURM , Torque , Docker SwarmCASE 3: COMPUTE ON THE WHOLE Example: A recommendation engine for movies that is robust to “unique” tastes. The entire domain space needs to be considered all at once. Image classification falls into this class; the weights for a complex neural network need to be fit against the entire training set. This class of problem is the most difficult to solve, and has generated the most hype. Sometimes analysts will purchase, use, and modify ready-made implementations of these algorithms. Computing on the whole happens when the analyst needs to run a model against an entire dataset, and the model is not embarrassingly parallel or the data does not fit on a single machine. 
Typically, the analyst will leverage specialized tools such as MapReduce, SQL, Spark, H20.ai, and others. R is used as an orchestration layer. Orchestration involves using R to run jobs in other languages. R has a long history of orchestrating other languages to accomplish computationally intensive tasks. See Extending R by John Chambers. When orchestrating a case 3 problem, the R analyst will use R to direct an external computation engine that does the heavy lifting. This approach is very similar case 1. For example, Oracle’s Big Data Appliance and Microsoft SQL Server 2016 with R Server both include routines for fitting models in the database. These routines are accessible as specialized R functions. These functions are used in addition to case 1 extracts created with traditional SQL queries through RODBC or dplyr. Another example is Apache Spark. The R analyst will work from an edge node running R. (The open-source or professional RStudio Server can facilitate this interactive use.) In R, the user will call functions from a specialized R package, which in turn accesses Spark’s data processing and machine learning routines. One available R package is sparklyr. Note that the machine learning routines are not running in R. The analyst uses these routines as black boxes that can be pieced together into pipelines, but not modified directly. Case 3: Compute on the whole. Technologies: Hadoop, Spark, Tensorflow, In-DB computing (RevoScaleR, OracleR, Aster, etc)MULTIPLE USERS: SCALING R FOR TEAMS As organizations grow, another concern is how to scale R for a team of data scientists. This type of scale is orthogonal to the previous topic. Scaling for a team addresses questions like: How can analysts share their work? How can compute resources be shared? How does R integrate with the IT landscape? In many cases, these questions need to be answered even if the R environment doesn’t need to scale for big data. Scaling R for teams. Technologies: Version control (Git, SVN), miniCRAN, RStudio Server ProOpen-source packages can address many of these concerns. For example, many organizations use packrat and miniCRAN to manage R’s package ecosystem. The use of version control become increasingly important as teams grow and work together. Many companies will create internal R packages to facilitate sharing things like data access scripts, ggplot2 themes, and R Markdown templates. Airbnb provides a detailed example . For more information on version control, packrat, and packages, see the webinar series RStudio Essentials . At RStudio, we recommend using RStudio Server Pro because its features such as load balancing, multi-session support, collaborative editing and auditing are designed specifically to support a large numbers of user sessions. WRAP UP Whether you need to compute on big data, grow your analytic team, or do both, R has tools to help you succeed. As more companies look to data to drive business decisions, creating a scaleable R environment will be a critical step towards success. Many of the topics in this blog deserve their own posts. However, understanding and discussing these different types of scale can help create the correct roadmap. If you’ve created an R environment at scale, we’d love to hear from you. In a later post, we’ll address another outstanding question: after I scale the R platform, how do I scale the distribution of results and insights to non-R users? 
","by Sean LoppAt RStudio, we work with many companies interested in scaling R. They typically want to know:How can R scale for big data or big computation?How can R scale for a growing team of data scientists?This post provides a framework for answering both questions.Scaling R for Big Data or Big ComputationThe first step to",How to Scale Your Analytics Using R,Live,103 274,"THIS WEEK IN DATA SCIENCE (NOVEMBER 15, 2016) Posted on November 15, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * What Artificial Intelligence Can and Can’t Do Right Now – Lately the media has sometimes painted an unrealistic picture of the powers of AI. * Can deep learning help solve lip reading? – New research paper shows AI easily beating humans, but there’s still lots of work to be done. * Trump, Failure of Prediction, and Lessons for Data Scientists – The shocking and unexpected win of Donald Trump of presidency of the United States has once again showed the limits of Data Science and prediction when dealing with human behavior. * Understanding the Four Types of Artificial Intelligence – Machines understand verbal commands, distinguish pictures, drive cars and play games better than we do. How much longer can it be before they walk among us? * Google DeepMind’s AI learns to play with physical objects – Push it, pull it, break it, maybe even give it a lick. Children experiment this way to learn about the physical world from an early age. Now, artificial intelligence trained by researchers at Google’s DeepMind and the University of California, Berkeley, is taking its own baby steps in this area. * IBM’s Watson to use genomic data to defeat drug-resistant cancers – The five-year, $50 million project will study thousands of drug-resistant tumors. * A Day in the Life of a Data Engineer – This post is part of our Day in the Life of Data series, where our alumni discuss the daily challenges they work on at over 200 companies. * Machine-Learning Algorithm Quantifies Gender Bias in Astronomy – Calculation suggests papers with women first-authors have citation rates pushed down by 10 percent. * Delivering real-time AI in the palm of your hand – As video becomes an even more popular way for people to communicate, we want to give everyone state-of-the art creative tools to help you express yourself.
* Six Data Science Lessons from the Epic Polling Failure – Big data analytics suffered a huge setback on Tuesday when nearly every political poll failed to predict the outcome of the presidential election. * Deep learning is already altering your reality – If we’re living in an algorithmic bubble, we should know how it’s bending and coloring whatever rays of light we’re able to glimpse through it. * How IBM Watson May Help Solve Cancer Drug Resistance – We may soon know how cancer dodges powerful drugs and becomes resistant to them. * How to approach machine learning in the cloud – Machine learning needs lots of data, and the best place for all that data and the systems that use it is in the cloud. UPCOMING DATA SCIENCE EVENTS * Data Analysis with Spark – Come learn how to work with Big Data using Apache Spark on November 17th. * Apache Spark – Hands-on Session – Come join speakers Matt McInnis and Sepi Seifzadeh, Data Scientists from IBM Canada as they guide the group through three hands-on exercises using IBM’s new Data Science Experience to leverage Apache Spark. * Analytics Strategies in the Cloud – Join IBM and 2-time Canadian Olympic gold-medalist Alexandre Bilodeau for a complimentary event in Montreal where you’ll network, eat, drink and engage in an inspiring discussion on making business analytics easier and more available for all departments throughout your company. * Self service analytics in a flash with dashDB – Join this IBM Webinar on November 17th to learn about self service analytics. * The DNA of a Data Science Rock Star – Join us on November 29th for this latest Data Science Central Webinar and learn what skills, tools, and behaviors are emerging as the DNA of the Rock Star Data Scientist. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Our fortieth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (November 15, 2016)",Live,104 275,"BUILDING OFFLINE-FIRST, PROGRESSIVE WEB APPS -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Glynn Bird 11/8/16Glynn Bird Before joining IBM Cloud Data Services, Glynn served as the Head of IT and Development for Central Index, creating a white-label frontend for a NoSQL business directory (using PHP, Node.js, MySQL, Redis, Cloudant, and Redshift). His experience includes writing CRM systems, ""find my nearest"" indexes, e-commerce platforms, and a phone… Learn More Recent Posts * Building Offline-First, Progressive Web Apps In this article, I aim to summarise Progressive Web Apps and provide recommendations from my… * Plug into the Cloudant Node.js Library v1.5 Today marks version 1.5 of the Cloudant Node.js Library. 
The library comes with a new… * Importing JSON Documents with nosqlimport Introducing nosqlimport, an npm module to help you import comma-separated and tab-separated files into your… I’ve been creating websites for many years and I’ve watched the definition of “best practice” evolve over time. Web technology is a movable feast driven by: * Web users who consume the websites being built * Web developers who are tasked with building websites using the tools available * Browser developers who introduce new features into their products that developers can utilise * Standards committees who attempt to gain consensus between all the interested parties so that innovation happens in a way that is mutually beneficial Inevitably there are casualties along the way: standards or browser innovations that show promise but are little-used, fail to gain cross-platform consensus or are superseded by another round of innovation. “BEST PRACTICE IN A BOTTLE, YEAH” In this blog post I aim to summarise Progressive Web Apps (PWAs) , which seem to me to form a manifesto of best practices for the websites of today (November 2016.) The recommendations herein stem from my experience refactoring one of my apps in the summer of 2016. I hope they help you get started with your own PWA implementation. It’s important to note that this blog will only have a limited shelf life. In a year or two, the advice set out here will be out-of-date, perhaps laughably so, but that’s the nature of the beast. It’ “best practice” rolls onwards and the very programming language we use to pin it all together changes radically. WHAT ARE PWAS? The term “Progressive Web App” refers to a website that aims to provide a user experience akin to a native app. The “Progressive” bit refers to the web app selecting which technologies it engages depending on the capabilities of the platform the website is running on. On an older browser, a PWA may not have any special features, but on the latest Firefox or Chrome builds they may silently enable the modern APIs that those platforms afford. Simply put, a PWA aims to provide: * Responsive design – that displays well on mobile, tablet and desktop form factors * Offline rendering – where the web page can be viewed and used with no network connection * Offline-First storage – where data is stored locally on the device and synced to the cloud later * App-Like install – where a mobile user can save the web app to their desktop Pokedex.org by Nolan Lawson runs offline and can install to your phone like a normal app. See this write-up or the source on GitHub. Some of these aims are not new, but the PWA manifesto brings together this shopping list of best practice and offers APIs that web developers can use today. Many of the aims are technology-neutral and can be solved using a variety of tools. RESPONSIVE DESIGN There are any number of CSS frameworks that dictate the markup you can use to achieve a fluid, collapsible web interface that looks good on all devices. I have used Bootstrap for years but for this blog post, and in the interest of variety, I chose the Materialize library instead. Incorporating Google’s Material Design principles, Materialize makes it very simple to create a good-looking, responsive website that works well on mobile devices. OFFLINE RENDERING A standard website won’t function at all if there’s no network connection. Even if the network connection is patchy, such as when browsing on a mobile device, a site may struggle to deliver a satisfactory user-experience. 
The AppCache API allows websites to be aggressively cached on the device to the point where they can render with no network connectivity (as long as they were visited at least once on a previous occasion!). The AppCache API is an example of a solution that was designed by committee, received widespread browser adoption but was not widely loved by developers. It has been superseded by the Service Worker API . Service Workers are JavaScript tasks (a bit like server-side daemons but running on the client side) that are instantiated by web pages and from that point, can intercept and route traffic emanating from that page. The Service Worker API is much more flexible than AppCache as it allows the developer to decide in minute detail what happens to each client-side web request — but with flexibility comes complexity. OFFLINE-FIRST STORAGE Offline-First storage allows data to be stored in an in-browser database, giving your web application the opportunity to read and write data to and from its local database, even when offline. There are several solutions to this problem. IndexeDB , DexieJS , and SQLite are supported by a range of browsers, but my favourite in-browser database is PouchDB , which works on a wide variety of browsers and devices and provides the same API to you (the developer) while choosing the best in-browser storage technology at runtime. Making a website work on a range of browsers and platforms is hard enough, but in-browser storage varies greatly from browser to browser, and PouchDB smooths the path immeasurably. Wrote some code with @pouchdb today. Soooo easy, sooo simple. — Simon /\/\e†s0|\| (@drsm79) September 19, 2016 PouchDB also allows the in-browser database to be synced to a remote Apache CouchDB™ , IBM Cloudant or PouchDB database when there is network connectivity using the CouchDB replication protocol. The ability of CouchDB-like databases to allow the same data to be replicated, modified in different ways and re-synced without data-loss makes this an ideal solution for offline-first storage. APP-LIKE INSTALL Progressive Web Apps are not installed from an app-store like native apps; they are shared using URLs as the Web’s design intends. Once loaded on a phone’s browser, the URL can be added to the phone’s home screen, but implementations of this functionality vary between browsers and platforms. Google Chrome supports a manifest.json file that lists the application’s name, colours, icons and other metadata. BUILDING A PWA While I can’t share all the source code, I will include some snippets from my work refactoring my app earlier this summer. Here’s the toolkit I chose to produce the PWA features I was after: * Cloudant Envoy – to allow my one-database-per-user model to result in a single database on the server side * MaterializeCSS – for responsive CSS and markup. Other frameworks are available, of course. * jQuery – I’m not a full-time front-end developer. I understand jQuery, and I haven’t the time to learn one of the formal frameworks like Angular or ReactJS. * PouchDB – for in-browser storage and sync * LeafletJS – for maps and HTML5 geolocation * Mustache – for HTML templating * Simple Data Vis – absurdly simple visualisation library based on d3 The range of choices is bewildering. This list doesn’t represent the only way to build a PWA by any means, but it’s the tooling I was comfortable using. Don’t be overwhelmed! Here’s where I started with my PWA. I found it easiest to start with the front end in my app. 
I wrote my front end code assuming that the user in my application was authenticated and by hard-coding a few settings. Then I wrote my front end app to read and write its data from its local PouchDB database. I knew that with a few more lines of code I could get it to sync correctly, so that “solved problem” wasn’t one I needed to waste time on. If I could get an app to allow data to be added, edited and deleted on the client side, then the rest should fall into place. I also ignored the offline caching code until the last minute too. I assumed (correctly) that if I got my app working then I could add the Service Worker to provide a caching service at a later date. GETTING STARTED WITH CLOUDANT ENVOY In your blank directory create a new “package.json” file with: > npm init We can then add Cloudant Envoy : > npm install --save cloudant-envoy We are going to put our static website (index.html, JavaScript, CSS, images, etc.) in a “public” sub-directory and our Node.js app in “app.js”: > mkdir public > touch public/index.html > mkdir public/js > mkdir public/css > touch app.js Create your app in app.js : var path = require('path'), express = require('express'), router = express.Router(); // my custom API call router.post('/myapicall', function(req, res) { res.send({ok: true}); }); // setup Envoy to // - log incoming requests // - switch off demo app // - serve out our static files // - add our routes var opts = { logFormat: 'dev', production: true, static: path.join(__dirname, './public'), router: router }; // start up the web server var envoy = require('cloudant-envoy')(opts); envoy.events.on('listening', function() { console.log('[OK] Server is up'); }); The above code uses Envoy to start the web server and adds in: * Our “public” directory to be served out * Our custom API calls to be incorporated This design allows us to build a website that is static web server, handles API calls, and is a CouchDB-compatible replication target all in one go. In the client-side code, the app then uses PouchDB to create a database: var db = new PouchDB('mylocaldatabase'); That PouchDB database can then be used to store data: var mydata = { a:1, b:2, c: 'three'}; db.post(mydata).then(function(d) { console.log('Data saved to', d.id); }); When you need to sync the data, simply use the PouchDB replicate or sync tools: var remotedb = new PouchDB('https://username:password@mywebserver.myhost.com/envoy'); db.sync(remotedb); The URL you sync to depends on where your app is running. It could be https://username:password@myapp.mybluemix.net/envoy or http://localhost:8000/envoy . The database name (after the last slash) has to match the one that your app is using ( envoy is the default db name). CREATING USERS WITH ENVOY By default, Cloudant Envoy looks for users in its envoyusers database. Here’s what a user object looks like: { ""_id"": ""user123"", ""_rev"": ""1-89de8ebc2b1ad4385ced1f0ed29fa708"", ""type"": ""user"", ""name"": ""user123"", ""roles"": [], ""username"": ""user123"", ""password_scheme"": ""simple"", ""salt"": ""1d5d80c9-d925-4f1e-8114-ed44501c38a5"", ""password"": ""4809dcd4f8dd1cf16f592d90d518875d3c5916f8"", ""seq"": null, ""meta"": { ""user_name"": ""johnsmith"", ""facebook_id"": ""johnsmith88"", ""premium"": true } } Envoy can create users for you. 
In your code, simply call: var username = 'user123'; var password = 'mysecretpassword'; var meta = { ""user_name"": ""johnsmith"", ""facebook_id"": ""johnsmith88"", ""premium"": true }; Once added, the username-password combination should work for replication too. LOCAL DOCUMENTS If you need to store state locally that you don't want to be replicated to the remote replica, then simply store data to a document whose _id begins with _local/ , e.g.: var localstate = { _id: '_local/mystate', a:1, b:2}; db.put(localstate); Local documents are only stored on the device and are not included in the list of documents to be copied during replication. OFFLINE MAPS The Leaflet JavaScript library is easy enough to cache so that it works offline, but the map tiles themselves are pretty tricky: there's lots of them at lots of resolutions. The solution I developed was to use an empty map and add a GeoJSON layer that contained a rough outline of the world. For my application, I only need to geo-locate users approximately, and I didn't need every road, river and hill to be rendered on the map. To render the map, I created a Leaflet map: var mymap = L.map('mapid').setView([20, 0], 1); Then, I fetched the 250k GeoJSON file and rendered it on top: $.ajax({url: '/js/world.json', success: function(data) { var style = { color: ""#666"", fillColor: ""#66bb66"" }; L.geoJson(data, {style: style}).addTo(mymap); }}); If we cache the Leaflet CSS & JavaScript files together with the world.json file referenced in the snippet, then we have offline-first maps! CONCLUSION Progressive Web Apps give users a vastly improved experience when used with modern browsers: * The same app can be used on desktop and mobile browsers * Data is stored and retrieved from a local data set, so performance and battery life are excellent * Site assets can be cached locally, making the app available despite the network connection status * Apps can be distributed through URLs without app store submission and installation with much smaller application size Compliments? Complaints? Mild salutations? Direct them to @glynn_bird , and don't forget to have a look at PouchDB , Cloudant Envoy and the other tools here for your next Progressive Web App.",A summary of Progressive Web Apps and recommendations on refactoring code to use offline-first storage and other aspects of PWAs.,"Building Offline-First, Progressive Web Apps",Live,105 283,"DATA VISUALIZATION PLAYBOOK: USING FUNCTION TO DRIVE DESIGN November 24, 2015 by Jennifer Shin Topics: Big Data Technology Tags: big data, data analytics, data science, data scientist, data visualization, visualizations Data scientists must be selective when choosing what type of visualization to use with a data set. In particular, should we select a visualization before working with the data, or should the same type of visualization always accompany a particular type of data? To decide, let's explore another, more foundational question: Which comes first—the data, or the visualization? PUTTING A FACE ON DATA Consider a nonprofit organization that wishes to create a data visualization for an upcoming report.
The members of the board decide to create a visualization depicting the distribution of funding for all initiatives, across eight different types of projects. To do so, they commission a graphic designer to create a distinctive icon to represent each project. To emphasize the icons, the new visualization arranges them in the circular format shown in Figure 1. The icons are also ranked by amount of funding received, with each icon sized to scale. Figure 1: The distribution of funding for projects across the organization's major initiatives. IDENTIFYING THE PROBLEM However, for several reasons, the new visualization failed to accomplish the board's full purpose in creating it. ISSUE 1: THE MORE-IS-BETTER APPROACH Noticing that the visualization did not highlight information about the organization's two most important initiatives, health and education, the board commissioned two additional visualizations for only those initiatives in the same format, as shown in Figure 2a. Figure 2a: Two additional visualizations were created for the health and education initiatives, using the same format. ISSUE 2: FUNCTIONAL LIMITATIONS Each individual figure accomplished the board's objectives in commissioning it, using specially designed graphics to display the amount of funding for each project category. However, the visualizations proved less effective than expected and did not effectively communicate information to readers. Figure 2b: The three visualizations appeared separately in the report, making comparison of initiatives difficult. Once inserted into the report, each visualization filled half a page. Moreover, because the figures were separated by pages of text, comparison required readers to flip between visualizations. What's more, the images' visual similarity led readers astray, creating the impression that the visualizations represented similar data sets. However, the visualizations for the health and education initiatives were merely spotlights on two important portions of the whole amount, whereas the first visualization depicted the total amount for all initiatives—including the amounts broken down in the other two visualizations. Accordingly, some readers did not understand that the amounts given in the first visualization also included the amounts shown in the other two, a problem compounded because the largest icon in each visualization was sized the same as the largest icon in each other—yet represented a different dollar amount. Indeed, although two visualizations focused on health and education, no visualization similarly highlighted the remaining initiatives—employment, cultural and social. ITERATING TOWARD A SOLUTION When the board attempted to address the issue, several rounds of revisions ensued, each more closely approximating the board's intent. STEP 1: RETHINKING THE DESIGN The icons' circular layout in the initial visualization worked well for a set of icons displayed in a single figure but frustrated comparison of icons across several figures.
Blog Internet of Things data access and the fear of the unknown Blog Spark: The operating system for big data analytics Blog Graph databases catch electronic con artists in the act Blog InsightOut: Metadata and governance Blog New IBM DB2 release simplifies deployment and key management Podcast How is open source transforming graph analytics? Blog What is Hadoop? Blog The rise of NoSQL databasesView the discussion thread.IBM * Site Map * Privacy * Terms of Use * 2014 IBMFOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesMore * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesSearchEXPLORE BY TOPIC:Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Content Analytics Customer Analytics Entity Analytics Financial Performance Management Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Blog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analyticsMOREBlog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Interactive All data all the time: How mobile technology informs travelerhabits Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Blog The secret to enhancing customer engagement Blog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa BodellMOREBlog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Podcast How is open source transforming 
","Find out why redundant visualizations can turn detail into too much of a good thing, obscuring connections and diminishing contrast.",Data visualization: Function drives design,Live,106 285,"ERIK BERNHARDSSON WHEN MACHINE LEARNING MATTERS 2016-08-05 I joined Spotify in 2008 to focus on machine learning and music recommendations. It's easy to forget, but Spotify's key differentiator back then was the low-latency playback. People would say that it felt like they had the music on their own hard drive. (The other key differentiator was licensing — until early 2009 Spotify basically just had all kinds of weird stuff that employees had uploaded. In 2009 after a crazy amount of negotiation the music labels agreed to try it out as an experiment. But I'm getting off topic now.) Music distribution is a trivial problem now.
Put everything on a CDN and you’re done. The cost of bandwidth and storage has gone down by an order of magnitude, not to mention the labor cost needed to build and maintain it. Anyway, at some point in 2009 we realized that we had far bigger challenges at Spotify than building a music recommendation system. So instead, I switched gears and ran the “Analytics team” for 2 years. We did the first A/B tests, ad delivery optimizations, provided data points crucial to bizdev deals, etc. Not until 2013 did we feel like it was time to focus on music recs. So I switched back and built up a team around that. The feeling was that we already solved the “tablestakes” problems around music distribution and music management. Those problems had become easy to solve for anyone. The next differentiator would be more advanced features that deliver user value and are harder for competitors to copy. So we focused a lot on ML again. Which brings me to this conclusion In the majority of all products, machine learning will not be a key differentiator in the first five years. MOST MACHINE LEARNING IS SPRINKLES ON THE TOP The first few years of product iteration is about getting the “tablestakes” out of the way. The ROI of those are just vastly bigger. I lead the tech team at a startup and we are nowhere near using any kind of sophisticated machine learning, two years into the process. There are a few promising opportunities where we want to use it. I absolutely think it’s going to be a huge competitive advantage for us. But right now far more simpler things matter. Spending a few days working on the conversion funnel is guaranteed do deliver far more business value. Rarely is machine learning the fundamental enabler of a product. It’s often an enhancer . This unfortunately means that the machine learning team isn’t a team that creates the core business value and has a crucial strategic role. It will be the team that comes in after 5-10 years once the “basic” features have been built and then squeezes out another 10% MAU by A/B testing the crap out of the product. Despite the current AI hype, most of the big shops focus on relatively mundane things. Google is trying to get you to click on more ads, Facebook to use the newsfeed more. It’s all incremental improvements on top of a product that already existed for 10 years. Obviousy the image above has nothing to do with this post. I just thought it was funny. Sorry. PICK YOUR COMPETITIVE ADVANTAGE How can we get around this? How can we build a company that’s founded based on machine learning first? I suspect ML in itself is very rarely a competitive advantage. Any machine learning company needs to find a sustainable non-ML advantage. Do you have a fantastic set of image filters? Great, use that tiny head start, launch an app and build a social network. Do you have a really good fraud detection system? Go out and sign up enterprise customers that feed you data back. Machine learning can be a first mover advantage. But there’s a high likelihood whatever insight you have will be independently discovered and published at the next NIPS/KDD/ICML. You need to turn it into something sustainable — having data, or lots of users, or very sticky enterprise contracts, or something else. Besides the core machine learning, other technology can definitely be a competitive advantage. Building super nasty integrations with vendors, or figuring out the control engineering of the suspension system of a self driving car. Those are proprietary assets where there’s little open research. 
For the pure machine learning I think we'll see a separate force of commoditization of machine learning in those areas, where the technological differential between companies converges towards zero. Knowing how to build a convolutional neural network will not be a valuable asset. Hooking it up to a surveillance system and building a video distribution system could be a really key piece of technology. Don't underestimate the power of data. Scraping the web doesn't create a valuable asset. But if you can obtain highly valuable unique data then that's a huge competitive advantage. Another type of data I think people underestimate is in people's heads — learnings from real production usage. E.g. Netflix has iterated movie recommendations for 10 years. They know their shit. It's hard building a better recommender system even if you magically had ten times the data that Netflix has. What seems to happen in reality is that the human capital becomes the real asset. Here's a list of some acquisitions. It's clear to me these acquisitions were 90% acqui-hire — about human capital being redeployed to something else. Google and other big players have shown that they are willing to pay a huge premium for smart teams (throwing out a fun conspiracy theory just for the sake of it: Google is going to acqui-hire any team with smart people just to create a talent monopoly.) These companies all had built some cool tech, but the price paid really represented the scarcity of skills. I expect that scarcity to vanish gradually.",Machine learning is often the enhancer of a product.,When machine learning matters · Erik Bernhardsson,Live,107 285,"IBM-WATSON-DATA-LAB / PIXIEDUST TUTORIAL: USING NOTEBOOKS WITH PIXIEDUST FOR FAST, FLEXIBLE, AND EASIER DATA ANALYSIS AND EXPERIMENTATION va barbosa edited this page Aug 22, 2017 · 7 revisions Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations and visualizations. In this lab, you create a notebook via the IBM Data Science Experience to explore and visualize data to gain insight. We will be using PixieDust, an open source Python notebook helper library, to visualize the data in different ways (e.g., charts, maps, etc.) with one simple call. OVERVIEW In this tutorial, you will be learning about and using: * IBM Data Science Experience (DSX) * Jupyter Notebooks * PixieDust * Las Vegas Open Data The tutorial can be followed from a local Jupyter Notebook environment. However, the instructions and screenshots here walk through the notebook in the DSX environment.
A corresponding notebook is available here: https://gist.github.com/vabarbosa/dc1eeaa363e8534306a2f5e09270cfee You may access this tutorial at a later time and try it again at your own pace from here: http://ibm.biz/pixiedustlab Note : For best results, use the latest version of either Mozilla Firefox or Google Chrome. DSX DSX is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data. SIGN UP DSX is powered by IBM Bluemix, therefore your DSX login is same as your IBM Bluemix login. If you already have a Bluemix account or previously accessed DSX you may proceed to the Sign In section. Otherwise, you first need to sign up for an account. From your browser: 1. Go to the DSX site: http://datascience.ibm.com 2. Click on Sign Up 3. Enter your Email 4. Click Continue 5. Fill out the form to register for IBM Bluemix SIGN IN From your browser: 1. Go to the DSX site: http://datascience.ibm.com 2. Click on Sign In 3. Enter your IBMid or email 4. Click Continue 5. Enter your Password 6. Click Sign In JUPYTER NOTEBOOKS Jupyter Notebooks are a powerful tool for fast and flexible data analysis and can contain live code, equations, visualizations and explanatory text. CREATE A NEW NOTEBOOK You will need to create a noteboook to experiment with the data and a project to house your notebook. After signing into DSX: 1. On the upper right of the DSX site, click the + and choose Create project . 2. Enter a Name for your project 3. Select a Spark Service 4. Click Create From within the new project, you will create your notebook: 1. Click add notebooks 2. Click the Blank tab in the Create Notebook form 3. Enter a Name for the notebook 4. Select Python 2 for the Language 5. Select 2.0 for the Spark version 6. Select the Spark Service 7. Click Create Notebook You are now in your notebook and ready to start working. When you use a notebook in DSX, you can run a cell only by selecting it, then going to the toolbar and clicking on the Run Cell (▸) button. When a cell is running, an [*] is shown beside the cell. Once the cell has finished the asterisks is replaced by a number. If you don’t see the Jupyter toolbar showing the Run Cell (▸) button and other notebook controls, you are not in edit mode. Go to the dark blue toolbar above the notebook and click the edit (pencil) icon. PIXIEDUST PixieDust is an open source Python helper library that works as an add-on to Jupyter notebooks to extends the usability of notebooks. With interactive notebooks, a mundane task like creating a simple chart or saving data into a persistence repository requires mastery of complex code like this matplotlib snippet: To improve the notebook experience PixieDust simplifies much of this and provides a single display() API to visualize your data. UPDATE PIXIEDUST DSX already comes with the PixieDust library installed, but it is always a good idea to make sure you have the latest version: 1. In the first cell of the notebook enter: !pip install --upgrade pixiedust 2. Click on the Run Cell (▸) button After the cell completes, if instructed to restart the kernel, from the notebook toolbar menu: 1. Go to > Kernel > Restart 2. Click Restart in the confirmation dialog Note : The status of the kernel briefly flashes near the upper right corner, alerting when it is Not Connected , Restarting , Ready , etc. 
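If you want to double-check which PixieDust build the notebook will pick up after the upgrade and restart, one optional sanity check is to run a plain pip query in a cell. This is a generic pip command rather than part of the tutorial or the PixieDust API, so treat it as optional:

# Optional: confirm the version pip installed (run after the kernel restart)
!pip show pixiedust

The output lists the installed version; if it still shows the old one, the upgrade cell or the kernel restart probably did not complete.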
IMPORT PIXIEDUST Before, you can use the PixieDust library it must be imported into the notebook: 1. In the next cell enter: import pixiedust 2. Click on the Run Cell (▸) button Note : Whenever the kernel is restarted, the import pixiedust cell must be run before continuing. PixieDust has been updated and imported, you are now ready to play with your data! LAS VEGAS OPEN DATA You now need some data! Many cities are now making much of their data available. One such city is Las Vegas. Las Vegas Open Data is the online home of a large portion of the data the City of Las Vegas collects and makes available for citizens to see and use. It would be good to take a look at some data from the city of Las Vegas. More specifically, the Las Vegas Restaurant Inspections data. This dataset contains demerits, grades, etc from inspections of Las Vegas restaurants. LOAD THE DATA With PixieDust, you can easily load CSV data from a URL into a PySpark DataFrame in the notebook. In a new cell enter and run: inspections = pixiedust.sampleData(""https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv"") Remember to wait for the [*] indicator to turn into number, at which point the cell has completed running. Here, you are passing the URL of the Las Vegas Restaurant Inspections CSV file to PixieDust's sampleData API and store the resultant dataframe into an inspections variable. In the output, you will see logging from PixieDust as it downloads the files and creates the dataframe. VIEW THE DATA Now that you have the data into a dataframe in your notebook, it is time to take a look at it. With PixieDust's display API, you can easily view and visualize the data. In a new cell enter and run: display(inspections) The output from this cell is the PixieDust display output which includes a toolbar and a visualization area: By default, you will be presented with the Table View showing a sampling (100 rows max) of the data and the schema of the data, that is which columns are strings, integers, etc. FILTER THE DATA Looking at the restaurants data in the table, you may notice it contains entries for restaurants outside of Las Vegas. You can however, filter this to a subset of only Las Vegas restaurants. In a new cell enter and run: inspections.registerTempTable(""restaurants"") lasDF = sqlContext.sql(""SELECT * FROM restaurants WHERE city='Las Vegas'"") lasDF.count() Using a basic SQL query, you filtered the data and created a new dataframe with only restaurants in Las Vegas. The cell output is a count of entries specifically for Las Vegas. VISUALIZE THE DATA With your data ready to go, you can begin to visualize it as charts and not just a simple table. NUMBER OF RESTAURANTS BY CATEGORIES In a new cell enter and run: bycat = lasDF.groupBy(""category_name"").count() display(bycat) The result is a new table showing the number of entries by categories in the city of Las Vegas. From the PixieDust display output toolbar, you can view this data in multiple ways: 1. Click the Chart dropdown menu and choose Bar Chart 2. From the Chart Options dialog 1. Drag the category_name field and drop it into the Keys area 2. Drag the count field and drop it into the Values area 3. Set the # of Rows to Display to 1000 3. Click OK And just like that you have a bar chart showing the percentages of the entries with a given grade! RENDERING OPTIONS You can play around with the chart further to provide a better visual experience. PixieDust supports mulitple renderers, each with their own set of features and distinct look. 
The default renderer is matplotlib but you can easily switch to a different renderer. 1. Click the Renderer dropdown menu and choose bokeh 2. Toggle the Show Legend Bar Chart Option to show or hide the legend The result is a nice bar chart showing the count of the different categories of places to eat. It's probably no surprise that most are restaurants and bars. INSPECTION DEMERITS AND GRADES What if you wanted to visualize something a little more complex? What if you wanted to see the average number of inspection demerits per category clustered by the inspection grade? Give it a try! 1. In a new cell enter and run: display(lasDF) 2. Click the Chart dropdown menu and choose Bar Chart 3. From the Chart Options dialog 1. Drag the category_name field and drop it into the Keys area 2. Drag the inspection_demerits field and drop it into the Values area 3. Set the Aggregation to AVG 4. Set the # of Rows to Display to 1000 5. Click OK 4. Click the Renderer dropdown menu and choose bokeh 5. Click the Cluster By dropdown menu and choose inspection_grade 6. Click the Type dropdown menu and choose the desired bar type (e.g., stacked ) CURRENT DEMERITS VS INSPECTION DEMERITS You are not restricted to just bar charts. You can try other charts to gain additional insights and different perspective of the data. 1. Click the Options button to launch the Chart Options dialog 2. From the Chart Options dialog 1. Set the Keys to inspection_demerits 2. Set the Values to current_demerits 3. Set the # of Rows to Display to 1000 4. Click OK 3. Click the Chart dropdown menu and choose Scatter Plot 4. Select bokeh from the Renderer dropdown menu 5. Select inspection_grade from the Color dropdown menu What can be gathered from this chart? MAP THE DATA When looking at the sample data, you may have noticed it also includes the location data of the restaurants. Plotting these points on a map can also be done with PixieDust. ACCESS TOKEN For the Map renderers, a token is required for them to display properly. Currently, PixieDust has two map renderers (i.e, Google, MapBox). For this section of the tutorial, you will be using the MapBox renderer and thus a MapBox API Access Token will need to be created if you choose to continue. Open a new browser tab ( do not close the DSX browser tab ): 1. If you do not have an MapBox account, please Sign up for one: https://www.mapbox.com/studio/signup 2. If you are not already logged into MapBox, go to https://www.mapbox.com and Log in 3. Navigate to your MapBox account page: https://www.mapbox.com/studio/account 4. Click the API access tokens tab 5. Click Create a new token and give your new token a name 6. Click on Generate 7. Make note of your token 8. Return to your notebook in DSX but do not close the MapBox page just yet SHAPE THE DATA The current data includes the longitude/latitude in the location_1 field as a string like such: POINT (-114.923505 36.114434) However, the current Map renderers in PixieDust expect the longitude and latitude as separate number fields. The first thing you will need to do is parse the location_1 field into separate longitude and latitude number fields. Note : Python is indentation sensitive. Do not mix space and tab indentations. Either use strictly spaces or tabs for all indentations. The last character in the field name location_1 is the number 1 . 
In a new cell enter and run: from pyspark.sql.functions import udf from pyspark.sql.types import * def valueToLon(value): lon = float(value.split('POINT (')[1].strip(')').split(' ')[0]) return None if lon == 0 else lon if lon < 0 else (lon * -1) def valueToLat(value): lat = float(value.split('POINT (')[1].strip(')').split(' ')[1]) return None if lat == 0 else lat udfValueToLon = udf(valueToLon, DoubleType()) udfValueToLat = udf(valueToLat, DoubleType()) lonDF = lasDF.withColumn(""lon"", udfValueToLon(""location_1"")) lonlatDF = lonDF.withColumn(""lat"", udfValueToLat(""location_1"")) lonlatDF.printSchema() You should have a new dataframe ( lonlatDF ) with two new columns ( lon , lat ) which contain the longitude and latitude for the restaurant. VIEW THE MAP DATA You are ready to view the data on a map. 1. In a new cell enter and run: display(lonlatDF) 2. Click the Chart dropdown menu and choose Map 3. From the Chart Options dialog 1. Drag the lon field and the lat field and drop it into the Keys area 2. Drag the current_demerits field and drop it into the Keys area 3. Set the # of Rows to Display to 1000 4. Enter your access token from MapBox into the MapBox Access Token field. If you left the MapBox browser tab open you may return to it, copy the token and paste it here. 5. Click OK 4. Click the kind dropdown menu and choose choropleth You can move around the map and zoom into the various areas and get a quick glimpse of the restaurants current_demerits based on it's color on the map. SUMMARY Before finishing the tutorial and stepping away do not forget to sign out of DSX, MapBox, and close out of any additional tabs you opened up. In this tutorial, you covered some of the basics of visualizing data from a Jupyter Notebook with PixieDust in the IBM Data Science Experience. Visualization is just one aspect of PixieDust. PixieDust contains additional features such as a Package Manager , Spark Progress Monitor , and Scala Bridge to name a few. Likewise DSX has numerous tools to analyze your data. DSX tools make it easier to share, collaborate, and solve your toughest data challenges. Feel free to sign back into DSX at later time and continue with analyzing and visualizing this data further. Better yet, load and start experimenting with your own data. LINKS * PixieDust Hello World Lab http://ibm.biz/pixiedustlab * IBM Data Science Experience http://datascience.ibm.com * The Jupyter Notebook https://jupyter.org * I Am Not A Data Scientist https://medium.com/ibm-watson-data-lab/i-am-not-a-data-scientist-efe7ca6ceba2 * PixieDust https://ibm-watson-data-lab.github.io/pixiedust * Welcome to PixieDust https://apsportal.ibm.com/exchange/public/entry/view/5b000ed5abda694232eb5be84c3dd7c1 * Magic for Your Python Notebook https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook * Make Your Own Custom Visualization http://ibm.biz/pixiedustvis * FlightPredict II: The Sequel https://medium.com/ibm-watson-data-lab/flightpredict-ii-the-sequel-fb613afd6e91 * © 2017 GitHub , Inc. * Terms * Privacy * Security * Status * Help * Contact GitHub * API * Training * Shop * Blog * About You can't perform that action at this time. You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.","Create a notebook using IBM Data Science Experience using PixieDust to explore and visualize data in different ways (e.g., charts, maps, etc.) 
with one simple call.","Using Notebooks with PixieDust for Fast, Flexible, and Easier Data Analysis and Experimentation",Live,108 286,"Study Group Deep Learning Curriculum Blog Newsletter ArchiveTENSORFLOW QUICK TIPS by Malte Baumann on February 19, 2017 TENSORFLOW WAS THE NEW KID ON THE BLOCK WHEN IT WAS INTRODUCED IN 2015 AND HAS BECOME THE MOST USED DEEP LEARNING FRAMEWORK LAST YEAR. I JUMPED ON THE TRAIN A FEW MONTHS AFTER THE FIRST RELEASE AND BEGAN MY JOURNEY INTO DEEP LEARNING DURING MY MASTER'S THESIS. IT TOOK A WHILE TO GET USED TO THE COMPUTATION GRAPH AND SESSION MODEL, BUT SINCE THEN I'VE GOT MY HEAD AROUND MOST OF THE QUIRKS AND TWISTS. THIS SHORT ARTICLE IS NO INTRODUCTION TO TENSORFLOW, BUT INSTEAD OFFERS SOME QUICK TIPS, MOSTLY FOCUSED ON PERFORMANCE, THAT REVEAL COMMON PITFALLS AND MAY BOOST YOUR MODEL AND TRAINING PERFORMANCE TO NEW LEVELS. WE'LL START WITH PREPROCESSING AND YOUR INPUT PIPELINE, VISIT GRAPH CONSTRUCTION AND MOVE ON TO DEBUGGING AND PERFORMANCE OPTIMIZATIONS. PREPROCESSING AND INPUT PIPELINES KEEP PREPROCESSING CLEAN AND LEAN ARE YOU BAFFLED AT HOW LONG IT TAKES TO TRAIN YOUR RELATIVELY SIMPLE MODEL? CHECK YOUR PREPROCESSING! IF YOU'RE DOING ANY HEAVY PREPROCESSING LIKE TRANSFORMING DATA TO NEURAL NETWORK INPUTS, THOSE CAN SIGNIFICANTLY SLOW DOWN YOUR INFERENCE SPEED. IN MY CASE I WAS CREATING SO-CALLED 'DISTANCE MAPS', GRAYSCALE IMAGES USED IN ""DEEP INTERACTIVE OBJECT SELECTION"" AS ADDITIONAL INPUTS, USING A CUSTOM PYTHON FUNCTION. MY TRAINING SPEED TOPPED OUT AT AROUND 2.4 IMAGES PER SECOND EVEN WHEN I SWITCHED TO A MUCH MORE POWERFUL GTX 1080. I THEN NOTICED THE BOTTLENECK AND AFTER APPLYING MY FIX I WAS ABLE TO TRAIN AT AROUND 50 IMAGES PER SECOND. IF YOU NOTICE SUCH A BOTTLENECK THE USUAL FIRST IMPULSE IS TO OPTIMIZE THE CODE. BUT A MUCH MORE EFFECTIVE WAY TO STRIP AWAY COMPUTATION TIME FROM YOUR TRAINING PIPELINE IS TO MOVE THE PREPROCESSING INTO A ONE-TIME OPERATION THAT GENERATES TFRECORD FILES. YOUR HEAVY PREPROCESSING IS ONLY DONE ONCE TO CREATE TFRECORDS FOR ALL YOUR TRAINING DATA AND YOUR PIPELINE BOILS DOWN TO LOADING THE RECORDS. EVEN IF YOU WANT TO INTRODUCE SOME KIND OF RANDOMNESS TO AUGMENT YOUR DATA, ITS WORTH TO THINK ABOUT CREATING THE DIFFERENT VARIATIONS ONCE INSTEAD OF BLOATING YOUR PIPELINE. WATCH YOUR QUEUES A WAY TO NOTICE EXPENSIVE PREPROCESSING PIPELINES ARE THE QUEUE GRAPHS IN TENSORBOARD. THESE ARE GENERATED AUTOMATICALLY IF YOU USE THE FRAMEWORKS QUEUERUNNERS AND STORE THE SUMMARIES IN A FILE. THE GRAPHS SHOW IF YOUR MACHINE WAS ABLE TO KEEP THE QUEUES FILLED. IF YOU NOTICE NEGATIVE SPIKES IN THE GRAPHS YOUR SYSTEM IS UNABLE TO GENERATE NEW DATA IN THE TIME YOUR MACHINE WANTS TO PROCESS ONE BATCH. ONE OF THE REASONS FOR THIS WAS ALREADY DISCUSSED IN THE PREVIOUS SECTION. THE MOST COMMON REASON IN MY EXPERIENCE IS LARGE MIN_AFTER_DEQUEUE VALUES. IF YOUR QUEUES TRY TO KEEP LOTS OF RECORDS IN MEMORY, THEY CAN EASILY SATURATE YOUR CAPACITIES, WHICH LEADS TO SWAPPING AND SLOWS DOWN YOUR QUEUES SIGNIFICANTLY. OTHER REASONS COULD BE HARDWARE ISSUES LIKE TOO SLOW DISKS OR JUST LARGER DATA THAN YOUR SYSTEM CAN HANDLE. WHATEVER IT IS, FIXING IT WILL SPEED UP YOUR TRAINING PROCESS. GRAPH CONSTRUCTION AND TRAINING FINALIZE YOUR GRAPH TENSORFLOWS SEPARATE GRAPH CONSTRUCTION AND GRAPH COMPUTATION MODEL IS QUITE RARE IN DAY TO DAY PROGRAMMING AND CAN CAUSE SOME CONFUSION FOR BEGINNERS. 
THIS APPLIES TO BUGS AND ERROR MESSAGES, WHICH CAN OCCUR IN THE CODE FOR THE FIRST TIME WHEN THE GRAPH IS BUILT, AND THEN AGAIN WHEN IT'S ACTUALLY EVALUATED, WHICH IS COUNTERINTUITIVE WHEN YOU ARE USED TO CODE BEING EVALUATED JUST ONCE. ANOTHER ISSUE IS GRAPH CONSTRUCTION IN COMBINATION WITH TRAINING LOOPS. THESE LOOPS ARE USUALLY 'STANDARD' PYTHON LOOPS AND CAN THEREFORE ALTER THE GRAPH AND ADD NEW OPERATIONS TO IT. ALTERING A GRAPH WHILE CONTINUOUSLY EVALUATING IT WILL CREATE A MAJOR PERFORMANCE LOSS, BUT IS RATHER HARD TO NOTICE AT FIRST. THANKFULLY THERE IS AN EASY FIX. JUST FINALIZE YOUR GRAPH BEFORE STARTING YOUR TRAINING LOOP BY CALLING TF.GETDEFAULTGRAPH().FINALIZE() . THIS WILL LOCK THE GRAPH AND ANY ATTEMPTS TO ADD A NEW OPERATION WILL THROW AN ERROR. EXACTLY WHAT WE WANT. PROFILE YOUR GRAPH A LESS PROMINENTLY ADVERTISED FEATURE OF TENSORFLOW IS PROFILING. THERE IS A MECHANISM TO RECORD RUN TIMES AND MEMORY CONSUMPTION OF YOUR GRAPHS OPERATIONS. THIS CAN COME IN HANDY IF YOU ARE LOOKING FOR BOTTLENECKS OR NEED TO FIND OUT IF A MODEL CAN BE TRAINED ON YOUR MACHINE WITHOUT SWAPPING TO THE HARD DRIVE. TO GENERATE PROFILING DATA YOU NEED TO PERFORM A SINGLE RUN THROUGH YOUR GRAPH WITH TRACING ENABLED: # COLLECT TRACING INFORMATION DURING THE FIFTH STEP. IF GLOBAL_STEP == 5: # CREATE AN OBJECT TO HOLD THE TRACING DATA RUN_METADATA = TF.RUNMETADATA() # RUN ONE STEP AND COLLECT THE TRACING DATA _, LOSS = SESS.RUN([TRAIN_OP, LOSS_OP], OPTIONS=TF.RUNOPTIONS(TRACE_LEVEL=TF.RUNOPTIONS.FULL_TRACE), RUN_METADATA=RUN_METADATA) # ADD SUMMARY TO THE SUMMARY WRITER SUMMARY_WRITER.ADD_RUN_METADATA(RUN_METADATA, 'STEP%D', GLOBAL_STEP) AFTERWARDS A TIMELINE.JSON FILE IS SAVED TO THE CURRENT FOLDER AND THE TRACING DATA BECOME AVAILABLE IN TENSORBOARD. YOU CAN NOW EASILY SEE, HOW LONG AN OPERATION TAKES TO COMPUTE AND HOW MUCH MEMORY IT CONSUMES. JUST OPEN THE GRAPH VIEW IN TENSORBOARD, SELECT YOUR LATEST RUN ON THE LEFT AND YOU SHOULD SEE PERFORMANCE DETAILS ON THE RIGHT. ON THE ONE HAND, THIS ALLOWS YOU TO ADJUST YOUR MODEL IN ORDER TO USE YOUR MACHINE AS MUCH AS POSSIBLE, ON THE OTHER HAND, IT LETS YOU FIND BOTTLENECKS IN YOUR TRAINING PIPELINE. IF YOU PREFER A TIMELINE VIEW, YOU CAN LOAD THE TIMELINE.JSON FILE IN GOOGLE CHROMES TRACE EVENT PROFILING TOOL . ANOTHER NICE TOOL IS TFPROF , WHICH MAKES USE OF THE SAME FUNCTIONALITY FOR MEMORY AND EXECUTION TIME PROFILING, BUT OFFERS MORE CONVENIENCE FEATURES. ADDITIONAL STATISTICS REQUIRE CODE CHANGES. WATCH YOUR MEMORY PROFILING, AS EXPLAINED IN THE PREVIOUS SECTION, ALLOWS YOU TO KEEP AN EYE ON THE MEMORY USAGE OF PARTICULAR OPERATIONS, BUT WATCHING YOUR WHOLE MODELS MEMORY CONSUMPTION IS EVEN MORE IMPORTANT. ALWAYS MAKE SURE, THAT YOU DON'T EXCEED YOUR MACHINE'S MEMORY, AS SWAPPING WILL MOST CERTAINLY SLOW DOWN YOUR INPUT PIPELINE AND YOUR GPU STARTS WAITING FOR NEW DATA. A SIMPLE TOP OR, AS EXPLAINED IN ONE OF THE PREVIOUS SECTIONS, THE QUEUE GRAPHS IN TENSORBOARD SHOULD BE SUFFICIENT FOR DETECTING SUCH BEHAVIOR. DETAILED INVESTIGATION CAN THEN BE DONE USING THE AFOREMENTIONED TRACING. DEBUGGING PRINT IS YOUR FRIEND MY MAIN TOOL FOR DEBUGGING ISSUES LIKE STAGNATING LOSS OR STRANGE OUTPUTS IS TF.PRINT . DUE TO THE NATURE OF NEURAL NETWORKS, LOOKING AT THE RAW VALUES OF TENSORS INSIDE OF YOUR MODEL USUALLY DOESN'T MAKE MUCH SENSE. NOBODY CAN INTERPRET MILLIONS OF FLOATING POINT NUMBERS AND SEE WHATS WRONG. BUT ESPECIALLY PRINTING OUT SHAPES OR MEAN VALUES CAN GIVE GREAT INSIGHTS. 
IF YOU ARE TRYING TO IMPLEMENT SOME EXISTING MODEL, THIS ALLOWS YOU TO COMPARE YOUR MODEL'S VALUES TO THE ONES IN THE PAPER OR ARTICLE AND CAN HELP YOU SOLVE TRICKY ISSUES OR EXPOSE TYPOS IN PAPERS. WITH TENSORFLOW 1.0 WE HAVE BEEN GIVEN THE NEW TFDEBUGGER , WHICH LOOKS VERY PROMISING. I HAVEN'T USED IT YET, BUT WILL DEFINITELY TRY IT OUT IN THE COMING WEEKS. SET AN OPERATION EXECUTION TIMEOUT YOU HAVE IMPLEMENTED YOUR MODEL, LAUNCH YOUR SESSION AND NOTHING HAPPENS? THIS IS USUALLY CAUSED BY EMPTY QUEUES, BUT IF YOU HAVE NO IDEA, WHICH QUEUE COULD BE RESPONSIBLE FOR THE MISHAP THERE IS AN EASY FIX: JUST ENABLE THE OPERATION EXECUTION TIMEOUT WHEN CREATING YOUR SESSION AND YOUR SCRIPT WILL CRASH WHEN AN OPERATION EXCEEDS YOUR LIMIT: CONFIG = TF.CONFIGPROTO() CONFIG.OPERATION_TIMEOUT_IN_MS=5000 SESS = TF.SESSION(CONFIG=CONFIG) USING THE STACK TRACE YOU CAN THEN FIND OUT, WHICH OP CAUSES YOUR HEADACHE, FIX THE ERROR AND TRAIN ON. -------------------------------------------------------------------------------- I HOPE I COULD HELP SOME OF MY FELLOW TENSORFLOW CODERS. IF YOU FOUND AN ERROR, HAVE MORE TIPS OR JUST WANT TO GET IN TOUCH, PLEASE SEND ME AN EMAIL! SIGN UP TO RECEIVE MORE CONTENT LIKE THIS PLUS INDUSTRY NEWS, CODE AND TUTORIALS EVERY WEEK FRESH TO YOUR INBOX. No spam. One-click unsubscribe.",A weekly newsletter about the latest developments in Deep Learning.,TensorFlow Quick Tips,Live,109 289,"PIXIEDUST: MAGIC FOR YOUR PYTHON NOTEBOOK David Taieb / October 11, 2016As any data scientist knows, Python notebooks are a powerful tool for fast and flexible data analysis. But the learning curve is steep, and it’s easy to get blank page syndrome when you’re starting from scratch. Thankfully, it's easy to save and share notebooks. However, even for seasoned data scientists or developers, modifying an existing notebook can be daunting. GOT SYNTAX? Data science notebooks were first popularized in academia, and there are some formalities to work through before you can get to your analysis. For example, in a Python interactive notebook, a mundane task like creating a simple chart or saving data into a persistence repository requires mastery of complex code like this matplotlib snippet: All this for a chart?Once you do create a notebook that provides great data insights, it's hard to share with business users, who don’t want to slog through all that dry, hard-to-read code, much less tweak it and collaborate. PixieDust to the rescue. To improve the notebook experience and ease collaboration, I created an open source Python helper library that works as an add-on to Jupyter notebooks. FRIENDLIER DATA SCIENCE NOTEBOOKS When I watched data scientists and developers work with Python noteboooks, I thought it shouldn't be so difficult. PixieDust fills feature gaps that made notebooks too challenging for certain users and scenarios. Six quick benefits of PixieDust (no sound).PixieDust extends the usability of notebooks with the following features: * packageManager lets you install spark packages inside a Python notebook. This is something that you can't do today on hosted Jupyter notebooks, which prevents developers from using a large number of spark package add-ons. * visualizations. One single API called display() lets you visualize your spark object in different ways: table, charts, maps, etc…. Much easier than matplotlib (but you can still use matplotlib, if you want). This module is designed to be extensible, providing an API that lets anyone easily contribute a new visualization plugin. 
This sample visualization plugin uses d3 to show the different flight routes for each airport: * Export. Share and save your data. Download to .csv, html, json, etc. locally on your laptop or into a variety of back-end data sources, like Cloudant, dashDB, GraphDB, etc. * Scala Bridge. Use scala directly in your Python notebook. Variables are automatically transfered from Python to Scala and vice-versa. * Extensibility. Create your own visualizations using the pixiedust APIs. If you know html and css, you can write and deliver amazing graphics without forcing notebook users to type one line of code. * Apps. Allow nonprogrammers to actively use notebooks. Transform a hard-to-read notebook into a polished graphic app for business users. Check out these preliminary sample apps: * An app can feature embedded forms and responses, like flightpredict , which lets users enter flight details to see the likelihood of landing on-time. * Or present a sophisticated workflow, like our twitter demo , which delivers a real-time feed of tweets, trending hashtags, and aggregated sentiment charts with Watson Tone Analyzer. TRY IT See for yourself. You can play with pixiedust right now online via IBM's Data Science Experience. To get a look at the features you just read about, follow these steps: 1. Visit the IBM Data Science Experience and log in with your Bluemix account credentials or sign up. 2. If prompted, create an instance of the Apache Spark service. Data Science Experience may generate a Spark instance for you automatically. If not, you'll be prompted to instantiate your own. You'll need it to run your Python code. 3. Create a new notebook. On the upper left of the screen, click the hamburger menu to reveal the left menu. Then click New > Notebook . Click From URL , enter a name, and in the Notebook URL field, enter https://github.com/ibm-cds-labs/pixiedust/raw/master/notebook/Intro%20to%20PixieDust.ipynb 4. Create or select your Spark instance. If you don't already have the Spark service up, Data Science Experience prompts you to instantiate it. You'll need it to run your Python code. 5. Your new notebook opens. Run each cell in order to see a few PixieDust features If you get an error: PixieDust is preinstalled on Data Science Experience. If you get an error in cell 2, insert a cell at the top of the notebook and enter and run the following code: !pip install --user --no-deps --upgrade pixiedust Then restart the kernel and run cells above. If you get other errors, it's always a good idea to restart the kernel and try again. JOIN US PixieDust is an open source project. Join the conversation and contribute. You'll find lots of guidance in our repo's wiki with more to come. Write your own app or visualization plugin. Pull requests welcome! Visit PixieDust's GitHub repo . Later this month, I’m speaking at World of Watson [ 1 , 2 ]. Join me there to learn more about pixiedust. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: Apache Spark / data science / Data Science Experience / IBM Analytics for Apache Spark / IPython / Jupyter / matplotlib / Notebooks / PixieDust / Python Please enable JavaScript to view the comments powered by Disqus. 
blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Graph * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Object Storage * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","An open source helper library for your Jupyter Python notebook with easier data viz & export, package manager, and Scala context from within Python!",PixieDust: Magic for Your Python Notebook,Live,110 291,"Homepage Follow Sign in / Sign up Homepage * Home * Archive * Greg Filla Blocked Unblock Follow Following Product manager. Data scientist. I like coding for data stuff Oct 19 -------------------------------------------------------------------------------- TIDY UP YOUR JUPYTER NOTEBOOKS WITH SCRIPTS Over the past few years, we have seen the transition from scripts to notebooks for data scientists. Jupyter notebooks are quickly becoming the preferred data science IDE. These notebooks are perfect for writing short code blocks to interact with data, but what happens when your project grows? At Data Science Experience, we see notebooks as the primary way data scientists want to code… but not all code should stay in the notebook. Helper functions, Classes , messy visualization code — all the necessary bits that we do not need to include in a notebook that could be used for a presentation to communicate results. Let’s start cleaning up our notebooks. Image from Pixabay.com licensed under CC BY 2.0First, I will describe how to take an existing .py script or package and use it in IBM Data Science Experience (DSX). Then, I’ll show my approach for setting up projects to facilitate clean notebooks. IMPORTING EXISTING PYTHON SCRIPTS IN DSX DSX offers a collaborative enterprise data science environment in the cloud, but many times it’s necessary to migrate existing scripts for use in DSX projects. Here are options for using a locally developed script in DSX: 1. Copy/paste code from local file into a notebook cell. At the top of this cell add %%writefile .py - this will save the code as a Python file in your GPFS working directory (GPFS is the file system that comes with the DSX Spark Service). Any notebooks using the same Spark Service instance will be able to access this file for importing. 2. Load the Python script into Object Storage. You can use Insert to Code , then take the string and write to a file in GPFS that can then be accessed the same way. I recommend option 1 because it allows you to continue to tweak code and update the script written to GPFS from this notebook. I will go into this in more detail in the section below on setting up your project. IMPORTING EXISTING PACKAGES IN DSX The methods above work great if you just have a single script that you need to import from (or execute) from inside a notebook. If you have a Python package, the following options are available for importing in DSX: Pre-req: * Python — Package up your code ( Here is an example of a simple package I wrote) * R — Package up your code (great post from Hilary Parker) . 
Check out a simple R package example here 1. Put it in a repo and install. This can be accomplished from a public or private GitHub repository (I’m sure others as well, but I have only used GitHub). Pip installing from a public repo looks like this: !pip install git+https://github.com/gfilla/dsxtools.git If you need to install from a private GitHub repository, it looks like this: !pip install git+https://:@github..com//.git --ignore-installed You get your personal_access_token from Settings > Personal Access Tokens > Generate new token. You need to give repo access to this token. For R — use this syntax for installing a package from GitHub: install.packages('devtools') library(devtools) install_github('/') #installs the package library('') #loads the package for use 2. Zip it up and load from Object Storage. This is similar to option 2 above, this time we zip up the directory with the Python package and load in an Object Storage container. Here you can use this code to get/save the zip . Once you have installed/saved the package in GPFS, you are good to start importing inside your notebook! BRING THIS ALL TOGETHER IN A DSX PROJECT At this point, you should feel confident in importing existing Python code for use in a DSX project. Let’s build on this to review one method for building out a larger project. My notebooks at the start of a new projectWhen I start a new project and have a clear vision for my goals, I will start with a “Class” notebook. This will be the notebook I will work in mostly for the early stages of my project. This notebook will be the messiest of all notebooks through most of the project lifecycle, but at the end it will be the cleanest — only including the code for the classes I will use for the project. Each cell in this notebook contains a class, we can easily write each of these cells to a Python script using the %%writefile method described above. Other notebooks in this project will import these classes to access the methods to have overall much cleaner code. An example of a cell in my “Class” notebookYou may be asking yourself why you should use this method instead of just using an IDE intended for writing larger Python programs. That is a fair question — and some projects can definitely require that approach. I prefer staying inside notebooks for class development for the same reason I use them for data analysis. I can quickly tweak my class and have any experimental code in subsequent code cells to fix any bugs (this is where it can get messy). To complete the example, I’ll show how one of these classes is used in my other notebooks. At this point, if you are new to Python and have not used classes I recommend checking out the documentation to see how they can be incorporated in your code. After executing the cell where I write the Python class to GPFS, I can simply import using syntax from import So clean..Since cnnParser is the name of my class, I instantiate an instance in the cnn variable. A very nice benefit of using classes and hanging methods on the class is that Jupyter shortcuts are available to view all methods/attributes of the class (Shift + Tab to get the view in the screenshot). If you didn’t know about this shortcut — check out this post . -------------------------------------------------------------------------------- That should be enough to get started using scripts, packages, and notebooks together in a complementary way. If you know any tips/tricks I missed please let me know ! 
Happy coding :-) * Python * Dsx * Data Science * Jupyter Notebook Blocked Unblock Follow FollowingGREG FILLA Product manager. Data scientist. I like coding for data stuff FollowIBM DATA SCIENCE EXPERIENCE Master the art of data science * * * * Never miss a story from IBM Data Science Experience , when you sign up for Medium. Learn more Never miss a story from IBM Data Science Experience Get updates Get updates",Learn how to use scripts and external packages in Jupyter notebooks to facilitate code organization for larger projects.,Tidy up your Jupyter notebooks with scripts,Live,111 295,"Skip navigation Upload Sign in SearchLoading... Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. WATCH QUEUE QUEUE Watch Queue Queue * Remove all * Disconnect 1. Loading... Watch Queue Queue __count__/__total__ Find out why CloseBUILDING CUSTOM MACHINE LEARNING ALGORITHMS WITH APACHE SYSTEMML Apache Spark Subscribe Subscribed Unsubscribe 15,637 15KLoading... Loading... Working... Add toWANT TO WATCH THIS AGAIN LATER? Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics 172 views 3LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 4 0DON'T LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 1Loading... Loading... TRANSCRIPT The interactive transcript could not be loaded.Loading... Loading... Rating is available when the video has been rented. This feature is not available right now. Please try again later. Published on Jun 16, 2016 * CATEGORY * Science & Technology * LICENSE * Standard YouTube License Loading... Autoplay When autoplay is enabled, a suggested video will automatically play next.UP NEXT * Breakthroughs in Machine Learning - Google I/O 2016 - Duration: 28:28. Google Developers 7,190 views 28:28 -------------------------------------------------------------------------------- * Python+Machine Learning tutorial - Introduction - Duration: 1:11:53. Microsoft Research 26 views 1:11:53 * Toward Causal Machine Learning - Duration: 57:33. Microsoft Research 1 view 57:33 * Stuff machine learning, let’s talk about climate change. - Duration: 28:38. Microsoft Research 175 views 28:38 * Machine Learning Algorithms Workshop - Duration: 1:39:55. Microsoft Research 79 views 1:39:55 * Machine learning is not the future - Google I/O 2016 - Duration: 39:00. Google Developers 17,976 views 39:00 * Livy: A REST Web Service For Apache Spark - Duration: 21:29. Apache Spark 430 views 21:29 * Machine Learning Algorithms – Part 1 - Duration: 15:53. Microsoft Azure 74 views * New 15:53 * Machine learning for algorithmic trading w/ Bert Mouler - Duration: 1:03:17. Chat With Traders 6,586 views 1:03:17 * Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings - Duration: 29:13. Apache Spark 144 views 29:13 * Making Machine Learning Reproducible with CodaLab - Duration: 37:21. Microsoft Research 15 views 37:21 * Symposium: Deep Learning - Max Jaderberg - Duration: 20:09. Microsoft Research 149 views 20:09 * Smart Monitoring of Logs: ELK-Elastic Search, Logstash, Kibana: Anania M. and Edgar T. | Synergy - Duration: 48:24. Barcamp Yerevan 138 views 48:24 * #56 Data Science from Scratch - Duration: 51:04. Talk Python 36 views 51:04 * Elasticsearch And Apache Lucene For Apache Spark And MLlib - Duration: 33:44. Apache Spark 136 views 33:44 * Managed Dataframes And Dynamically Composable Analytics: The Bloomberg Spark Server - Duration: 28:34. 
Apache Spark 70 views 28:34 * Jose Quesada - A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons - Duration: 38:47. PyData 784 views 38:47 * Crate.io and CaseZero @Ticketmaster June 14, 2014 - Duration: 1:45:45. Carl Mullins 90 views 1:45:45 * GPU Computing With Apache Spark And Python - Duration: 17:35. Apache Spark 73 views 17:35 * Diving into Machine Learning - by Rob Craft, Group Product Manager at Google - Duration: 59:19. Startupfood 913 views 59:19 * Loading more suggestions... * Show more * Language: English * Country: Worldwide * Restricted Mode: Off History HelpLoading... Loading... Loading... * About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Try something new! * Loading... Working... Sign in to add this to Watch LaterADD TO Loading playlists...",What is Apache SystemML? Demo! How to get SystemML.,Building Custom Machine Learning Algorithms With Apache SystemML,Live,112 297,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (February 28, 2017) * This Week in Data Science (February 21, 2017) * Learn how to use R with Databases * This Week in Data Science (February 14, 2017) * This Week in Data Science (February 7, 2017) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsBLOGROLL * RBloggers THIS WEEK IN DATA SCIENCE (FEBRUARY 28, 2017) Posted on February 28, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * http://www.ibmbigdatahub.com/blog/four-perspectives-data-lakes – The relation of architecture, value, innovation and governance to data lakes. * Fueling the Gold Rush: The Greatest Public Datasets for AI – A run down of some public datasets for Artificial Intelligence. * Pandas Cheat Sheet – Python for Data Science – Cheat sheet for one of the most popular data science packages. * 17 More Must-Know Data Science Interview Questions and Answers, Part 2 – Additional must-know questions for data science interviews. * IBM, Northern Trust partner on financial security blockchain tech – IBM and Northern Trust partner to develop blockchain technology for the management of private equity funds and services. * The Origins of Big Data – A perspective summary of the field and use of the term Big Data. * How is Deep Learning Changing Data Science Paradigms? – A look at the rise of Deep Learning and its effect on Data Science Paradigms. * Melbourne IBM Research team using Watson AI to identify glaucoma – Melbourne-based IBM research team trains Watson to identify eye abnormalities. * Removing Outliers Using Standard Deviation in Python – How to remove outliers using a well known but underutilized metric. * R Packages worth a look – A short list and summaries of R statistical and graphical packages. * 25 Big Data Terms Everyone Should Know – Big Data Terms and concepts as an introduction to the field. * Moving from R to Python: The Libraries You Need to Know – Python packages and their R contemporaries. 
* Predicting the 2017 Oscar Winners – Using Machine Learning to predict the winners at the 89th annual Academy of Motion Picture Arts and Sciences Awards. * How To Hire A Data Scientist: 5 Don’ts For Data Scientist Interview Questions – How hiring managers can land a proficient data scientist. * Artificial intelligence: Understanding how machines learn – The current limits of Artificial Intelligence and Machine Learning. UPCOMING DATA SCIENCE EVENTS * IBM Webinar: Are you getting enough value from your relational database? – March 1, 2017 @ 1:00 pm – 2:00 pm * IBM Webinar: Art of the Possible…and the Reality of Execution – March 2, 2017 @ 1:00 pm – 2:00 pm FEATURED COURSES FROM BDU * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. COOL DATA SCIENCE VIDEOS * Deep Learning with TensorFlow Course Summary – A summary of our free course here at BDU Deep Learning with TensorFlow. * Deep Learning with Tensorflow – Deep Belief Networks – An overview of Deep Belief Networks. * Deep Learning with Tensorflow – Autoencoder Structure –An overview of the structure and applications of an Autoencoder. * Deep Learning with Tensorflow – Autoencoders with TensorFlow –Tutorial on how to implement an Autoencoder using TensorFlow. * Deep Learning with Tensorflow – Introduction to Autoencoders – The basic concepts of Autoencoders – a type of neural network. * SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * * RELATED Tags: analytics , Big Data , data science -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (February 28, 2017)",Live,113 298,"Compose The Compose logo Articles Sign in Free 30-day trialUSE ALL THE DATABASES - PART 1 Published Mar 2, 2017 graphql developing writestuff Use all the Databases - Part 1Loren Sands-Ramshaw, author of GraphQL: The New REST , shows how to combine data from multiple sources using GraphQL in this Write Stuff two-part series. Ever wanted to use a few different databases to build your app? Different types of databases are meant for different purposes, so it often makes sense to combine them. You might be hesitant due to the complexity of maintenance and coding, but it can be easy if you combine Compose and GraphQL: instead of writing a number of complex REST endpoints, each querying multiple databases, you set up a single GraphQL endpoint that provides whatever data the client wants using your simple data fetching functions. This tutorial is meant for anyone who provides or fetches data, whether it’s a backend dev writing an API (in any language) or a frontend web or mobile dev fetching data from the server. We’ll learn about the GraphQL specification, set up a GraphQL server, and fetch data from five different data sources. 
The code is in Javascript, but you’ll still get a good idea of GraphQL without knowing the language. In this first part, we'll look at the databases that will be involved. Then we'll introduce GraphQL before moving on to the query we want to make, the schema we need to create and how to setup the server to make that all happen. In part two, we'll look at resolving queries on SQL, Elasticsearch, MongoDB, Redis and REST data sources and a look at how to get the best performance before calling things done. Part 1 * The databases * GraphQL intro * The query * The schema * Server setup Part 2 * Resolvers * SQL * Elasticsearch * MongoDB * Redis * REST * Performance * Done! THE DATABASES We at Chirper Fictional, Inc. were building a Twitter clone, and decided to use these databases: * 💾 PostgreSQL : Because like most apps, our data was relational, and our boss said that the database we wanted to use (RethinkDB) was too new to be trusted 😔. * 💾 Redis : We wanted to cache frequently-used data, like the public feed, so we could get it quickly and reduce the read load on Postgres. * 💾 Elasticsearch : A database built for searching that would function better and scale better than searching Postgres. * 💾 REST : We wanted to show our users tweets from their area, and we didn't want to prompt for GPS permissions or pay for a MaxMind IP address database, so we found a REST API for geolocating IP addresses. * 💾 MongoDB : We wanted to track some user stats, and we didn't need them to be in the main app database. We put the intern on this, and while he could have just used a second Postgres DB, he used Mongo because he heard it was Web Scale. And we didn't mind because we didn't need ACID or JOINs for our stats. Now we need a way to combine the data from all of these sources together in whichever ways our clients want it, and the best way to do this is with GraphQL. GRAPHQL INTRO Gotta be honest here... for the first few months of GraphQL's short existence (it launched in July 2015), I thought GraphQL was a query language for accessing your Facebook friend graph 😳. Turns out that’s FQL, and GraphQL is a replacement for REST! And sorry REST, but GraphQL is kinda better than you for most things 😁. Here's why: * ✅ Easier to consume : The GraphQL client's job is super simple—just write the data fields you want filled in. When you send the query string like the one on the left side of the image, you get back the JSON response on the right, with the same structure you asked for. Instead of sending multiple REST requests (sometimes multiple round trips in series), you can send a single GraphQL request. And instead of getting more or less data than you need from the REST endpoints, you get exactly the data you ask for. * ✅ Easier to produce : On the GraphQL server you write resolvers —functions that resolve a field to its value; for instance for the above, there's a user() function that responds to the user(id: 1) query and returns user #1's SQL record. One nice thing is that they work at any place in the query—looking up the current user's first name ( user.firstName is ""Maurine"" in the above example) at the top level runs the same code as looking up the author of a tweet that mentions her name ( user.mentions[0].author.firstName happens to also be ""Maurine"" ), nested in the query heirarchy ( more info on this ). Also, sometimes with REST you have endpoints talking to multiple databases. A GraphQL server is more organized, since in most cases each resolver talks to a single data source. 
Credit: Jonas Helfer * ✅ Types and introspection : Each query has a typed schema ( User , Tweet , String , Int , etc). At first it may seem like extra work, but it means that you get better error messages, query linting , and automatic server response mocking . It also has introspection—a standard method of querying the server to ask what queries it supports (and their schemas)—which is what powers Graph i QL (with an i and pronounced, “graphical”), the in-browser auto-documented GraphQL IDE described later in this article. * ✅ Version free : Because the client decides what data it wants, you can easily support many different client versions. Instead of versioning your endpoints (eg GET /api/v2/user), when you add new features, you simply add more fields. When you sunset old features, the associated fields can be deprecated but continue to function. Fear not—you don't need to rewrite all your REST servers: you can instead add a simple GraphQL server in front of them, as we'll see with the REST data source example below. Note: you can of course also change data with GraphQL (with functions called mutations ), but I won't be covering that in this post. THE QUERY Let's figure out the query that we'll need for our app's home dashboard. First, here are the things we'd like to display: * Your name and photo (SQL) * Recent tweets that mention your name (Elasticsearch) * Most recent few tweets worldwide (Redis) * Recent tweets in your city (REST to geolocate and then SQL) * For each tweet, the number of times it has been viewed (Mongo) For each tweet, we'll want to display the text of the tweet, the author's name and photo, and when it was created. For the mentions and city feeds, we also want the number of times the tweets were viewed and from what city they were made. Now to make the query, we write out the pieces of data we need in order to display the above list, choosing names for each field and putting it in a JSON-like format! 😄 const queryString = ` { user(id: 1) { firstName lastName photo mentions { text author { firstName lastName photo } city views created } } publicFeed { text author { firstName lastName photo } created } cityFeed { text author { firstName lastName photo } city views created } } ` We'll put mentions as a field of the user query instead of at the top level because we'll need to the user's name in order to query Elasticsearch, and we'll have their name from the first step of the user query (we'll see how this looks when we implement it). Parentheses are used to pass arguments—for simplicity's sake, we're passing our own user id with (id: 1) . Usually when fetching the current user’s data, instead of passing your user id as an argument, you'd put your auth token in the Authorization header, and the server would authenticate you. This is done automatically for you by frameworks like Meteor . Our query should return the below JSON data. 
The data mirrors the query format, with values filled in, sometimes with arrays of objects: { ""data"": { ""user"": { ""firstName"": ""Maurine"", ""lastName"": ""Rau"", ""photo"": ""http://placekitten.com/200/139"", ""mentions"": [ { ""text"": ""Maurine Rau Eligendi in deserunt."", ""author"": { ""firstName"": ""Maurine"", ""lastName"": ""Rau"", ""photo"": ""http://placekitten.com/200/139"" }, ""city"": ""San Francisco"", ""views"": 82, ""created"": 1481757217713 } ] }, ""publicFeed"": [ { ""text"": ""Corporis qui impedit cupiditate rerum magnam nisi velit aliquam."", ""author"": { ""firstName"": ""Tia"", ""lastName"": ""Berge"", ""photo"": ""http://placekitten.com/200/139"" }, ""city"": ""New York"", ""views"": 91, ""created"": 1481757215183 }, ... ], ""cityFeed"": [ { ""text"": ""Edmond Jones Harum ullam pariatur quos est quod."", ""author"": { ""firstName"": ""Edmond"", ""lastName"": ""Jones"", ""photo"": ""http://placekitten.com/200/139"" }, ""city"": ""Mountain View"", ""views"": 69, ""created"": 1481757216723 }, ... ] } } Now let's write the simple GraphQL server that will return that data! THE SCHEMA The first thing your server needs is a schema. This is what the server will use to provide type safety and power the introspection and improved error messages. Since we've already written out what we'd like our queries to look like, this will be easy - we just need to list out the fields and their types. First, under type Query , we list the possible queries (top-level attributes in our query string ): type Query { user(id: Int!): User # A feed of the most recent tweets worldwide publicFeed: [Tweet] # A feed of the most recent tweets in your city cityFeed: [Tweet] } code Each query is followed by the type that is returned. Besides the basic types ( String , Int , Float , Boolean ), you can make your own types, which start with a capital letter. So the first line reads, ""One possible query is the user query, which takes one required argument (the exclamation point in Int! means required) named id of type Int and which returns something of type User ."" The last line reads, ""One possible query is the cityFeed query, which has no arguments and returns an array of Tweet s."" The # comments are descriptions , which show up in the GraphiQL IDE described later. Now to define the User and Tweet types, we'll list the fields we chose in our query string : type User { firstName: String lastName: String photo: String mentions: [Tweet] } type Tweet { text: String author: User city: String views: Int created: Float } code That's our schema! The schema goes into a string: // data/schema.js const schema = ` type User { ... type Tweet { ... type Query { ... schema { query: Query } `; export default schema; data/schema.js SERVER SETUP The reference implementation of the GraphQL specification is GraphQL-JS , and it's used by graphql-server-express , a GraphQL middleware for Express , the most popular Node.js web server. 
Here's how we set it up: // server.js: import express from 'express'; import { graphqlExpress, graphiqlExpress } from 'graphql-server-express'; import { makeExecutableSchema } from 'graphql-tools'; import bodyParser from 'body-parser'; import schema from './data/schema'; import resolvers from './data/resolvers'; const graphQLServer = express(); const executableSchema = makeExecutableSchema({ typeDefs: [schema], resolvers, }); graphQLServer.use('/graphql', bodyParser.json(), graphqlExpress({ schema: executableSchema, })); graphQLServer.use('/graphiql', graphiqlExpress({ endpointURL: '/graphql', })); const GRAPHQL_PORT = 8080; graphQLServer.listen(GRAPHQL_PORT, () = server.js * import schema from './data/schema' – the GraphQL schema that we wrote in the last section * import resolvers from './data/resolvers' – an object with our resolve functions, which will do the DB lookups (we'll do this in the next article) * /graphiql – GraphiQL , the IDE for GraphQL. If you visit this URL (for us it’s http://localhost:8080/graphiql in a browser, you'll see the UI shown in the first screenshot . * While you're typing the query string in the left side of the screen, it autocompletes query fields. * When you hit the run button or cmd-return , the response from the server is shown on the right. * There's also a docs sidebar that has automatic documentation of the available queries and data fields. If you’d like to run this server on your computer, first follow the repo’s setup instructions . Now you can start the server by running server.js : nodemon ./server.js --exec babel-node And make queries in GraphiQL: http://localhost:8080/graphiql When you edit the code, the server will restart itself, and you can re-run your query in GraphiQL. Reload the page in order to get the docs and autocompletion to update. You can try out the GraphiQL of the finished Twitter clone server (powered by Compose!) here: all-the-databases.graphql.guide/graphql The only differences between the hosted server and the code running on your own computer are environment variables that contain the database connection info that you get when you set up a new Compose database. We now have a working server and schema. The server setup was short, and specifying types for the schema was intuitive, but we haven’t done anything database-specific yet. In Part 2 we’ll write the server code that fetches the right data from SQL, Elasticsearch, MongoDB, Redis, and a REST API.s. Add Compose Articles to your feed reader to get the next part! -------------------------------------------------------------------------------- attribution Hyberbole and a half This article is licensed with CC-BY-NC-SA 4.0 by Compose. Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? 
Head over to Dj Walker-Morgan ’s author page and keep reading.RELATED ARTICLES Oct 11, 2016COMPOSE: NOW AVAILABLE ON IBM BLUEMIX The power of IBM's Bluemix cloud platform is now able to seamlessly harness Compose's databases, making Compose-configured Mo… Dj Walker-Morgan Sep 28, 2016POWERING SOCIAL FEEDS AND TIMELINES WITH ELASTICSEARCH Evolving from MongoDB and Redis to Elasticsearch, Campus Discounts' founder and CTO Don Omondi talks about how and why the co… Guest Author Dec 3, 2015CHOOSING THE RIGHT SOLUTION FOR YOU - COMPOSE PB&J If you're new to some of the databases that Compose offers, you might be wondering which ones you should choose for your proj… Lisa Smith Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","Loren Sands-Ramshaw, author of GraphQL: The New REST, shows how to combine data from multiple sources using GraphQL in this Write Stuff two-part series.",Use all the Databases,Live,114 299,"MENU Close * Home Subscribe MenuFINDING THE USER IN DATA SCIENCE 03 June 2016When the IBM Design team began researching data scientists, we had a lot to learn; but what we found was our two disciplines had a lot in common. Without connecting people to data, it’s just a bunch of stuff The Data Science practice is amazing and complex. A solo data scientist has to form a relevant hypothesis, find a corresponding data set, clean it, and repeatedly build and edit a model to prove or disprove their hypothesis. The Data Science Experience grew from our attempts to understand data science as outsiders: as designers wanting to build a tool for data scientists. We were curious how data scientists distill something interesting from inchoate data. This curiosity catapulted us into a months-long research endeavor. We synthesized research conducted in our studios all over the world and had conversations with every data scientist we could find. This included hundreds of interviews, dozens of contextual inquiries, and the production of countless research artifacts. We were astounded by the practice we uncovered, and inspired by its creativity. We came to understand data science as storytelling — an act of cutting away the meaningless, and finding humanity in a series of digits. The data science process is an experiment, the adding and subtracting of elements to find just the right mix. It’s a fluid dance of trial and error, give and take, push and pull. We realized that the tools that data scientists currently use are not designed to support this fluid process of constant refinement—the tools operate in isolation. Data scientists constantly have to navigate away from their workspaces in order to advance and edit their product. This disconnection is where we found our opportunity Finding our principles Current tools only address single facets of data science — which means data scientists must toggle back-and-forth between research and development. Data Shaper is for cleaning data, Jupyter is for modeling, and MatPlotLib is for visualizing. These tools are designed to serve a linear process, but a data scientist’s process is not linear, it’s cyclical. 
Research artifact depicting the cyclical process of data science From this model, our first design principle emerged: A holistic approach to enable data scientists. As we discussed before, much of our research involved contextual inquiries. We watched a data scientist build a pipeline — sourcing assets from the web, comparing his code to others’, and constantly jumping from tool to tool. We loved this part of the research, as it helped us understand that each facet of the process requires unique research. Notes on contextual inquiry during pipeline construction We saw him use dozens of assets of many different types. We watched him organize and name them. At any given point, he needed a tutorial, an academic paper, or a data set to move to the next step in his process, and each of these assets had to be saved and interacted with in a different environment. The process he used to manage his resources helped us establish a tentative system for artifact classification. It was also enlightening to watch him browse for resources. Whether he was scrolling through lists in databases or scanning forums for code, he had criteria for assessing the value of these artifacts. We watched him pull code from several different projects and seek advice on API implementation from a forum. It became obvious that a data science project can’t just stand on its own. It needs support and validation from the community. An artifact, whether code snippet, API, or academic paper, is only as strong as the people who use it. The more an artifact is employed, the more people there are to discuss it. The public use of an artifact sharpens its quality. The value of an asset is determined by the discussion around it — its documentation, its versioning, and its critics. The evolution of data science is fueled by the collaborative processes of building off of each other’s work. This understanding led us to our second, and arguably most inspiring principle, Community first. The community is the strongest tool a data scientist can access. So why hasn’t it been factored in any of their current interfaces? Turning principles into practice We wanted to create an interface that was open and dynamic, just like the modeling process we observed. We determined that our concept must allow the data scientists to converse, learn, and research in the context of their software. We knew our design had to operate as a toolbox that was more dynamic than just a collection of software applications. In addition to providing data scientists with the full scope of software products that they need to complete their process, we need to address their need to validate and advance their work through research. This helped us design one of our first concepts: the maker palette. This feature developed from the idea that the community is a tool — just as important as a notebook or data set. The design treatment is just the same as any other resource--it appears in a panel that can be opened and closed at will. The benefit is that it’s not specific to a file format or tool, so it can be accessed in any part of the interface. A user test with the maker palette In the community palette, a data scientist can find data sets, access papers, view tutorials, and compare their code to others. When they’re uninspired or stuck, the community acts as both peer, tool, and teacher. Mixed content The practice of data science surrounds the building of a pipeline, which is a sequence of algorithms that process and learn from data. 
As we watched data scientists build their pipelines in notebooks, we likened the process to building a wall around a garden, brick by brick. Each brick must be tested to see if it fits the within the bricks that preceded it. These bricks, collected piecemeal throughout the process, slowly enclose the desired pieces of data. The implementation of these bricks requires supplemental materials, like documentation and user testimonials. While these materials will not be included in the pipeline, they need to be viewed in the context of the code. Although they manifest as different file types, these materials are building blocks also, and are just as necessary to the advancement of a project as an actual line of code. The brick building metaphor inspired the form of our design. We translated the modularity of pipeline construction into a card design paradigm for the interface. Having a uniform treatment for a variety of content types allowed us to streamline the search for resources. A key component of our maker palette was the ability to display mixed content in a singular environment. The data scientist can search for any type of asset inside of their workspace, and review and reference it in a singular, cohesive environment. The design of our cards was shaped by repeated user testing. The card-in-panel format gives the data scientist the ability to quickly test a variety of assets in their work. They can make off-the-cuff adjustments without having to make time commitments to deep research or additional tools. They can repeatedly complete the cycles of their work--ask, build, test, refine--in one unified experience. In data scientists, we see ourselves In IBM Design, we often discuss “the loop,” or the practice of continuous refinement of an idea through research and testing. Like the scientific method, we design a hypothesis, develop prototypes, test them, make observations, and adjust. As software designers, we’re constantly trying to find the storyline in “stuff.” Much like data scientists, we sift through the extraneous to find the human elements in products and processes. At the beginning, data science seemed complex and distant, and now, after all our research and a little self-reflection, it seems strangely familiar. Data Science Experience Creation Zoe Padgett and Eytan Davidovits's PictureZOE PADGETT AND EYTAN DAVIDOVITS Read more posts by this author. SHARE THIS POST Twitter Facebook Google+ IBM Data Science Experience Blog © 2016 Proudly published with Ghost","When the IBM Design team began researching data scientists, we had a lot to learn; but what we found was our two disciplines had a lot in common.",Finding the user in data science,Live,115 300,"* Be a better programmer CATEGORIES Toggle navigation * Algorithms * Competitive Programming * Internet of Things * Python * Machine Learning ×WANT A CAREER IN DATA SCIENCE / ANALYTICS ? Drop your email to get latest tutorials, career paths, projects, jobs in machine learning & data science. GET MORE STUFF Subscribe now to get the latest updates from the developer community in your inbox! PRACTICAL TUTORIAL ON RANDOM FOREST AND PARAMETER TUNING IN R Open Modal Open Modal Machine Learning R December 14, 2016 Share 120INTRODUCTION Treat ""forests"" well. Not for the sake of nature, but for solving problems too! Random Forest is one of the most versatile machine learning algorithms available today. With its built-in ensembling capacity, the task of building a decent generalized model (on any dataset) gets much easier. 
However, I've seen people using random forest as a black box model; i.e., they don't understand what's happening beneath the code. They just code. In fact, the easiest part of machine learning is coding . If you are new to machine learning, the random forest algorithm should be on your tips. Its ability to solve—both regression and classification problems along with robustness to correlated features and variable importance plot gives us enough head start to solve various problems. Most often, I've seen people getting confused in bagging and random forest. Do you know the difference? In this article, I'll explain the complete concept of random forest and bagging. For ease of understanding, I've kept the explanation simple yet enriching. I've used MLR, data.table packages to implement bagging, and random forest with parameter tuning in R. Also, you'll learn the techniques I've used to improve model accuracy from ~82% to 86%. TABLE OF CONTENTS 1. What is the Random Forest algorithm? 2. How does it work? (Decision Tree, Random Forest) 3. What is the difference between Bagging and Random Forest? 4. Advantages and Disadvantages of Random Forest 5. Solving a Problem * Parameter Tuning in Random Forest WHAT IS THE RANDOM FOREST ALGORITHM? Random forest is a tree-based algorithm which involves building several trees (decision trees), then combining their output to improve generalization ability of the model. The method of combining trees is known as an ensemble method. Ensembling is nothing but a combination of weak learners (individual trees) to produce a strong learner. Say, you want to watch a movie. But you are uncertain of its reviews. You ask 10 people who have watched the movie. 8 of them said "" the movie is fantastic."" Since the majority is in favor, you decide to watch the movie. This is how we use ensemble techniques in our daily life too. Random Forest can be used to solve regression and classification problems. In regression problems, the dependent variable is continuous. In classification problems, the dependent variable is categorical. Trivia: The random Forest algorithm was created by Leo Brieman and Adele Cutler in 2001. HOW DOES IT WORK? (DECISION TREE, RANDOM FOREST) To understand the working of a random forest, it's crucial that you understand a tree . A tree works in the following way: 1. Given a data frame (n x p), a tree stratifies or partitions the data based on rules (if-else). Yes, a tree creates rules. These rules divide the data set into distinct and non-overlapping regions. These rules are determined by a variable's contribution to the homogenity or pureness of the resultant child nodes (X2,X3). 2. In the image above, the variable X1 resulted in highest homogeneity in child nodes, hence it became the root node. A variable at root node is also seen as the most important variable in the data set. 3, But how is this homogeneity or pureness determined? In other words, how does the tree decide at which variable to split? * In regression trees (where the output is predicted using the mean of observations in the terminal nodes), the splitting decision is based on minimizing RSS. The variable which leads to the greatest possible reduction in RSS is chosen as the root node. The tree splitting takes a top-down greedy approach, also known as recursive binary splitting . We call it ""greedy"" because the algorithm cares to make the best split at the current step rather than saving a split for better results on future nodes. 
* In classification trees (where the output is predicted using mode of observations in the terminal nodes), the splitting decision is based on the following methods: * Gini Index - It's a measure of node purity. If the Gini index takes on a smaller value, it suggests that the node is pure. For a split to take place, the Gini index for a child node should be less than that for the parent node. * Entropy - Entropy is a measure of node impurity. For a binary class (a,b), the formula to calculate it is shown below. Entropy is maximum at p = 0.5. For p(X=a)=0.5 or p(X=b)=0.5 means, a new observation has a 50%-50% chance of getting classified in either classes. The entropy is minimum when the probability is 0 or 1. Entropy = - p(a)*log(p(a)) - p(b)*log(p(b)) In a nutshell, every tree attempts to create rules in such a way that the resultant terminal nodes could be as pure as possible. Higher the purity, lesser the uncertainity to make the decision. But a decision tree suffers from high variance. ""High Variance"" means getting high prediction error on unseen data. We can overcome the variance problem by using more data for training. But since the data set available is limited to us, we can use resampling techniques like bagging and random forest to generate more data. Building many decision trees results in a forest . A random forest works the following way: 1. First, it uses the Bagging (Bootstrap Aggregating) algorithm to create random samples. Given a data set D1 (n rows and p columns), it creates a new dataset (D2) by sampling n cases at random with replacement from the original data. About 1/3 of the rows from D1 are left out, known as Out of Bag(OOB) samples. 2. Then, the model trains on D2. OOB sample is used to determine unbiased estimate of the error. 3. Out of p columns, P << p columns are selected at each node in the data set. The P columns are selected at random. Usually, the default choice of P is p/3 for regression tree and P is sqrt(p) for classification tree. 4. Unlike a tree, no pruning takes place in random forest; i.e, each tree is grown fully. In decision trees, pruning is a method to avoid overfitting. Pruning means selecting a subtree that leads to the lowest test errror rate. We can use cross validation to determine the test error rate of a subtree. 5. Several trees are grown and the final prediction is obtained by averaging or voting. Each tree is grown on a different sample of original data. Since random forest has the feature to calculate OOB error internally, cross validation doesn't make much sense in random forest. WHAT IS THE DIFFERENCE BETWEEN BAGGING AND RANDOM FOREST? Many a time, we fail to ascertain that bagging is not same as random forest. To understand the difference, let's see how bagging works: 1. It creates randomized samples of the data set (just like random forest) and grows trees on a different sample of the original data. The remaining 1/3 of the sample is used to estimate unbiased OOB error. 2. It considers all the features at a node (for splitting). 3. Once the trees are fully grown, it uses averaging or voting to combine the resultant predictions. Aren't you thinking, ""If both the algorithms do same thing, what is the need for random forest? Couldn't we have accomplished our task with bagging?"" NO! The need for random forest surfaced after discovering that the bagging algorithm results in correlated trees when faced with a data set having strong predictors. 
Unfortunately, averaging several highly correlated trees doesn't lead to a large reduction in variance. But how do correlated trees emerge? Good question! Let's say a data set has a very strong predictor , along with other moderately strong predictors. In bagging, a tree grown every time would consider the very strong predictor at its root node, thereby resulting in trees similar to each other. The main difference between random forest and bagging is that random forest considers only a subset of predictors at a split. This results in trees with different predictors at top split, thereby resulting in decorrelated trees and more reliable average output. That's why we say random forest is robust to correlated predictors. ADVANTAGES AND DISADVANTAGES OF RANDOM FOREST Advantages are as follows: 1. It is robust to correlated predictors. 2. It is used to solve both regression and classification problems. 3. It can be also used to solve unsupervised ML problems. 4. It can handle thousands of input variables without variable selection. 5. It can be used as a feature selection tool using its variable importance plot. 6. It takes care of missing data internally in an effective manner. Disadvantages are as follows: 1. The Random Forest model is difficult to interpret. 2. It tends to return erratic predictions for observations out of range of training data. For example, the training data contains two variable x and y. The range of x variable is 30 to 70. If the test data has x = 200, random forest would give an unreliable prediction. 3. It can take longer than expected time to computer a large number of trees. SOLVING A PROBLEM (PARAMETER TUNING) Let's take a data set to compare the performance of bagging and random forest algorithms. Along the way, I'll also explain important parameters used for parameter tuning. In R, we'll use MLR and data.table package to do this analysis. I've taken the Adult dataset from the UCI machine learning repository. You can download the data from here . This data set presents a binary classification problem to solve. Given a set of features, we need to predict if a person's salary is <=50K or >=50k. Since the given data isn't well structured, we'll need to make some modification while reading the data set. #set working directory > path <- ""~/December 2016/RF_Tutorial"" > setwd(path) #load libraries > library(data.table) > library(mlr) > library(h2o) #set variable names setcol <- c(""age"", ""workclass"", ""fnlwgt"", ""education"", ""education-num"", ""marital-status"", ""occupation"", ""relationship"", ""race"", ""sex"", ""capital-gain"", ""capital-loss"", ""hours-per-week"", ""native-country"", ""target"") #load data > train <- read.table(""adultdata.txt"",header = F,sep = "","",col.names = setcol,na.strings = c("" ?""),stringsAsFactors = F) > test <- read.table(""adulttest.txt"",header = F,sep = "","",col.names = setcol,skip = 1, na.strings = c("" ?""),stringsAsFactors = F) After we've loaded the data set, first we'll set the data class to data.table. data.table is the most powerful R package made for faster data manipulation. > setDT(train) > setDT(test) Now, we'll quickly look at given variables, data dimensions, etc. > dim(train) > dim(test) > str(train) > str(test) As seen from the output above, we can derive the following insights: 1. The train data set has 32,561 rows and 15 columns. 2. The test data has 16,281 rows and 15 columns. 3. Variable target is the dependent variable. 4. The target variable in train and test data is different. We'll need to match them. 
5. All character variables have a leading whitespace which can be removed. We can check missing values using: #check missing values > table(is.na(train)) FALSE TRUE 484153 4262 > sapply(train, function(x) sum(is.na(x))/length(x))*100 > table(is.na(test)) FALSE TRUE 242012 2203 > sapply(test, function(x) sum(is.na(x))/length(x))*100 As seen above, both train and test datasets have missing values. The sapply function is quite handy when it comes to performing column computations. Above, it returns the percentage of missing values per column. Now, we'll preprocess the data to prepare it for training. In R, random forest internally takes care of missing values using mean/ mode imputation. Practically speaking, sometimes it takes longer than expected for the model to run. Therefore, in order to avoid waiting time, let's impute the missing values using median / mode imputation method; i.e., missing values in the integer variable will be imputed with median and factor variables will be imputed with mode (most frequent value). We'll use the impute function from MLR package, which is enabled with several unique methods for missing value imputation: > imp1 <- impute(data = train,target = ""target"",classes = list(integer=imputeMedian(), factor=imputeMode())) > imp2 <- impute(data = test,target = ""target"",classes = list(integer=imputeMedian(), factor=imputeMode())) > train <- imp1$data > test <- imp2$data Being a binary classification problem, you are always advised to check if the data is imbalanced or not. We can do it in the following way: > setDT(train)[,.N/nrow(train),target] target V1 1: <=50K 0.7591904 2: >50K 0.2408096 > setDT(test)[,.N/nrow(test),target] target V1 1: <=50K. 0.7637737 2: >50K. 0.2362263 If you observe carefully, the value of the target variable is different in test and train. For now, we can consider it a typo error and correct all the test values. Also, we see that 75% of people in train data have income <=50K. Imbalanced classification problems are known to be more skewed with a binary class distribution of 90% to 10%. Now, let's proceed and clean the target column in test data. > test[,target := substr(target,start = 1,stop = nchar(target)-1)] We've used the substr function to return the subtring from a specified start and end position. Next, we'll remove the leading whitespaces from all character variables. We'll use str_trim function from stringr package. > library(stringr) > char_col <- colnames(train)[sapply(train,is.character)] > for(i in char_col) set(train,j=i,value = str_trim(train[[i]],side = ""left"")) Using sapply function, we've extracted the column names which have character class. Then, using a simple for - set loop we traversed all those columns and applied the str_trim function. Before we start model training, we should convert all character variables to factor. MLR package treats character class as unknown. > fact_col <- colnames(train)[sapply(train,is.character)] >for(i in fact_col) set(train,j=i,value = factor(train[[i]])) >for(i in fact_col) set(test,j=i,value = factor(test[[i]])) Let's start with modeling now. MLR package has its own function to convert data into a task, build learners, and optimize learning algorithms. I suggest you stick to the modeling structure described below for using MLR on any data set. 
#create a task > traintask <- makeClassifTask(data = train,target = ""target"") > testtask <- makeClassifTask(data = test,target = ""target"") #create learner > bag <- makeLearner(""classif.rpart"",predict.type = ""response"") > bag.lrn <- makeBaggingWrapper(learner = bag,bw.iters = 100,bw.replace = TRUE) I've set up the bagging algorithm which will grow 100 trees on randomized samples of data with replacement. To check the performance, let's set up a validation strategy too: #set 5 fold cross validation > rdesc <- makeResampleDesc(""CV"",iters=5L) For faster computation, we'll use parallel computation backend. Make sure your machine / laptop doesn't have many programs running at backend. #set parallel backend (Windows) > library(parallelMap) > library(parallel) > parallelStartSocket(cpus = detectCores()) For linux users, the function parallelStartMulticore(cpus = detectCores()) will activate parallel backend. I've used all the cores here. r <- resample(learner = bag.lrn ,task = traintask ,resampling = rdesc ,measures = list(tpr,fpr,fnr,fpr,acc) ,show.info = T) #[Resample] Result: # tpr.test.mean=0.95, # fnr.test.mean=0.0505, # fpr.test.mean=0.487, # acc.test.mean=0.845 Being a binary classification problem, I've used the components of confusion matrix to check the model's accuracy. With 100 trees, bagging has returned an accuracy of 84.5%, which is way better than the baseline accuracy of 75%. Let's now check the performance of random forest. #make randomForest learner > rf.lrn <- makeLearner(""classif.randomForest"") > rf.lrn$par.vals <- list(ntree = 100L, importance=TRUE) ) > r <- resample(learner = rf.lrn ,task = traintask ,resampling = rdesc ,measures = list(tpr,fpr,fnr,fpr,acc) ,show.info = T) # Result: # tpr.test.mean=0.996, # fpr.test.mean=0.72, # fnr.test.mean=0.0034, # acc.test.mean=0.825 On this data set, random forest performs worse than bagging. Both used 100 trees and random forest returns an overall accuracy of 82.5 %. An apparent reason being that this algorithm is messing up classifying the negative class. As you can see, it classified 99.6% of the positive classes correctly, which is way better than the bagging algorithm. But it incorrectly classified 72% of the negative classes. Internally, random forest uses a cutoff of 0.5; i.e., if a particular unseen observation has a probability higher than 0.5, it will be classified as <=50K. In random forest, we have the option to customize the internal cutoff. As the false positive rate is very high now, we'll increase the cutoff for positive classes (<=50K) and accordingly reduce it for negative classes (>=50K). Then, train the model again. #set cutoff > rf.lrn$par.vals <- list(ntree = 100L, importance=TRUE, cutoff = c(0.75,0.25)) > r <- resample(learner = rf.lrn ,task = traintask ,resampling = rdesc ,measures = list(tpr,fpr,fnr,fpr,acc) ,show.info = T) #Result: tpr.test.mean=0.934, # fpr.test.mean=0.43, # fnr.test.mean=0.0662, # acc.test.mean=0.846 As you can see, we've improved the accuracy of the random forest model by 2%, which is slightly higher than that for the bagging model. Now, let's try and make this model better. Parameter Tuning: Mainly, there are three parameters in the random forest algorithm which you should look at (for tuning): * ntree - As the name suggests, the number of trees to grow. Larger the tree, it will be more computationally expensive to build models. * mtry - It refers to how many variables we should select at a node split. 
Also as mentioned above, the default value is p/3 for regression and sqrt(p) for classification. We should always try to avoid using smaller values of mtry to avoid overfitting. * nodesize - It refers to how many observations we want in the terminal nodes. This parameter is directly related to tree depth. Higher the number, lower the tree depth. With lower tree depth, the tree might even fail to recognize useful signals from the data. Let get to the playground and try to improve our model's accuracy further. In MLR package, you can list all tuning parameters a model can support using: > getParamSet(rf.lrn) #set parameter space params <- makeParamSet( makeIntegerParam(""mtry"",lower = 2,upper = 10), makeIntegerParam(""nodesize"",lower = 10,upper = 50) ) #set validation strategy rdesc <- makeResampleDesc(""CV"",iters=5L) #set optimization technique ctrl <- makeTuneControlRandom(maxit = 5L) #start tuning > tune <- tuneParams(learner = rf.lrn ,task = traintask ,resampling = rdesc ,measures = list(acc) ,par.set = params ,control = ctrl ,show.info = T) [Tune] Result: mtry=2; nodesize=23 : acc.test.mean=0.858 After tuning, we have achieved an overall accuracy of 85.8%, which is better than our previous random forest model. This way you can tweak your model and improve its accuracy. I'll leave you here. The complete code for this analysis can be downloaded from Github . SUMMARY Don't stop here! There is still a huge scope for improvement in this model. Cross validation accuracy is generally more optimistic than true test accuracy. To make a prediction on the test set, minimal data preprocessing on categorical variables is required. Do it and share your results in the comments below. My motive to create this tutorial is to get you started using the random forest model and some techniques to improve model accuracy. For better understanding, I suggest you read more on confusion matrix. In this article, I've explained the working of decision trees, random forest, and bagging. Did I miss out anything? Do share your knowledge and let me know your experience while solving classification problems in comments below. Share 120ABOUT THE AUTHOR Manish Saraswat * * Making an effort to help people understand Machine Learning. I believe your educational background doesn't stop you to pursue ML & Data Science. Earned Masters in F/M, a self taught data science professional. Previously worked at Analytics Vidhya. Now solving ML & Growth challenges at HackerEarth!AUTHOR POST Machine Learning R DEEP LEARNING & PARAMETER TUNING WITH MXNET, H2O PACKAGE IN R Jan 30, 2017 Machine Learning R PRACTICAL GUIDE TO CLUSTERING ALGORITHMS & EVALUATION IN R Jan 19, 2017 Machine Learning R HOW CAN R USERS LEARN PYTHON FOR DATA SCIENCE ? Jan 12, 2017 Machine Learning R PRACTICAL GUIDE TO LOGISTIC REGRESSION ANALYSIS IN R Jan 5, 2017 Machine Learning R EXCLUSIVE SQL TUTORIAL ON DATA ANALYSIS IN R Dec 28, 2016 Machine Learning R BEGINNERS TUTORIAL ON XGBOOST AND PARAMETER TUNING IN R Dec 20, 2016 Machine Learning R DEEP LEARNING & PARAMETER TUNING WITH MXNET, H2O PACKAGE IN R Jan 30, 2017 Machine Learning R PRACTICAL GUIDE TO CLUSTERING ALGORITHMS & EVALUATION IN R Jan 19, 2017 Machine Learning R HOW CAN R USERS LEARN PYTHON FOR DATA SCIENCE ? 
Jan 12, 2017 Machine Learning R PRACTICAL GUIDE TO LOGISTIC REGRESSION ANALYSIS IN R Jan 5, 2017 Machine Learning R EXCLUSIVE SQL TUTORIAL ON DATA ANALYSIS IN R Dec 28, 2016 Machine Learning R BEGINNERS TUTORIAL ON XGBOOST AND PARAMETER TUNING IN R Dec 20, 2016 Please enable JavaScript to view the comments powered by Disqus. x LIVE WEBINAR Machine Learning in a Live Production Environment Register NowABOUT US * Blog * Engineering Blog * Updates & Releases * Team * Careers * In the Press TOP CATEGORIES * Hiring * Placements * Hackathons * Community * Competitive Programming * Culture RESOURCES * Webinars * Podcasts * CodeTable * Hackathon Handbook * Complete Reference to Competitive Programming * How to get started with Open Source FOR COMPANIES * Recruit * Assessment * Sourcing * Host Hackathons * Interview © 2017 HackerEarth Share 120","In this tutorial, the complete concept of random forest and bagging is explained.",Practical Tutorial on Random Forest and Parameter Tuning in R,Live,116 301,"* Home * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ SPARK.TC ☰ * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ APACHE SPARK™ 2.0: MIGRATING APPLICATIONS Many excellent fixes, enhancements and new features are available in Apache Spark TM 2.0 as highlighted in What's New in Apache Spark TM 2.0 . High-level descriptions for migrating applications to Apache Spark TM 2.0 can be found at Apache Spark TM SQL Programming Guide and Apache Spark TM MLlib Guide . This post provides a brief summary of sample code changes to migrate a Java application from Apache Spark TM 1.6 to Apache Spark TM 2.0. The migration effort is dependent upon the Apache Spark TM APIs a given application uses. Note a few breaking API changes introduced in 2.0 release can result in compilation errors for an application compatible with previous releases. The most common compilation errors when initially updating a Java application for 2.0 release are as follows. * DataFrame cannot be resolved to a type. The import org.apache.spark.sql.DataFrame cannot be resolved. * The methods fit, transform, train must override or implement a supertype method * The return type is incompatible with PairFlatMapFunction .call(Iterator<... ). Resolving each one is straightforward by applying a group of code changes as follows. Replace DataFrame variable declarations and references with Dataset< Row .For Java applications, the type org.apache.spark.sql.DataFrame type no longer exists because in Scala it has been redefined as a type alias for Dataset[Row]. So in general, for each Java class that uses DataFrame, apply the following pattern. Replace: import org.apache.spark.sql.DataFrame; With: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; Change: DataFrame df; To: Dataset df) transform(Dataset df) train(Dataset df) This change is related to SPARK-14500 Accept Dataset[] instead of DataFrame in MLlib APIs . In Scala MLlib APIs, DataFrame was replaced by Dataset[_]. For Java, this requires using Dataset< ? instead. Replace Iterable< with Iterator< for classes implementing PairFlatMapFunction.If a Java class implements PairFlatMapFunction (or other variations of FlatMapFunction), compiling against 2.0 API reports an error like the following: The return type is incompatible with PairFlatMapFunction>.call(Iterator<...>). 
To resolve, change the declared return type from Iterable to Iterator in the call() method override and import java.util.Iterator. In addition, modify the return value to return an iterator() of the collection instead of the collection itself. Below is a partial code fragment to illustrate what to modify for a class that implements FlatMapFunction and corresponding call() method. Change: public class CustomFlatMapFunction implements FlatMapFunction>, String> { @Override public Iterable call(Tuple2> arg0) throws Exception { ArrayList = new ArrayList>, String> { @Override public Iterator call(Tuple2> arg0) throws Exception { ArrayList = new ArrayList> <> which looks something like this: ./spark-submit.sh --vcap ./vcaps.json --deploy-mode cluster --class com.ibm.cds.spark.samples.HelloSpark --master https://169.54.219.20:8443 ./helloSpark-assembly-2.1.jar When the script finishes, it displays the location of the log file where you can find more information about your job. Done downloading from workdir/driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2/stderr to stderr_1460135564N Log file can be found at spark-submit_1460135564N.log 3. To access the driver machine logs for your application, look at the end in the spark-submit_XXXX.log and locate the curl command that lets you download the stdout:curl -D ""stdout_1460140266N.header"" -v -X GET --insecure -u sd73-de3b55cc941e55-4137fa4057f6:c8d92cd6-d13d-435e-b7a9-a4a7b96c0b79 -H ""X-Spark-service-instance-id: 3fd28f1b-cedc-4b50-bd73-de3b55cc941e"" https://169.54.219.20/tenant/data/workdir/driver-20160408133059-0003-ec49b480-f62a-4b76-b5b7-eb03b095dd0d/stdout Run it, and this command should return the following results: Hello Spark Demo. Compute the mean and variance of a collection Results: Mean: 250000.0 Variance: 2.083325E10 Note: Easy access to log messages from different Spark executors, called Spark History , is coming soon. When it’s available, I’ll write a follow-up describing it indetail. Stay tuned. In the meantime, this tutorial covers a quick way to check status or cancel a job .SPARK-SUBMIT JOB USING PYTHONTo submit a job using Python, follow the same pattern as in Scala except thatyou’re using a py script instead of a jar: 1. Create a py script called helloSpark.py (or download it from here ) as follows:import sys from pyspark import SparkContext def computeStatsForCollection(sc,countPerPartitions=100000,partitions=5): totalNumber = min( countPerPartitions * partitions, sys.maxsize) rdd = sc.parallelize( range(totalNumber),partitions) return (rdd.mean(), rdd.variance()) if __name__ == ""__main__"": sc = SparkContext(appName=""Hello Spark"") print(""Hello Spark Demo. Compute the mean and variance of a collection"") stats = computeStatsForCollection(sc); print("" Results: "") print(""Mean: "" + str(stats[0])); print(""Variance: "" + str(stats[1])); sc.stop() 2. Invoke the spark-submit.sh script as follows:./spark-submit.sh --vcap ./vcaps.json --deploy-mode cluster --master https://169.54.219.20:8443 <> GET STATUS AND CANCEL A LONG-RUNNING JOBWhen you’re dealing with long-running jobs, you may want to query the status.Also, sometimes you can’t wait for a long Spark job to complete and want to killthe job before it finishes. You can handle both of these tasks using thespark-submit.sh script. 
(The following steps work whether you used Scala orPython to launch the job.)Open the log file and locate the Submission ID, which looks something like this:Submission ID : driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2and use the value in the following commands: * To get a job status:./spark-submit.sh --vcap ./vcaps.json --master https://169.54.219.20:8443 --status driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2 * To kill the job:./spark-submit.sh --vcap ./vcaps.json --master https://169.54.219.20:8443 --kill driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2 CONCLUSIONIn this tutorial, you learned how to use spark-submit.sh to run a Spark batchjob using Scala and Python. We also looked at how to monitor the job, check thestatus, and kill the job. You can find more information on spark-submitfunctionality here .Stay tuned for an upcoming post on how to use Spark History which will provide anice UI with an aggregated view of all log messages produced by each executor inthe cluster.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",How to run Spark batch jobs programmatically. See examples in both Scala and Python that launch a Hello World Spark job via spark-submit.,Launch a Spark job using spark-submit,Live,119 306,"Homepage Follow Sign in / Sign up John Thomas Blocked Unblock Follow Following IBM Distinguished Engineer. #Cloud, #Analytics, #Cognitive, #zSystems, #ITEconomics. Chess, Food, Travel (60+ countries). Tweets are personal opinions. Jun 12 -------------------------------------------------------------------------------- MACHINE LEARNING & APACHE SPARK: A DYNAMIC DUO The Machine Learning revolution is underway and is changing industries and delivering outcomes that were unimaginable a few years ago. In this video of John J. Thomas’s keynote at ApacheCon on May 17, 2017, learn how Apache Spark and other related projects are being used by innovative companies to remake products and services and enabling data-driven decision making. For more information, visit the Data Science Experience . Video courtesy of the The Linux Foundation ( https://www.linuxfoundation.org/ ) via YouTube. * Apache Spark * Machine Learning Blocked Unblock Follow FollowingJOHN THOMAS IBM Distinguished Engineer. #Cloud, #Analytics, #Cognitive, #zSystems, #ITEconomics. Chess, Food, Travel (60+ countries). Tweets are personal opinions. FollowINSIDE MACHINE LEARNING Deep-dive articles about machine learning and data. Curated by IBM Analytics. 
* Share * * * * Never miss a story from Inside Machine learning , when you sign up for Medium. Learn more Never miss a story from Inside Machine learning Get updates Get updates",The Machine Learning revolution is underway and is changing industries and delivering outcomes that were unimaginable a few years ago. In this video of John J. Thomas’s keynote at ApacheCon on May 17…,A Dynamic Duo – Inside Machine learning – Medium,Live,120 307,"PICKING SQL OR NOSQL? – A COMPOSE VIEW Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jun 20, 2016Here's a question we hear a lot - Should I use SQL databases or NoSQL databases? It's a question that gets asked because often underlying it is another question - What's broken in SQL databases that NoSQL databases fixes? The answer to that one is much easier. Nothing is broken because they are different approaches to creating databases in the same way that assembler and higher level languages are to creating applications. Think of a typical high level language. It abstracts away all the ideas of machine code – of scheduling, memory management, interrupts, processor stacks and and buffers – into a different intellectual framework that is the language. You write a program in the language and a compiler or interpreter steps in and turns your code into digestible chunks of machine code (or intermediate code) to be run on some actual hardware. You don't care about that though, all you care is that your code can go into any machine and right things happen. You can think of this as akin to SQL; you write your high level query which is generally portable between different SQL databases and the database's internal compiler or interpreter turns it into executable operations which it can then run to give the results you are expecting. There's a whole query engine in your database that looks for the optimal way to turn your SQL query into the optimal set of operations to get your results. You usually only care about what it's doing when your queries aren't running as fast as you'd hope, in the same way that you only care about your compiler when it generates slow code for your application. Now think of assembler. Assembler is unique to the processor family it runs on. These are the smallest operations the processor will let you program it with and they all run as fast as the processor can. They do exactly what they say and no more. High level language compilers convert programs into assembler (eventually) so they can be run, but writing in bare assembler can be even more efficient as long as you can take into account all the internal ""moving parts"" of the processor. The downside is that you can't move your assembler code to a different processor family. And now think of NoSQL like that. The query engine and low level operations of a database exposed through an API to give you a more intimate control of your database operations. For databases that's something like find a record by a key, update a record with that key, construct a query from a chain of operands. These small operations can be combined by applications to create powerful applications. NoSQL emerged in a world of SQL not to replace it but to allow people to experiment with new ways of working with databases and optimising databases to particular tasks. 
The same deal with assembler applies with NoSQL; you get direct control of the underlying system, you have to worry about managing that system a lot more - selecting indexes, creating reliable operations which don't crash into each other, making sure you aren't locking out other operations - these are things you will, at any scale, have to think about at some point. The good news is that NoSQL databases have matured so the underlying mechanisms are more resilient and reliable to these issues. NoSQL databases have also focused on particular data types or arrangements - JSON document, columnar storage, graphs - and on different architectures - in-memory, sharded, distributed, replicated - to create databases which are very powerful for particular use cases. SQL is a language of general purpose utility. It sets out with a relational, table centric structure and you rely on the database to make optimal decisions in interpreting your intent and coming up with the best path to get your results. Because of that SQL also shaped how the underlying databases operated and how they developed over time. To jump ship to another analogy temporarily, NoSQL is like RISC processors were to the CISC processors in the 80s and 90s. RISC processors gave chip designers a whole new way to approach problems of scale and moved the task of building optimised instruction pipelines up to the compilers used to create code for the RISC chips. Some even went as far as turning CISC instructions into RISC instructions on the fly. The two approaches often found themselves facing off over performance. Where are we now? The lessons learnt from RISC processors are embedded in CISC designs while a new class of more complex RISC chip is to be found optimized for power consumption in a billion devices - the biggest niche ever. Here's the cool part. Those billion RISC devices interoperate with all the CISC and other RISC devices out there over the internet and through the millions of servers in the cloud. It's not an either/or choice. It's a best-for-the-task selections. When you go out and buy a computer in 2016, you pick it for suitability for a task, not whether it has a RISC or CISC design philosophy at the heart of its CPU. In the same way, when picking a database, or databases, for a task you should select for suitability for that task. Which brings us back to the assembler/higher level language analogy in this analogy inception. What this analogy offers us is a simple rule of thumb for thinking about how SQL and NoSQL impact on that decision of suitability. A NoSQL database will tend to be optimized for a class of problems and it's important to understand what those problems are. SQL can always be, at least in theory, compiled into the operations of a NoSQL database and there are tools out there which will do this for you. You'll usually find them in the Business Intelligence and Analytics aisle. Some NoSQL databases are internalising the same ideas to offer subsets of SQL too, raising the bar on NoSQL's assembler-ness in this analogy to something closer to the capabilities of a high level language. Opt for NoSQL and you get handed the keys to the database, along with a specially selected set of components and the freedom to assemble them how you wish. Opt for SQL, you'll get access to an often feature rich semi-autonomous car which will take you from A to B efficiently every day. 
As a developer you'd never say ""I'll just use assembler for all my apps"" or ""I'll use only this high level languages""; you would keep all options open. The best solution? Opt for whatever is best for your task; not just one but as many as you need. If your application stack needs an in-memory database or messaging bus binding together applications using a document database for client facing applications, a database for backend analytics and a JSON document search database, then thats the architecture you should go for. That and the ability to deploy production grade versions of all those databases whenever you need them. Image by Davide Ragusa Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose",Here's a question we hear a lot - Should I use SQL databases or NoSQL databases? It's a question that gets asked because often underlying it is another question - What's broken in SQL databases that NoSQL databases fixes? The answer to that one is much easier: Nothing.,Picking SQL or NoSQL? – A Compose View,Live,121 309,"Homepage IBM Watson Data Lab Follow Sign in / Sign up Homepage * Home * Cognitive Computing * Data Science * Web Dev * Mark Watson Blocked Unblock Follow Following Developer Advocate, IBM Watson Data Platform Oct 18 -------------------------------------------------------------------------------- WATSON MACHINE LEARNING FOR DEVELOPERS UNDERSTANDING THE BASIC PROBLEMS AND WORKFLOW (PART 1) I am not a Data Scientist , but I am a developer interested in data science and machine learning. I hope you are here because you are as well! This is the first installment in a series of posts aimed at introducing developers like me and you to the basic machine learning concepts and tools required to get an ML system up and running. I will not be spending a lot of time talking about how to clean and analyze data, or the finer points of how machine learning works, but I will introduce you to fundamental concepts that you will need to get your first system up and running. Let’s start by understanding when and why you would use machine learning. We’ll eventually use the Watson ML service to deploy our model, but the problems and workflow I describe here apply broadly to machine learning.PREDICTIONS The ultimate goal of a machine learning system is to make a prediction. Here are some examples you may be familiar with: 1. Predict whether an image is a cat or dog 2. Predict the value of a home 3. Predict which products to recommend to a user 4. Predict which users share the same interests 5. Predict when to turn, accelerate, or apply the brakes in a self-driving car Machine learning is all about predictions. If you have a use case where you need to make predictions (and a lot of data), machine learning may be a good fit. How do ML systems make predictions? It all starts with the data. 
ML libraries and platforms can make predictions by analyzing massive amounts of data and finding patterns or mathematical formulas that "explain" the data. The data is the most crucial component of a successful ML system. You need to have a lot of it, and it has to be good. Bad data in = bad predictions out. Let's go a tad deeper to get a better understanding of how machine learning works.

DATA

It can't be said enough. It all starts with the data, and it has to be good data. Let's start with a simple, well-known machine learning example: predicting house prices. Let's say we have a data set of known houses and their associated prices:

Square Feet   # Bedrooms   Color   Price
-----------   ----------   -----   --------
2,100         3            White   $100,000
2,300         4            White   $125,000
2,500         4            Brown   $150,000

Obviously this is not a lot of data, and not good data, but ignore that for now. Our goal is to build a machine learning system to predict house prices using this data set. Predicting the price of a house is a supervised machine learning problem — that means we know the outcome for a subset of use cases (i.e., we know what the prices are for the houses listed above), and we can use those outcomes to train an ML system to predict outcomes for new use cases (i.e., predict the price for a house that is not in the list). An unsupervised ML problem is one where the system learns from the data, rather than being trained by the data. We'll cover unsupervised learning in a future post.

Specifically, this is a regression problem. A regression problem is one in which you want to predict a real number, like the price of a house. We will also cover binary and multiclass classification (when you want to predict a class or category from a predefined list of values) and clustering (when you want to group data that is similar).

When we build a supervised ML model, we need to specify which variables we want to use to make our predictions. These variables are referred to as features. We know that when a house is 2,100 square feet, has 3 bedrooms, and is the color white, then the price is $100,000. In this example, color is not important to predicting the price of a home, but you could reason that both square footage and the number of bedrooms are. So, it makes sense that we choose Square Feet and # Bedrooms as our features. The value we want to predict is the Price. This is referred to as our label. We'll use the features in our data set to build a model that can predict the label (Price). That process looks a little like this (a toy sketch follows the list):

1. Choose an ML algorithm. We'll cover some of the common algorithms used in machine learning.
2. Instruct our ML algorithm to use Square Feet and # Bedrooms as our features and Price as our label (the value we want to predict).
3. Feed the data set to our ML algorithm to train an ML model that can make predictions. The algorithm will use the data set that you feed it to come up with a mathematical formula for predicting new outcomes.
4. To predict a price, we feed our ML model a set of features (square footage and number of bedrooms) and in response receive a predicted price.
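To make those four steps concrete, here is a minimal sketch of the idea using Spark ML's linear regression from Python (PySpark). This is not the article's actual code (part two builds the real model); the column names and the 2,400-square-foot example house are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("house-prices").getOrCreate()

# The toy data set from the table above: two feature columns and a label column
houses = spark.createDataFrame(
    [(2100, 3, 100000.0), (2300, 4, 125000.0), (2500, 4, 150000.0)],
    ["SquareFeet", "Bedrooms", "Price"])

# Step 2: tell the algorithm which columns are the features
assembler = VectorAssembler(inputCols=["SquareFeet", "Bedrooms"], outputCol="features")
train_data = assembler.transform(houses)

# Steps 1 and 3: choose an algorithm and train a model, using Price as the label
lr = LinearRegression(featuresCol="features", labelCol="Price")
model = lr.fit(train_data)

# Step 4: feed the model a new set of features and get back a predicted price
new_house = assembler.transform(
    spark.createDataFrame([(2400, 4)], ["SquareFeet", "Bedrooms"]))
model.transform(new_house).select("SquareFeet", "Bedrooms", "prediction").show()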
Now that we have data, and I've outlined the general steps for getting from the data to a prediction, let's see what tools can help us get there.

TOOLS

We'll focus on the tools provided by the IBM Data Science Experience (DSX). Many of the tools are open source and can be run locally or on other platforms, and the general concepts should apply to other hosted machine learning offerings.

Jupyter Notebooks: Notebooks are used by data scientists to clean, visualize, and understand data. DSX uses Jupyter Notebooks, but notebooks come in different flavors. In DSX you code your notebooks in Python or Scala.

Apache Spark™: Spark is a cluster computing platform for analyzing massive amounts of data in-memory. For machine learning to be effective, you need lots of data, so it only makes sense that you have a platform like Spark to help.

Apache Spark ML: Spark ML is a library for building ML pipelines on top of Apache Spark. Spark ML includes algorithms and APIs for supervised and unsupervised machine learning problems.

IBM Watson ML: Watson ML is a service for deploying ML models and making predictions at runtime. Watson ML provides a REST API to your ML models which can be called directly from your application or your middleware.

Let's see how all these tools work together.

WORKFLOW

Here is the typical path I take when building and hosting a machine learning model (a sketch of the final scoring step follows the list):

1. Identify a prediction you want to make and the data set that can help you make it.
2. Create a Jupyter Notebook and import, clean, and analyze the data.
3. Use Apache Spark ML to build and test a machine learning model.
4. Deploy the model to Watson ML.
5. Call the Watson ML scoring endpoint (REST API) to make predictions from a client application or backend service.

This path works for supervised and unsupervised machine learning, and I'll use it to show you how you can solve regression, classification, and clustering ML problems.
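To give a flavor of step 5, here is a rough sketch of what calling a deployed model's scoring endpoint could look like from Python. The host name, the path with its instance, model, and deployment IDs, the token step, and the payload shape are all placeholders and assumptions rather than the article's actual code; the exact details depend on your Watson ML service instance and API version, and part two walks through the real thing.

import requests

# Illustrative values only: substitute the credentials and URLs from your own
# Watson ML service instance.
scoring_url = ("https://ibm-watson-ml.mybluemix.net/v3/wml_instances/INSTANCE_ID/"
               "published_models/MODEL_ID/deployments/DEPLOYMENT_ID/online")
token = "..."  # an access token obtained with your Watson ML service credentials

# One row of feature values to score, using the same features the model was trained on
payload = {
    "fields": ["SquareFeet", "Bedrooms"],
    "values": [[2400, 4]]
}

response = requests.post(
    scoring_url,
    json=payload,
    headers={"Authorization": "Bearer " + token})

print(response.json())  # the response contains the model's prediction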
NEXT STEPS

In this post, I gave an overview of what you can use machine learning for, a tool chain that you can use to build end-to-end ML systems, and the path I follow to build them. In part two, we'll follow this path to build an ML system to predict housing prices. I'll show you how to get from a raw data set to a REST API with just a few lines of code.","This is the first installment in a series of posts aimed at introducing developers like me and you to the basic machine learning concepts and tools required to get an ML system up and running.","Watson Machine Learning for Developers",Live,122 312,"WHAT'S ALL THE HOOPLA ABOUT GRAPH DATABASES? Lauren Schaefer / October 7, 2016. When it's time to choose the database technology for your app, the choices can be overwhelming. Should you choose SQL or NoSQL? Open source or proprietary? Self-hosted or hosted? If you're not already familiar with graph databases, you might be tempted to ignore them as an option. But that could be a mistake. Here's why: If you want to try a graph database, getting started can get very complicated. Check out my latest video that shows you how to quickly and easily try a graph database: Happy graphing!","When it's time to choose the database technology for your app, the choices can be overwhelming. Here's why you should consider graph databases.",What's all the hoopla about graph databases?,Live,123 315,"Karlijn Willems, Data Science Journalist @DataCamp

PYTHON MACHINE LEARNING: SCIKIT-LEARN TUTORIAL

Originally published at https://www.datacamp.com/community/tutorials/machine-learning-python

Machine learning studies the design of algorithms that can learn. The hope this discipline brings with it is that including experience in its tasks will eventually improve the learning; the ultimate goal is for that improvement to happen in such a way that the learning itself becomes automatic, so that humans no longer need to interfere. You'll probably have already heard that machine learning has close ties to Knowledge Discovery, Data Mining, Artificial Intelligence (AI) and Statistics. Typical use cases of machine learning range from scientific knowledge discovery to more commercial ones: from the "Robot Scientist" to anti-spam filtering and recommender systems. Or maybe, if you haven't heard about this discipline, you'll find it vaguely familiar as one of the 8 topics that you need to master if you want to excel in data science.

This scikit-learn tutorial will introduce you to the basics of Python machine learning: step by step, it will show you how to use Python and its libraries to explore your data with the help of matplotlib, work with the well-known algorithms KMeans and Support Vector Machines (SVM) to construct models, fit the data to these models, predict values, and validate the models that you have built. Note that the code chunks have been left out for convenience. If you want to follow and practice with code, go here. If you're more interested in an R tutorial, check out our Machine Learning with R for Beginners tutorial.

LOADING YOUR DATA

The first step to just about anything in data science is loading in your data.
This is also the starting point of this tutorial. If you’re new to this and you want to start problems on your own, finding data sets might prove to be a challenge. However, you can typically find good data sets at the UCI Machine Learning Repository or on the Kaggle website. Also, check out this KD Nuggets list with resources . For now, you just load in the digits dataset that comes with a Python library, called scikit-learn . No need to go and look for datasets yourself. Fun fact: did you know the name originates from the fact that this library is a scientific toolbox built around SciPy? By the way, there is more than just one scikit out there. This scikit contains modules specifically for machine learning and data mining, which explains the second component of the library name. :) To load in the data, you import the module datasets from sklearn . Then, you can use the load_digits() method from datasets to load in the data. Note that the datasets module contains other methods to load and fetch popular reference datasets, and you can also count on this module in case you need artificial data generators. In addition, this data set is also available through the UCI Repository that was mentioned above: you can find the data here . You’ll load in this data with the help of the pandas library. When you first start working with a dataset, it’s always a good idea to go through the data description and see what you can already learn. When it comes to scikit-learn , you don’t immediately have this information readily available, but in the case where you import data from another source, there's usually a data description present, which will already be a sufficient amount of information to gather some insights into your data. However, these insights are not merely deep enough for the analysis that you are going to perform. You really need to have a good working knowledge about the data set. Performing an exploratory data analysis (EDA) on a data set like the one that this tutorial now has might seem difficult. You should start with gathering the basic information: you already have knowledge of things such as the target values and the description of your data. You can access the digits data through the attribute data . Similarly, you can also access the target values or labels through the target attribute and the description through the DESCR attribute. To see which keys you have available to already get to know your data, you can just run digits.keys() . The next thing that you can (double)check is the type of your data. If you used read_csv() to import the data, you would have had a data frame that contains just the data. There wouldn’t be any description component, but you would be able to resort to, for example, head() or tail() to inspect your data. In these cases, it’s always wise to read up on the data description folder! However, this tutorial assumes that you make use of the library’s data and the type of the digits variable is not that straightforward if you’re not familiar with the library. Look at the print out in the first code chunk. You’ll see that digits actually contains numpy arrays! This is already quite some important information. But how do you access these arays? It’s very easy, actually: you use attributes to access the relevant arrays. Remember that you have already seen which attributes are available when you printed digits.keys() . 
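Since the tutorial's code chunks were left out, here is a minimal sketch of the loading step just described, assuming only that scikit-learn is installed; it is a reconstruction for illustration rather than the tutorial's exact code.

# Load the built-in digits data set and see what it contains.
from sklearn import datasets

digits = datasets.load_digits()

# Which keys/attributes are available to explore?
print(digits.keys())

# The data behind those attributes is stored in numpy arrays.
print(type(digits.data))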
For instance, you have the data attribute to isolate the data, target to see the target values and the DESCR for the description, … But what then? The first thing that you should know of an array is its shape. That is, the number of dimensions and items that is contained within an array. The array’s shape is a tuple of integers that specify the sizes of each dimension. Now let’s try to see what the shape is of these three arrays that you have distinguished (the data , target and DESCR arrays). Use first the data attribute to isolate the numpy array from the digits data and then use the shape attribute to find out more. You can do the same for the target and DESCR . There’s also the images attribute, which is basically the data in images. To recap: by inspecting digits.data , you see that there are 1797 samples and that there are 64 features. Because you have 1797 samples, you also have 1797 target values. But all those target values contain 10 unique values, namely, from 0 to 9. In other words, all 1797 target values are made up of numbers that lie between 0 and 9. This means that the digits that your model will need to recognize are numbers from 0 to 9. Lastly, you see that the images data contains three dimensions: there are 1797 instances that are 8 by 8 pixels big. Then, you can take your exploration up a notch by visualizing the images that you’ll be working with. You can use one of Python’s data visualization libraries, such as matplotlib : On a more simple note, you can also visualize the target labels with an image: Now you know a very good idea of the data that you’ll be working with! But is there no other way to visualize the data? As the digits data set contains 64 features, this might prove to be a challenging task. You can imagine that it’s very hard to understand the structure and keep the overview of the digits data. In such cases, it is said that you’re working with a high dimensional data set. High dimensionality of data is a direct result of trying to describe the objects via a collection of features. Other examples of high dimensional data are, for example, financial data, climate data, neuroimaging, … But, as you might have gathered already, this is not always easy. In some cases, high dimensionality can be problematic, as your algorithms will need to take into account too many features. In such cases, you speak of the curse of dimensionality. Because having a lot of dimensions can also mean that your data points are far away from virtually every other point, which makes the distances between the data points uninformative. Dont’ worry, though, because the curse of dimensionality is not simply a matter of counting the number of features. There are also cases in which the effective dimensionality might be much smaller than the number of the features, such as in data sets where some features are irrelevant. In addition, you can also understand that data with only two or three dimensions is easier to grasp and can also be visualized easily. That all explains why you’re going to visualize the data with the help of one of the Dimensionality Reduction techniques, namely Principal Component Analysis (PCA). The idea in PCA is to find a linear combination of the two variables that contains most of the information. This new variable or “principal component” can replace the two original variables. In short, it’s a linear transformation method that yields the directions (principal components) that maximize the variance of the data. 
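Before moving on to the PCA discussion that continues below, here is a rough sketch of the shape checks and the matplotlib image plot described above. The figure size and the number of images shown are arbitrary choices of mine, not the tutorial's.

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Shape checks: 1797 samples with 64 features, 1797 target labels,
# and 1797 images of 8 x 8 pixels.
print(digits.data.shape)    # (1797, 64)
print(digits.target.shape)  # (1797,)
print(digits.images.shape)  # (1797, 8, 8)

# Show a handful of the images together with their target labels.
fig, axes = plt.subplots(1, 6, figsize=(9, 2))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap='gray_r')
    ax.set_title(label)
    ax.axis('off')
plt.show()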
Remember that the variance indicates how far a set of data points lie apart. If you want to know more, go to this page . You can easily apply PCA do your data with the help of scikit-learn. Tip : you have used the RandomizedPCA() here because it performs better when there’s a high number of dimensions. Try replacing the randomized PCA model or estimator object with a regular PCA model and see what the difference is. Note how you explicitly tell the model to only keep two components. This is to make sure that you have two-dimensional data to plot. Also, note that you don’t pass the target class with the labels to the PCA transformation because you want to investigate if the PCA reveals the distribution of the different labels and if you can clearly separate the instances from each other. You can now build a scatterplot to visualize the data: Again you use matplotlib to visualize the data. It’s good for a quick visualization of what you’re working with, but you might have to consider something a little bit more fancy if you’re working on making this part of your data science portfolio. Also note that the last call to show the plot ( plt.show() ) is not necessary if you’re working in Jupyter Notebook, as you’ll want to put the images inline. When in doubt, you can always check out our Definitive Guide to Jupyter Notebook . WHERE TO GO NOW? Now that you have even more information about your data and you have a visualization ready, it does seem a bit like the data points sort of group together, but you also see there is quite some overlap. This might be interesting to investigate further. Do you think that, in a case where you knew that there are 10 possible digits labels to assign to the data points, but you have no access to the labels, the observations would group or “cluster” together by some criterion in such a way that you could infer the lables? Now this is a research question! In general, when you have acquired a good understanding of your data, you have to decide on the use cases that would be relevant to your data set. In other words, you think about what your data set might teach you or what you think you can learn from your data. From there on, you can think about what kind of algorithms you would be able to apply to your data set in order to get the results that you think you can obtain. Tip: the more familiar you are with your data, the easier it will be to assess the use cases for your specific data set. The same also holds for finding the appropriate machine algorithm. However, when you’re first getting started with scikit-learn , you’ll see that the amount of algorithms that the library contains is pretty vast and that you might still want additional help when you’re doing the assessment for your data set. That’s why this scikit-learn machine learning map will come in handy. Note that this map does require you to have some knowledge about the algorithms that are included in the scikit-learn library. This, by the way, also holds some truth for taking this next step in your project: if you have no idea what is possible, it will be very hard to decide on what your use case will be for the data. As your use case was one for clustering, you can follow the path on the map towards “KMeans”. You’ll see the use case that you have just thought about requires you to have more than 50 samples (“check!”), to have labeled data (“check!”), to know the number of categories that you want to predict (“check!”) and to have less than 10K samples (“check!”). 
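Before the K-Means walkthrough that follows, here is a sketch of the PCA projection and scatterplot described above. Note that the RandomizedPCA estimator mentioned in the text has since been folded into PCA in newer scikit-learn releases, so this sketch uses PCA with svd_solver='randomized'; that substitution is mine, not the tutorial's.

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Keep two components so the data can be plotted in 2D.
pca = PCA(n_components=2, svd_solver='randomized')
reduced_data = pca.fit_transform(digits.data)

# Scatter the projected points, coloured by their true digit label.
plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
            c=digits.target, cmap='tab10', s=10)
plt.colorbar(label='digit label')
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()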
But what exactly is the K-Means algorithm? It is one of the simplest and widely used unsupervised learning algorithms to solve clustering problems. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters that you have set before you run the algorithm. This number of clusters is called k and you select this number at random. Then, the k-means algorithm will find the nearest cluster center for each data point and assign the data point closest to that cluster. Once all data points have been assigned to clusters, the cluster centers will be recomputed. In other words, new cluster centers will emerge from the average of the values of the cluster data points. This process is repeated until most data points stick to the same cluster. The cluster membership should stabilize. You can already see that, because the k-means algorithm works the way it does, the initial set of cluster centers that you give up can have a big effect on the clusters that are eventually found. You can, of course, deal with this effect, as you will see further on. However, before you can go into making a model for your data, you should definitely take a look into preparing your data for this purpose. As you have read in the previous section, before modeling your data, you’ll do well by preparing it first. This preparation step is called “preprocessing”. The first thing that we’re going to do is preprocessing the data. You can standardize the digits data by, for example, making use of the scale() method. By scaling the data, you shift the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). In order to assess your model’s performance later, you will also need to divide the data set into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained system. In practice, the division of your data set into a test and a training sets is disjoint: the most common splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that remains will compose the test set. You will try to do this also here. You see in the code chunk below that this ‘traditional’ splitting choice is respected: in the arguments of the train_test_split() method, you clearly see that the test_size is set to 0.25 . You’ll also note that the argument random_state has the value 42 assigned to it. With this argument, you can guarantee that your split will always be the same. That is particularly handy if you want reproducible results. After you have split up your data set into train and test sets, you can quickly inspect the numbers before you go and model the data: You’ll see that the training set X_train now contains 1347 samples, which is exactly 2/3d of the samples that the original data set contained, and 64 features, which hasn’t changed. The y_train training set also contains 2/3d of the labels of the original data set. This means that the test sets X_train and y_train contain 450 samples. After all these preparation steps, you have made sure that all your known (training) data is stored. No actual model or learning was performed up until this moment. Now, it’s finally time to find those clusters of your training set. Use KMeans() from the cluster module to set up your model. You’ll see that there are three arguments that are passed to this method: init , n_clusters and the random_state . 
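The next paragraphs unpack those three arguments. As a reference point, here is a sketch of the preprocessing, splitting, and K-Means setup just described; the variable names mirror the tutorial's, but the snippet itself is a reconstruction under the stated 25% test split.

from sklearn import cluster, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()

# Standardize each feature to zero mean and unit variance.
data = scale(digits.data)

# Hold out 25% of the samples as a test set; random_state makes the
# split reproducible from run to run.
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

# Set up the K-Means estimator with the three arguments discussed next.
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)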
You might still remember this last argument from before when you split the data into training and test sets. This argument basically guaranteed that you got reproducible results. The init indicates the method for initialization and even though it defaults to ‘k-means++’ , you see it explicitly coming back in the code. That means that you can leave it out if you want. Try it out in the DataCamp Light chunk above! Next, you also see that the n_clusters argument is set to 10 . This number not only indicates the number of clusters or groups you want your data to form, but also the number of centroids to generate. Remember that a cluster centroid is the middle of a cluster. Do you also still remember how the previous section described this as one of the possible disadvantages of the K-Means algorithm? That is, that the initial set of cluster centers that you give up can have a big effect on the clusters that are eventually found? Usually, you try to deal with this effect by trying several initial sets in multiple runs and by selecting the set of clusters with the minimum sum of the squared errors (SSE). In other words, you want to minimize the distance of each point in the cluster to the mean or centroid of that cluster. By adding the n-init argument to KMeans() , you can determine how many different centroid configurations the algorithm will try. Note again that you don’t want to insert the test labels when you fit the model to your data: these will be used to see if your model is good at predicting the actual classes of your instances! You can also visualize the images that make up the cluster centers: If you want to see another example that visualizes the data clusters and their centers, go here . The next step is to predict the labels of the test set. You predict the values for the test set, which contains 450 samples. You store the result in y_pred . You also print out the first 100 instances of y_pred and y_test and you immediately see some results. In addition, you can study the shape of the cluster centers: you immediately see that there are 10 clusters with each 64 features. But this doesn’t tell you much because we set the number of clusters to 10 and you already knew that there were 64 features. Maybe a visualization would be more helpful: Tip : run the code from above again, but use the PCA reduction method: At first sight, the visualization doesn’t seem to indicate that the model works well. This needs some further investigation. And this need for further investigation brings you to the next essential step, which is the evaluation of your model’s performance. In other words, you want to analyze the degree of correctness of the model’s predictions. You should look at the confusion matrix. Then, you should try to figure out something more about the quality of the clusters by applying different cluster quality metrics. That way, you can judge the goodness of fit of the cluster labels to the correct labels. There are quite some metrics to consider: * The homogeneity score * The completeness score * The V-measure score * The adjusted Rand score * The Adjusted Mutual Info (AMI) score * The silhouette score But also these scores aren’t fantastic. Clearly, you should consider another estimator to predict the labels for the digits data. When you recapped all of the information that you gathered out of the data exploration, you saw that you could build a model to predict which group a digit belongs to without you knowing the labels. 
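Before the classifier comparison continues below, here is what the evaluation step described above might look like in code: predicted cluster labels for the held-out digits, a confusion matrix, and the cluster-quality metrics listed. It repeats the earlier setup so it runs on its own, and it is a reconstruction rather than the tutorial's exact code.

from sklearn import cluster, datasets, metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.25, random_state=42)

# Fit the clustering model on the training data only.
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

# Predict cluster labels for the test set and compare with the true digits.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# The cluster-quality metrics mentioned in the text.
print('homogeneity   %.3f' % metrics.homogeneity_score(y_test, y_pred))
print('completeness  %.3f' % metrics.completeness_score(y_test, y_pred))
print('v-measure     %.3f' % metrics.v_measure_score(y_test, y_pred))
print('adjusted Rand %.3f' % metrics.adjusted_rand_score(y_test, y_pred))
print('AMI           %.3f' % metrics.adjusted_mutual_info_score(y_test, y_pred))
print('silhouette    %.3f' % metrics.silhouette_score(X_test, y_pred))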
And indeed, you just used the training data and not the target values to build your KMeans model. Let’s assume that you depart from the case where you use both the digits training data and the corresponding target values to build your model. If you follow the algorithm map, you’ll see that the first model that you meet is the linear SVC. Let’s apply this to our data. You see here that you make use of X_train and y_train to fit the data to the SVC model. This is clearly different from clustering. Note also that in this example, you set the value of gamma manually. It is possible to automatically find good values for the parameters by using tools such as grid search and cross validation. Even though this is not the focus of this tutorial, you will see how you could have gone about this if you would have made use of grid search to adjust your parameters. For a walkthrough on how you should apply grid search, I refer you to the original tutorial . You see that in the SVM classifier has a kernel argument that specifies the kernel type that you’re going to use in the algorithm. By default, this is rbf . In other cases, you can specify others such as linear , poly , … But what is a kernel exactly? A kernel is a similarity function, which is used to compute similarity between the training data points. When you provide a kernel to an algorithm, together with the training data and the labels, you will get a classifier, as is the case here. You will have trained a model that assigns new unseen objects into a particular category. For the SVM, you will typicall try to linearly divide your data points. You can now visualize the images and their predicted labels. This plot is very similar to the plot that you made when you were exploring the data: But now the biggest question: how does this model perform? You clearly see that this model performs a whole lot better than the clustering model that you used earlier. You can also see it when you visualize the predicted and the actual labels: You’ll see that this visualization confirms your classification report, which is very good news. :) WHAT’S NEXT IN YOUR DATA SCIENCE JOURNEY? Congratulations, you have reached the end of this scikit-learn tutorial, which was meant to introduce you to Python machine learning! Now it’s your turn. Start your own digit recognition project with different data. One dataset that you can already use is the MNIST data, which you can download here . The steps that you will need to take are very similar to the ones that you have gone through with this tutorial, but if you still feel that you can use some help, you should check out this page , which works with the MNIST data and applies the KMeans algorithm. Working with the digits dataset was the first step in classifying characters with scikit-learn . If you’re done with this, you might consider trying out an even more challenging problem, namely, classifying alphanumeric characters in natural images. A well-known dataset that you can use for this problem is the Chars74K dataset, which contains more than 74,000 images of digits from 0 to 9 and the both lowercase and higher case letters of the English alphabet. You can download the dataset here . Whether you’re going to start with the projects that have been mentioned above or not, this is definitely not the end of your journey of data science with Python. 
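As a companion to the SVC discussion above, and before the closing pointers that follow, here is a sketch of fitting the classifier and printing a classification report. The gamma=0.001 value is a common manual choice for the digits data, but it is my assumption here rather than a value taken from the tutorial.

from sklearn import datasets, svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# Fit a support vector classifier with a manually chosen gamma
# (grid search / cross-validation could tune this instead).
svc_model = svm.SVC(gamma=0.001, kernel='rbf')
svc_model.fit(X_train, y_train)

# How well does it do on the held-out digits?
predicted = svc_model.predict(X_test)
print(classification_report(y_test, predicted))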
If you choose not to widen your view just yet, consider deepening your data visualization and data manipulation knowledge: don’t miss out on DataCamp’s Interactive Data Visualization with Bokeh course to make sure you can impress your peers with a stunning data science portfolio, or DataCamp’s pandas Foundation course to learn more about working with data frames in Python. -------------------------------------------------------------------------------- Originally published at www.datacamp.com . Data Science Machine Learning Python Scikit Learn KARLIJN WILLEMS Data Science Journalist @DataCamp",Machine learning studies the design of algorithms that can learn. The hope that this discipline brings with itself is that the inclusion of experience into its tasks will eventually improve the…,Python Machine Learning: Scikit-Learn Tutorial,Live,124 316,"Christopher Roach STATISTICS FOR HACKERS 18 January 2017 | Statistics MOTIVATION ¶ There's no shortage of absolutely magnificent material out there on the topics of data science and machine learning for an autodidact, such as myself, to learn from. In fact, so many great resources exist that an individual can be forgiven for not knowing where to begin their studies, or for getting distracted once they're off the starting block. I honestly can't count the number of times that I've started working through many of these online courses and tutorials only to have my attention stolen by one of the multitudes of amazing articles on data analysis with Python, or some great new MOOC on Deep Learning. But this year is different! This year, for one of my new year's resolutions, I've decided to create a personalized data science curriculum and stick to it. This year, I promise not to just casually sign up for another course, or start reading yet another textbook to be distracted part way through. This year, I'm sticking to the plan. As part of my personalized program of study, I've chosen to start with Harvard's Data Science course . I'm currently on week 3 and one of the suggested readings for this week is Jake VanderPlas' talk from PyCon 2016 titled ""Statistics for Hackers"". As I was watching the video and following along with the slides , I wanted to try out some of the examples and create a set of notes that I could refer to later, so I figured why not create a Jupyter notebook. Once I'd finished, I realized I'd created a decently-sized resource that could be of use to others working their way through the talk. The result is the article you're reading right now, the remainder of which contains my notes and code examples for Jake's excellent talk. So, enjoy the article, I hope you find this resource useful, and if you have any problems or suggestions of any kind, the full notebook can be found on github , so please send me a pull request , or submit an issue , or just message me directly on Twitter . PRELIMINARIES ¶ In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Suppress all warnings just to keep the notebook nice and clean.
# This must happen after all imports since numpy actually adds its# RankWarning class back in.importwarningswarnings.filterwarnings(""ignore"")# Setup the look and feel of the notebooksns.set_context(""notebook"",font_scale=1.5,rc={""lines.linewidth"":2.5})sns.set_style('whitegrid')sns.set_palette('deep')# Create a couple of colors to use throughout the notebookred=sns.xkcd_rgb['vermillion']blue=sns.xkcd_rgb['dark sky blue']fromIPython.displayimportdisplay%matplotlib inline %config InlineBackend.figure_format = 'retina' WARM-UP ¶ The talk starts off with a motivating example that asks the question ""If you toss a coin 30 times and see 22 heads, is it a fair coin?"" We all know that a fair coin should come up heads roughly 15 out of 30 tosses, give or take, so it does seem unlikely to see so many heads. However, the skeptic might argue that even a fair coin could show 22 heads in 30 tosses from time-to-time. This could just be a chance event. So, the question would then be ""how can you determine if you're tossing a fair coin?"" THE CLASSIC METHOD ¶ The classic method would assume that the skeptic is correct and would then test the hypothesis (i.e., the Null Hypothesis ) that the observation of 22 heads in 30 tosses could happen simply by chance. Let's start by first considering the probability of a single coin flip coming up heads and work our way up to 22 out of 30. $$ P(H) = \frac{1}{2} $$As our equation shows, the probability of a single coin toss turning up heads is exactly 50% since there is an equal chance of either heads or tails turning up. Taking this one step further, to determine the probability of getting 2 heads in a row with 2 coin tosses, we would need to multiply the probability of getting heads by the probability of getting heads again since the two events are independent of one another. $$ P(HH) = P(H) \cdot P(H) = P(H)^2 = \left(\frac{1}{2}\right)^2 = \frac{1}{4} $$From the equation above, we can see that the probability of getting 2 heads in a row from a total of 2 coin tosses is 25%. Let's now take a look at a slightly different scenario and calculate the probability of getting 2 heads and 1 tails with 3 coin tosses. $$ P(HHT) = P(H)^2 \cdot P(T) = \left(\frac{1}{2}\right)^2 \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^3 = \frac{1}{8} $$The equation above tells us that the probability of getting 2 heads and 1 tails in 3 tosses is 12.5%. This is actually the exact same probability as getting heads in all three tosses, which doesn't sound quite right. The problem is that we've only calculated the probability for a single permutation of 2 heads and 1 tails; specifically for the scenario where we only see tails on the third toss. To get the actual probability of tossing 2 heads and 1 tails we will have to add the probabilities for all of the possible permutations, of which there are exactly three: HHT, HTH, and THH. $$ P(2H,1T) = P(HHT) + P(HTH) + P(THH) = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8} $$Another way we could do this is to calculate the total number of permutations and simply multiply that by the probability of each event happening. To get the total number of permutations we can use the binomial coefficient . Then, we can simply calculate the probability above using the following equation. 
$$ P(2H,1T) = \binom{3}{2} \left(\frac{1}{2}\right)^{3} = 3 \left(\frac{1}{8}\right) = \frac{3}{8} $$While the equation above works in our particular case, where each event has an equal probability of happening, it will run into trouble with events that have an unequal chance of taking place. To deal with those situations, you'll want to extend the last equation to take into account the differing probabilities. The result would be the following equation, where $N$ is the number of coin flips, $N_H$ is the number of expected heads, $N_T$ is the number of expected tails, and $P_H$ is the probability of getting heads on each flip. $$ P(N_H,N_T) = \binom{N}{N_H} \left(P_H\right)^{N_H} \left(1 - P_H\right)^{N_T} $$Now that we understand the classic method, let's use it to test our null hypothesis that we are actually tossing a fair coin, and that this is just a chance occurrence. The following code implements the equations we've just discussed above. In [2]:
def factorial(n):
    """"""Calculates the factorial of `n` """"""
    vals = list(range(1, n + 1))
    if len(vals) <= 0:
        return 1
    prod = 1
    for val in vals:
        prod *= val
    return prod

def n_choose_k(n, k):
    """"""Calculates the binomial coefficient """"""
    return factorial(n) / (factorial(k) * factorial(n - k))

def binom_prob(n, k, p):
    """"""Returns the probability of seeing `k` heads in `n` coin tosses

    Arguments:
    n - number of trials
    k - number of trials in which an event took place
    p - probability of an event happening
    """"""
    return n_choose_k(n, k) * p**k * (1 - p)**(n - k)

Now that we have a method that will calculate the probability for a specific event happening (e.g., 22 heads in 30 coin tosses), we can calculate the probability for every possible outcome of flipping a coin 30 times, and if we plot these values we'll get a visual representation of our coin's probability distribution. In [3]:
# Calculate the probability for every possible outcome of tossing
# a fair coin 30 times.
probabilities = [binom_prob(30, k, 0.5) for k in range(1, 31)]

# Plot the probability distribution using the probabilities list
# we created above.
plt.step(range(1, 31), probabilities, where='mid', color=blue)
plt.xlabel('number of heads')
plt.ylabel('probability')
plt.plot((22, 22), (0, 0.1599), color=red)
plt.annotate('0.8%',
             xytext=(25, 0.08),
             xy=(22, 0.08),
             multialignment='right',
             va='center',
             color=red,
             size='large',
             arrowprops={'arrowstyle': '-|>', 'lw': 2,
                         'color': red, 'shrinkA': 10});

The visualization above shows the probability distribution for flipping a fair coin 30 times. Using this visualization we can now determine the probability of getting, say for example, 12 heads in 30 flips, which looks to be about 8%. Notice that we've labeled our example of 22 heads as 0.8%. If we look at the probability of flipping exactly 22 heads, it looks to be a little less than 0.8%; in fact, if we calculate it using the binom_prob function from above, we get 0.5%. In [4]:
print(""Probability of flipping 22 heads: %0.1f%%"" % (binom_prob(30, 22, 0.5) * 100))

Probability of flipping 22 heads: 0.5% So, then why do we have 0.8% labeled in our probability distribution above? Well, that's because we are showing the probability of getting at least 22 heads, which is also known as the p-value. WHAT'S A P-VALUE? ¶ In statistical hypothesis testing we have an idea that we want to test, but considering that it's very hard to prove something to be true beyond doubt, rather than test our hypothesis directly, we formulate a competing hypothesis, called a null hypothesis , and then try to disprove it instead.
The null hypothesis essentially assumes that the effect we're seeing in the data could just be due to chance. In our example, the null hypothesis assumes we have a fair coin, and the way we determine if this hypothesis is true or not is by calculating how often flipping this fair coin 30 times would result in 22 or more heads. If we then take the number of times that we got 22 or more heads and divide that number by the total of all possible permutations of 30 coin tosses, we get the probability of tossing 22 or more heads with a fair coin. This probability is what we call the p-value . The p-value is used to check the validity of the null hypothesis. The way this is done is by agreeing upon some predetermined upper limit for our p-value, below which we will assume that our null hypothesis is false. In other words, if our null hypothesis were true, and 22 heads in 30 flips could happen often enough by chance, we would expect to see it happen more often than the given threshold percentage of times. So, for example, if we chose 10% as our threshold, then we would expect to see 22 or more heads show up at least 10% of the time to determine that this is a chance occurrence and not due to some bias in the coin. Historically, the generally accepted threshold has been 5%, and so if our p-value is less than 5%, we can then make the assumption that our coin may not be fair. The binom_prob function from above calculates the probability of a single event happening, so now all we need for calculating our p-value is a function that adds up the probabilities of a given event, or a more extreme event, happening. So, as an example, we would need a function to add up the probabilities of getting 22 heads, 23 heads, 24 heads, and so on. The next bit of code creates that function and uses it to calculate our p-value. In [5]:
def p_value(n, k, p):
    """"""Returns the p-value for seeing at least `k` events in `n` trials """"""
    return sum(binom_prob(n, i, p) for i in range(k, n + 1))

print(""P-value: %0.1f%%"" % (p_value(30, 22, 0.5) * 100))

P-value: 0.8% Running the code above gives us a p-value of roughly 0.8%, which matches the value in our probability distribution above and is also less than the 5% threshold needed to reject our null hypothesis, so it does look like we may have a biased coin. THE EASIER METHOD ¶ That's an example of using the classic method for testing if our coin is fair or not. However, if you don't happen to have at least some background in statistics, it can be a little hard to follow at times, but luckily for us, there's an easier method... Simulation! The code below seeks to answer the same question of whether or not our coin is fair by running a large number of simulated coin flips and calculating the proportion of these experiments that resulted in 22 heads or more. In [6]:
M = 0
n = 50000
for i in range(n):
    trials = np.random.randint(2, size=30)
    if trials.sum() >= 22:
        M += 1
p = M / n

print(""Simulated P-value: %0.1f%%"" % (p * 100))

Simulated P-value: 0.8% The result of our simulations is 0.8%, the exact same result we got earlier when we calculated the p-value using the classical method above. So, it definitely looks like it's possible that we have a biased coin since the chances of seeing 22 or more heads in 30 tosses of a fair coin are less than 1%. FOUR RECIPES FOR HACKING STATISTICS ¶ We've just seen one example of how our hacking skills can make it easy for us to answer questions that typically only a statistician would be able to answer using the classical methods of statistical analysis.
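As a short aside before moving on to the remaining recipes: the hand-rolled binom_prob and p_value functions above can be cross-checked against scipy.stats, assuming SciPy is available alongside the notebook's other imports. This check is my addition, not part of the original notebook.

from scipy import stats

# Probability of exactly 22 heads in 30 tosses of a fair coin.
print('P(22 heads): %0.1f%%' % (stats.binom.pmf(22, 30, 0.5) * 100))

# Probability of 22 or more heads, i.e. the p-value computed above.
# sf(21, ...) is the survival function P(X > 21), which equals P(X >= 22).
print('p-value:     %0.1f%%' % (stats.binom.sf(21, 30, 0.5) * 100))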
This is just one possible method for answering statistical questions using our coding skills, but Jake's talk describes four recipes in total for ""hacking statistics"", each of which is listed below. The rest of this article will go into each of the remaining techniques in some detail. 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation In the Warm-up section above, we saw an example direct simulation, the first recipe in our tour of statistical hacks. The next example uses the Shuffling method to figure out if there's a statistically significant difference between two different sample populations. SHUFFLING ¶ In this example, we look at the Dr. Seuss story about the Star-belly Sneetches. In this Seussian world, a group of creatures called the Sneetches are divided into two groups: those with stars on their bellies, and those with no ""stars upon thars"". Over time, the star-bellied sneetches have come to think of themselves as better than the plain-bellied sneetches. As researchers of sneetches, it's our job to uncover whether or not star-bellied sneetches really are better than their plain-bellied cousins. The first step in answering this question will be to create our experimental data. In the following code snippet we create a dataframe object that contains a set of test scores for both star-bellied and plain-bellied sneetches. In [7]:importpandasaspddf=pd.DataFrame({'star':[1,1,1,1,1,1,1,1]+[0,0,0,0,0,0,0,0,0,0,0,0],'score':[84,72,57,46,63,76,99,91]+[81,69,74,61,56,87,69,65,66,44,62,69]})df Out[7]: score star 0 84 1 1 72 1 2 57 1 3 46 1 4 63 1 5 76 1 6 99 1 7 91 1 8 81 0 9 69 0 10 74 0 11 61 0 12 56 0 13 87 0 14 69 0 15 65 0 16 66 0 17 44 0 18 62 0 19 69 0If we then take a look at the average scores for each group of sneetches, we will see that there's a difference in scores of 6.6 between the two groups. So, on average, the star-bellied sneetches performed better on their tests than the plain-bellied sneetches. But, the real question is, is this a significant difference? In [8]:star_bellied_mean=df[df.star==1].score.mean()plain_bellied_mean=df[df.star==0].score.mean()print(""Star-bellied Sneetches Mean: %2.1f""%star_bellied_mean)print(""Plain-bellied Sneetches Mean: %2.1f""%plain_bellied_mean)print(""Difference: %2.1f""%(star_bellied_mean-plain_bellied_mean)) Star-bellied Sneetches Mean: 73.5 Plain-bellied Sneetches Mean: 66.9 Difference: 6.6 To determine if this is a signficant difference, we could perform a t-test on our data to compute a p-value, and then just make sure that the p-value is less than the target 0.05. Alternatively, we could use simulation instead. Unlike our first example, however, we don't have a generative function that we can use to create our probability distribution. So, how can we then use simulation to solve our problem? Well, we can run a bunch of simulations where we randomly shuffle the labels (i.e., star-bellied or plain-bellied) of each sneetch, recompute the difference between the means, and then determine if the proportion of simulations in which the difference was at least as extreme as 6.6 was less than the target 5%. If so, we can conclude that the difference we see is, in fact, one that doesn't occur strictly by chance very often and so the difference is a significant one. 
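A brief aside, and my addition rather than part of the original notebook: the classical t-test mentioned a few sentences back is a one-liner with scipy.stats, and it gives a useful baseline to compare the shuffling result against.

from scipy import stats
import pandas as pd

# Same sneetch scores as in the dataframe above.
df = pd.DataFrame({
    'star':  [1, 1, 1, 1, 1, 1, 1, 1] + [0] * 12,
    'score': [84, 72, 57, 46, 63, 76, 99, 91] +
             [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]})

# Classical two-sample t-test on the same scores (Welch's variant).
t_stat, p_val = stats.ttest_ind(df[df.star == 1].score,
                                df[df.star == 0].score,
                                equal_var=False)
print('t = %.2f, p = %.2f' % (t_stat, p_val))

With that baseline noted, the shuffling simulation below tests the same question without leaning on any distributional assumptions.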
In other words, if the proportion of simulations that have a difference of 6.6 or greater is less than 5%, we can conclude that the labels really do matter, and so we can conclude that star-bellied sneetches are ""better"" than their plain-bellied counterparts. In [9]:
df['label'] = df['star']
num_simulations = 10000

differences = []
for i in range(num_simulations):
    np.random.shuffle(df['label'])
    star_bellied_mean = df[df.label == 1].score.mean()
    plain_bellied_mean = df[df.label == 0].score.mean()
    differences.append(star_bellied_mean - plain_bellied_mean)

Now that we've run our simulations, we can calculate our p-value, which is simply the proportion of simulations that resulted in a difference greater than or equal to 6.6. $$ p = \frac{N_{\geq 6.6}}{N_{total}} = \frac{1512}{10000} = 0.15 $$ In [10]:
p_value = sum(diff >= 6.6 for diff in differences) / num_simulations
print(""p-value: %2.2f"" % p_value)

p-value: 0.15 The following code plots the distribution of the differences we found by running the simulations above. We've also added an annotation that marks where the difference of 6.6 falls in the distribution along with its corresponding p-value. In [11]:
plt.hist(differences, bins=50, color=blue)
plt.xlabel('score difference')
plt.ylabel('number')
plt.plot((6.6, 6.6), (0, 700), color=red)
plt.annotate('%2.f%%' % (p_value * 100),
             xytext=(15, 350),
             xy=(6.6, 350),
             multialignment='right',
             va='center',
             color=red,
             size='large',
             arrowprops={'arrowstyle': '-|>', 'lw': 2,
                         'color': red, 'shrinkA': 10});

We can see from the histogram above---and from our simulated p-value, which was greater than 5%---that the difference that we are seeing between the populations can be explained by random chance, so we can effectively dismiss the difference as not statistically significant. In short, star-bellied sneetches are no better than the plain-bellied ones, at least not from a statistical point of view. For further discussion on this method of simulation, check out John Rauser's keynote talk ""Statistics Without the Agonizing Pain"" from Strata + Hadoop 2014. Jake mentions that he drew inspiration from it in his talk, and it is a really excellent talk as well; I wholeheartedly recommend it. BOOTSTRAPPING ¶ In this example, we'll be using the story of Yertle the Turtle to explore the bootstrapping recipe. As the story goes, in the land of Sala-ma-Sond, Yertle the Turtle was the king of the pond and he wanted to be the most powerful, highest turtle in the land. To achieve this goal, he would stack turtles as high as he could in order to stand upon their backs. As observers of this curious behavior, we've recorded the heights of 20 turtle towers and we've placed them in a dataframe in the following bit of code.
Bootstrap resampling is a method that simulates several random sample distributions by drawing samples from the current distribution with replacement, i.e., we can draw the same data point more than once. Luckily, pandas makes this super easy with its sample function. We simply need to make sure that we pass in True for the replace argument to sample from our dataset with replacement. In [13]:sample=df.sample(20,replace=True)display(sample)print(""Mean: %2.2f""%sample.heights.mean())print(""Standard Error: %2.2f""%(sample.heights.std()/np.sqrt(len(sample)))) heights 9 61 13 21 8 32 6 25 10 19 17 18 4 21 14 23 12 29 4 21 17 18 13 21 4 21 6 25 6 25 11 24 9 61 17 18 3 12 3 12Mean: 25.35 Standard Error: 2.93 More than likely the mean and standard error from our freshly drawn sample above didn't exactly match the one that we calculated using the classic method beforehand. But, if we continue to resample several thousand times and take a look at the average (mean) of all those sample means and their standard deviation, we should have something that very closely approximates the mean and standard error derived from using the classic method above. In [14]:xbar=[]foriinrange(10000):sample=df.sample(20,replace=True)xbar.append(sample.heights.mean())print(""Mean: %2.1f""%np.mean(xbar))print(""Standard Error: %2.1f""%np.std(xbar)) Mean: 28.8 Standard Error: 2.9 CROSS VALIDATION ¶ For the final example, we dive into the world of the Lorax. In the story of the Lorax, a faceless creature sales an item that (presumably) all creatures need called a Thneed. Our job as consultants to Onceler Industries is to project Thneed sales. But, before we can get started forecasting the sales of Thneeds, we'll first need some data. Lucky for you, I've already done the hard work of assembling that data in the code below by ""eyeballing"" the data in the scatter plot from the slides of the talk. So, it may not be exactly the same, but it should be close enough for our example analysis. In [15]:df=pd.DataFrame({'temp':[22,36,36,38,44,45,47,43,44,45,47,49,52,53,53,53,54,55,55,55,56,57,58,59,60,61,61.5,61.7,61.7,61.7,61.8,62,62,63.4,64.6,65,65.6,65.6,66.4,66.9,67,67,67.4,67.5,68,69,70,71,71,71.5,72,72,72,72.7,73,73,73,73.3,74,75,75,77,77,77,77.4,77.9,78,78,79,80,82,83,84,85,85,86,87,88,90,90,91,93,95,97,102,104],'sales':[660,433,475,492,302,345,337,479,456,440,423,269,331,197,283,351,470,252,278,350,253,253,343,280,200,194,188,171,204,266,275,171,282,218,226,187,184,192,167,136,149,168,218,298,199,268,235,157,196,203,148,157,213,173,145,184,226,204,250,102,176,97,138,226,35,190,221,95,211,110,150,152,37,76,56,51,27,82,100,123,145,51,156,99,147,54]}) Now that we have our sales data in a pandas dataframe, we can take a look to see if any trends show up. Plotting the data in a scatterplot, like the one below, reveals that a relationship does seem to exist between temperature and Thneed sales. In [16]:# Grab a reference to fig and axes object so we can reuse themfig,ax=plt.subplots()# Plot the Thneed sales dataax.scatter(df.temp,df.sales)ax.set_xlim(xmin=20,xmax=110)ax.set_ylim(ymin=0,ymax=700)ax.set_xlabel('temprature (F)')ax.set_ylabel('thneed sales (daily)'); We can see what looks like a relationship between the two variables temperature and sales, but how can we best model that relationship so we can accurately predict sales based on temperature? Well, one measure of a model's accuracy is the Root-Mean-Square Error (RMSE) . 
This metric represents the sample standard deviation between a set of predicted values (from our model) and the actual observed values. In [17]:defrmse(predictions,targets):returnnp.sqrt(((predictions-targets)**2).mean()) We can now use our rmse function to measure how well our models' accurately represent the Thneed sales dataset. And, in the next cell, we'll give it a try by creating two different models and seeing which one does a better job of fitting our sales data. In [18]:# 1D Polynomial Fitd1_model=np.poly1d(np.polyfit(df.temp,df.sales,1))d1_predictions=d1_model(range(111))ax.plot(range(111),d1_predictions,color=blue,alpha=0.7)# 2D Polynomial Fitd2_model=np.poly1d(np.polyfit(df.temp,df.sales,2))d2_predictions=d2_model(range(111))ax.plot(range(111),d2_predictions,color=red,alpha=0.5)ax.annotate('RMS error = %2.1f'%rmse(d1_model(df.temp),df.sales),xy=(75,650),fontsize=20,color=blue,backgroundcolor='w')ax.annotate('RMS error = %2.1f'%rmse(d2_model(df.temp),df.sales),xy=(75,580),fontsize=20,color=red,backgroundcolor='w')display(fig); In the figure above, we plotted our sales data along with the two models we created in the previous step. The first model (in blue) is a simple linear model, i.e., a first-degree polynomial . The second model (in red) is a second-degree polynomial, so rather than a straight line, we end up with a slight curve. We can see from the RMSE values in the figure above that the second-degree polynomial performed better than the simple linear model. Of course, the question you should now be asking is, is this the best possible model that we can find? To find out, let's take a look at the RMSE of a few more models to see if we can do any better. In [19]:rmses=[]fordeginrange(15):model=np.poly1d(np.polyfit(df.temp,df.sales,deg))predictions=model(df.temp)rmses.append(rmse(predictions,df.sales))plt.plot(range(15),rmses)plt.ylim(45,70)plt.xlabel('number of terms in fit')plt.ylabel('rms error')plt.annotate('$y = a + bx$',xytext=(14.2,70),xy=(1,rmses[1]),multialignment='right',va='center',arrowprops={'arrowstyle':'-|','lw':1,'shrinkA':10,'shrinkB':3})plt.annotate('$y = a + bx + cx^2$',xytext=(14.2,64),xy=(2,rmses[2]),multialignment='right',va='top',arrowprops={'arrowstyle':'-|','lw':1,'shrinkA':35,'shrinkB':3})plt.annotate('$y = a + bx + cx^2 + dx^3$',xytext=(14.2,58),xy=(3,rmses[3]),multialignment='right',va='top',arrowprops={'arrowstyle':'-|','lw':1,'shrinkA':12,'shrinkB':3}); We can see, from the plot above, that as we increase the number of terms (i.e., the degrees of freedom) in our model we decrease the RMSE, and this behavior can continue indefinitely, or until we have as many terms as we do data points, at which point we would be fitting the data perfectly. The problem with this approach though, is that as we increase the number of terms in our equation, we simply match the given dataset closer and closer, but what if our model were to see a data point that's not in our training dataset? As you can see in the plot below, the model that we've created, though it has a very low RMSE, it has so many terms that it matches our current dataset too closely. 
In [20]:# Remove everything but the datapointsax.lines.clear()ax.texts.clear()# Changing the y-axis limits to match the figure in the slidesax.set_ylim(0,1000)# 14 Dimensional Modelmodel=np.poly1d(np.polyfit(df.temp,df.sales,14))ax.plot(range(20,110),model(range(20,110)),color=sns.xkcd_rgb['sky blue'])display(fig) The problem with fitting the data too closely, is that our model is so finely tuned to our specific dataset, that if we were to use it to predict future sales, it would most likely fail to get very close to the actual value. This phenomenon of too closely modeling the training dataset is well known amongst machine learning practitioners as overfitting and one way that we can avoid it is to use cross-validation . Cross-validation avoids overfitting by splitting the training dataset into several subsets and using each one to train and test multiple models. Then, the RMSE's of each of those models are averaged to give a more likely estimate of how a model of that type would perform on unseen data. So, let's give it a try by splitting our data into two groups and randomly assigning data points into each one. In [21]:df_a=df.sample(n=len(df)/2)df_b=df.drop(df_a.index) We can get a look at the data points assigned to each subset by plotting each one as a different color. In [22]:plt.scatter(df_a.temp,df_a.sales,color='red')plt.scatter(df_b.temp,df_b.sales,color='blue')plt.xlim(0,110)plt.ylim(0,700)plt.xlabel('temprature (F)')plt.ylabel('thneed sales (daily)'); Then, we'll find the best model for each subset of data. In this particular example, we'll fit a second-degree polynomial to each subset and plot both below. In [23]:# Create a 2-degree model for each subset of datam1=np.poly1d(np.polyfit(df_a.temp,df_a.sales,2))m2=np.poly1d(np.polyfit(df_b.temp,df_b.sales,2))fig,(ax1,ax2)=plt.subplots(nrows=1,ncols=2,sharex=False,sharey=True,figsize=(12,5))x_min,x_max=20,110y_min,y_max=0,700x=range(x_min,x_max+1)# Plot the df_a groupax1.scatter(df_a.temp,df_a.sales,color='red')ax1.set_xlim(xmin=x_min,xmax=x_max)ax1.set_ylim(ymin=y_min,ymax=y_max)ax1.set_xlabel('temprature (F)')ax1.set_ylabel('thneed sales (daily)')ax1.plot(x,m1(x),color=sns.xkcd_rgb['sky blue'],alpha=0.7)# Plot the df_b groupax2.scatter(df_b.temp,df_b.sales,color='blue')ax2.set_xlim(xmin=x_min,xmax=x_max)ax2.set_ylim(ymin=y_min,ymax=y_max)ax2.set_xlabel('temprature (F)')ax2.plot(x,m2(x),color=sns.xkcd_rgb['rose'],alpha=0.5); Finally, we'll compare models across subsets by calculating the RMSE for each model using the training set for the other model. This will give us two RMSE scores which we'll then average to get a more accurate estimate of how well a second-degree polynomial will perform on any unseen data. In [24]:print(""RMS = %2.1f""%rmse(m1(df_a.temp),df_a.sales))print(""RMS = %2.1f""%rmse(m2(df_b.temp),df_b.sales))print(""RMS estimate = %2.1f""%np.mean([rmse(m1(df_a.temp),df_a.sales),rmse(m2(df_b.temp),df_b.sales)])) RMS = 55.3 RMS = 49.4 RMS estimate = 52.4 Then, we simply repeat this process for as long as we so desire. The following code repeats the process described above for polynomials up to 14 degrees and plots the average RMSE for each one against the non-cross-validated RMSE's that we calculated earlier. 
In [25]:rmses=[]cross_validated_rmses=[]fordeginrange(15):# df_a the model on the whole dataset and calculate its# RMSE on the same set of datamodel=np.poly1d(np.polyfit(df.temp,df.sales,deg))predictions=model(df.temp)rmses.append(rmse(predictions,df.sales))# Use cross-validation to create the model and df_a itm1=np.poly1d(np.polyfit(df_a.temp,df_a.sales,deg))m2=np.poly1d(np.polyfit(df_b.temp,df_b.sales,deg))p1=m1(df_b.temp)p2=m2(df_a.temp)cross_validated_rmses.append(np.mean([rmse(p1,df_b.sales),rmse(p2,df_a.sales)]))plt.plot(range(15),rmses,color=blue,label='RMS')plt.plot(range(15),cross_validated_rmses,color=red,label='cross validated RMS')plt.ylim(45,70)plt.xlabel('number of terms in fit')plt.ylabel('rms error')plt.legend(frameon=True)plt.annotate('Best model minimizes the\ncross-validated error.',xytext=(7,60),xy=(2,cross_validated_rmses[2]),multialignment='center',va='top',color='blue',size=25,backgroundcolor='w',arrowprops={'arrowstyle':'-|','lw':3,'shrinkA':12,'shrinkB':3,'color':'blue'}); According to the graph above, going from a 1-degree to a 2-degree polynomial gives us quite a large improvement overall. But, unlike the RMSE that we calculated against the training set, when using cross-validation we can see that adding more degrees of freedom to our equation quickly reduces the effectiveness of the model against unseen data. This is overfitting in action! In fact, from the looks of the graph above, it would seem that a second-degree polynomial is actually our best bet for this particular dataset. 2-FOLD CROSS-VALIDATION ¶ Several different methods for performing cross-validation exist, the one we've just seen is called 2-fold cross-validation since the data is split into two subsets. Another close relative is a method called $k$-fold cross-validation . It differs slightly in that the original dataset is divided into $k$ subsets (instead of just 2), one of which is reserved strictly for testing and the other $k - 1$ subsets are used for training models. This is just one example of an alternate cross-validation method, but more do exist and each one has advantages and drawbacks that you'll need to consider when deciding which method to use. CONCLUSION ¶ Ok, so that covers nearly everything that Jake covered in his talk. The end of the talk contains a short overview of some important areas that he didn't have time to cover, and there's a nice set of Q&A at the end, but I'll simply direct you to the video for those parts of the talk. Hopefully, this article/notebook has been helpful to anyone working their way through Jake's talk, and if for some reason, you've read through this entire article and haven't watched the video of the talk yet, I encourage you to take 40 minutes out of your day and go watch it now ---it really is a fantastic talk! FOUND AN ERROR WITH MY ANALYSIS OR A BUG IN MY CODE? Everything on this site is avaliable on GitHub. Head on over and submit an issue. You can also message me directly on Twitter . All work is available on GitHub . Copyright © Christopher Roach, 2017 . Site powered by pelican , theme crafted by Chris Albon ( GitHub ).","Musings on data science and software engineering (and at times, economics as well)",Statistics for Hackers,Live,125 321,"Enterprise Pricing Articles Sign in Free 30-Day TrialRETHINKDB JOINERY Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jul 21, 2016One of the great things about RethinkDB is that it has join functionality baked in as part of the query engine. 
This is, compared to MongoDB, where the ""lookup"" function has been added to the aggregation framework, a much more useful capability which gives a lot more flexibility in designing your data models. That said, there are some pretty important things to bear in mind when you start joining in RethinkDB, and at the top of that list is... THE FASTEST JOIN IS EQJOIN If you are associating documents in one RethinkDB table with documents in another, then the most efficient way is for the document on the left hand side of the association to refer, by id, to the document on the right hand side. That's because the id field for any document is indexed by default, so it's faster to look up. Looking up the value also, by definition, means it's equal. That's what eqJoin (or eq_join if you are working in Python) does, an ""equals join"". Let's work with a solid example; we're going to be using JavaScript and Node.js 6 for these examples. In the Github repository for this article is a node program called populate.js. It assumes you've created a database called spystuff and two tables, agents and orgs . When you run it, it'll insert organization records that look like this: { ""org"": ""MI6"", ""alignment"": { ""country"": ""UK"", ""side"": ""west"" } } into the orgs table, get all the ids of those orgs and then update the agents data, which looks like this: { ""name"": ""James Bond"", ""org"": ""MI6"", ""skill"": [""assassination""] } to include the appropriate organization id numbers (as ""org_id""), remove the ""org"" field and insert the result into the agents table. Now we're set up with a problem that joining tables is made for. We want to get all the agents' data with their organization data in the same document. This is where we'll use the eqJoin function. The orgs table is already indexed by id and if we look at the command at the core of eqjoin.js we find this: r.table(""agents"").eqJoin(""org_id"", r.table(""orgs"")) This starts with the agents table and applies an eqJoin to it, telling it to use the agents' org_id field and look it up in the orgs table. It'll default to using the table's primary key. That command gives records back like this: { ""left"": { ""id"": ""58662b62-6a09-422b-88ad-c4acaabaa29b"", ""name"": ""John Drake"", ""org_id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""skill"": [ ""investigation"" ] }, ""right"": { ""alignment"": { ""country"": ""UK"", ""side"": ""west"" }, ""id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""org"": ""M9"" } } Yes, as it comes out of the join command, the left and the right side are still kept separate. To take care of this, RethinkDB recommends the zip function. If we add a .zip() to our query, like so: r.table(""agents"").eqJoin(""org_id"", r.table(""orgs"")).zip() We get this: { ""alignment"": { ""country"": ""UK"", ""side"": ""west"" }, ""id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""name"": ""John Drake"", ""org"": ""M9"", ""org_id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""skill"": [ ""investigation"" ] } Which looks great until you look a little closer and notice that the id from the organization document has wiped out the id belonging to the agent document.
Not a problem, as there's the without function too and that can get rid of that right hand side id field with .without({""right"": {""id"":true}}) like so: r.table(""agents"").eqJoin(""org_id"", r.table(""orgs"")).without({""right"": {""id"":true}}).zip() and now we get: { ""alignment"": { ""country"": ""UK"", ""side"": ""west"" }, ""id"": ""58662b62-6a09-422b-88ad-c4acaabaa29b"", ""name"": ""John Drake"", ""org"": ""M9"", ""org_id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""skill"": [ ""investigation"" ] } Even though we were deleting the id field on the right hand side, we've retained the organization id field which we were joining on. The RethinkDB documentation on joins shows a couple of other ways you could mitigate this overwriting, but this is the simplest way for simple eqJoins . INDEXES AND EQJOIN The eqJoin function is actually very simple at it's core. It moves through the left hand table using the specified field and simply looks up the value in the index it's been given on the right hand side. By default, that's the id field as that is the primary key and index on the right hand side. But it doesn't have to be that index. You can point at any index that exists for the right hand side and as long as there are values there that match up with the left hand side values, you'll get results for that query. Let's add a table of assets to our spystuff database - you'll find the code for this in populate2.js in the repository . It adds records like this: { ""type"": ""Black Helicopter"", ""use"": [ ""stealth"", ""investigation"" ], ""designer"": ""UN"" } First, let's unite these assets with their organizations. Let's assume that if a country designed an asset, then organizations in that country can use that asset. We'll need to create an index on the designer field of the assets which we can use with eqjoin . We'll do that in the populate2.js file with: r.table(""assets"").indexCreate(""designer"") Now we can do our join - you'll find it in eqjoinindex.js . r.table(""orgs"").eqJoin( r.row(""alignment"")(""country""), r.table(""assets""),{ index:""designer""} ).without({""right"": { ""id"": true }}) .zip() So, taking that step by step, we start with the orgs table and we apply an eqJoin to it. The field we want to join on isn't a top level field so we pass r.row(""alignment"")(""country"") so we can access it. We then tell eqJoin we want to join with the assets table. Here's the new bit; in the last options parameter, we pass a { index:""designer"" } to tell eqJoin to use that index to lookup on, so we're now joining the alignment.country of organizations with the designer of assets which gets us records like this: { ""alignment"": { ""country"": ""USA"", ""side"": ""west"" }, ""designer"": ""USA"", ""id"": ""938201f0-6a24-4f0a-91ee-cc1751df23a4"", ""org"": ""CIA"", ""type"": ""Laser Pen"", ""use"": [ ""management"", ""combat"" ] } Now we can see the CIA has access to laser pens. It's also quite a good example of why you may not want to zip records at all. Let's show another aspect of this secondary index joining; multi-indexes. Those are indexes where the field being indexed is an array of values; when the indexer is told to index, it indexes the record for each one of these values. So how can we use that? Say we want to match our agent's primary skills with the assets they can use. We'll want to index that ""use"" field first. The example code does just that with r.table(""assets"").indexCreate(""use"",{ multi: true } ) . 
The multi:true part lets the index work with the array as discrete values. With that index in place, let's make a join query: r.table('agents').eqJoin(r.row(""skill"")(0),r.table(""assets""),{ index:""use"" }).zip() There are a whole lot of things happening here. The r.row(""skill"")(0) is referring to the first value in the array of values in the ""skill"" field. This is closer to being a function than a reference, and it is worth noting that eqJoin can take a function to create the value to match with. We point at the ""assets"" table as the right hand side and we tell it to use the index we created with { index:""use"" } . There is one other option, by the way: ""ordered"", which when set to true will sort according to the left hand side's input - we're just not using it here. Anyway, the effect of this is that when the first skill of an agent is present in the ""use"" array of an asset document, the two documents will be joined; we've added a zip to merge the fields and get something like: { ""designer"": ""Global"", ""id"": ""cd8aefc6-5442-499d-84bc-9fb85172b6f8"", ""name"": ""Chuck Bartowski"", ""org_id"": ""11a662d3-0477-4c96-a0d4-3ceebc0c29a4"", ""skill"": [ ""investigation"", ""stealth"" ], ""type"": ""Microdrone"", ""use"": [ ""investigation"", ""stealth"", ""assassination"" ] } We could do another eqJoin against the orgs table - three way joins are easy enough - but that demonstrates the flexibility of the eqJoin function. WHAT OF INNER_JOIN AND OUTER_JOIN? There are other join functions - innerJoin and outerJoin - but they are slower and less efficient than eqJoin . Both use a function which evaluates to true or false. That means, though, that there are no index lookups on the right hand side as there are with eqJoin - for each row on the left hand side, the right hand side is scanned and the function evaluated. So it's slower. On the up side, if it's a join you want to do that isn't based on a simple equality of fields, these are the functions you are looking for. We could do something similar to the previous eqJoin command, without the index, like so: r.table('agents').innerJoin(r.table('assets'), (agrow, asrow) => { return agrow('skill').setIntersection(asrow('use')).count().ge(2) }) What we are doing here is a set intersection between the agent's skill array and the asset's use array, returning true if two or more items are in the intersection - which turns out to match one agent with two assets. Powerful, but you'll take a hit in terms of performance. Remember these tiny tables we're using are living in the cache, probably the processor cache even - when scaled up, you could really pay the price in performance. JOIN POWER So we've looked at RethinkDB's join functions and as you can see they deliver what we typically need from a join function: a simple binding between records based on the equality of fields. It's simple, quick and clear. Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan's author page and keep reading. 
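For anyone following along in Python rather than JavaScript - the article already notes that eqJoin is spelled eq_join in the Python driver - here is a rough, unofficial sketch of the same two joins using the rethinkdb Python package. The connection details are placeholders, and the spystuff database, tables and indexes are assumed to have been created by the populate scripts above.

# Rough Python-driver sketch of the joins above; connection details are placeholders.
import rethinkdb as r

conn = r.connect(host='localhost', port=28015, db='spystuff')

# eqJoin / without / zip, as in eqjoin.js
agents_with_orgs = (r.table('agents')
                    .eq_join('org_id', r.table('orgs'))
                    .without({'right': {'id': True}})
                    .zip()
                    .run(conn))

# innerJoin with a predicate: join an agent to any asset that shares
# at least two values between the agent's 'skill' and the asset's 'use'
agents_with_assets = (r.table('agents')
                      .inner_join(r.table('assets'),
                                  lambda agent, asset: agent['skill']
                                      .set_intersection(asset['use'])
                                      .count()
                                      .ge(2))
                      .zip()
                      .run(conn))

for doc in agents_with_orgs:
    print(doc['name'], doc.get('org'))

The shape of the results is the same as in the JavaScript examples, so the same zip and without considerations about clashing id fields apply.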
Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose",Learn about JOINs in the RethinkDB document database.,RethinkDB Joinery,Live,126 322,"The couchdb package is a Meteor package available onAtmosphere. The package is a full stack databasedriver that provides functionality to work with Apache CouchDB in Meteor.* an efficient Livequery implementation providing real-timeupdates from the database by consuming the CouchDB _changes feed* Distributed Data Protocol (DDP) RPC end-points for updating the data from clients connected over the wire* Serialization and deserialization of updates to the DDP formatThis Readme covers the followingAdd this package to your Meteor app:meteor add cloudant:couchdbSince Apache CouchDB is not shipped with Meteor or this package, you need to have a running CouchDB/Cloudant server and a url to connect to it.Note: The JSON query syntax used is 'Cloudant Query', initially developed by Cloudant and contributed back to Apache CouchDB version 2.0. Pre-built binaries of Apache CouchDB 2.0 are not yet available, so the easiest way to use this module is with Cloudant DBaas or LocalTo configure the Apache CouchDB/Cloudant server connection information, pass its url as the COUCHDB_URLenvironment variable to the Meteor server process.$export COUCHDB_URL=https://username:password@username.cloudant.comJust like Mongo.Collection, you will work with CouchDB.Database for CouchDB data.You can instantiate a CouchDB.Database on both client and on the server.var Tasks = new CouchDB.Database(""tasks"");The database wraps the Cloudant Query commands. If a callback is passed then the commands execute asynchronously. If no callback is passed, on the server, the call is executed synchronously (technically this uses fibers and only appears to be synchronous, so it does not block the event-loop). If you're on the client and don't pass a callback, the call executes asynchronously and you won't be notified of the result.One can publish a cursor on the server and the client subscribe to it.if (Meteor.isServer) {// This code only runs on the serverMeteor.publish(""tasks"", function () {return Tasks.find();if (Meteor.isClient) {// This code only runs on the clientMeteor.subscribe(""tasks"");This way data will be automatically synchronized to all subscribed clients.Latency compensation works with all supported commands used either at the client or client's simulations.Once you remove the insecure package, you can allow/deny database modifications from the client//make sure no extra properties besides postContent are included in the insert operationTasks.allow({insert: function (userId, doc) {return _.without(_.keys(doc), 'postContent').length === 0;Apache CouchDB stores data in Databases. To get started, declare a database with new CouchDB.Database.new CouchDB.Database(name, [options])Constructor for a DatabaseArgumentsname StringThe name of the database. If null, creates an unmanaged (unsynchronized) local database.Optionsconnection ObjectThe server connection that will manage this database. Uses the default connection if not specified. Pass the return value of calling DDP.connect to specify a different server. Pass null to specify no connection. 
Unmanaged (name is null) databases cannot specify a connection.idGeneration StringThe method of generating the _id fields of new documents in this database. Possible values:'STRING': random stringsThe default id generation technique is 'STRING'.Calling this function sets up a database (a storage space for records, or ""documents"") that can be used to store a particular type of information that matters to your application. Each document is a JSON object. It includes an _id property whose value is unique in the database, which Meteor will set when you first create the document.// common code on client and server declares a DDP-managed couchdb// database.Chatrooms = new CouchDB.Database(""chatrooms"");Messages = new CouchDB.Database(""messages"");The function returns an object with methods to insert documents in the database, update their properties, and remove them, and to find the documents in the database that match arbitrary criteria. The way these methods work is compatible with the popular CouchDB JSON Query syntax. The same database API works on both the client and the server (see below).// return array of my messagesvar myMessages = Messages.find({userId: Session.get('myUserId')}).fetch();// create a new messagevar id = Messages.insert({text: ""Hello, world!""});// mark my first message as ""important""Messages.update({_id: id, text: 'Hello, world!', important: true });If you pass a name when you create the database, then you are declaring a persistent database — one that is stored on the server and seen by all users. Client code and server code can both access the same database using the same API.Specifically, when you pass a name, here's what happens:On the server (if you do not specify a connection), a database with that name is created on the backend CouchDB server. When you call methods on that database on the server, they translate directly into normal CouchDB operations (after checking that they match your access control rules).On the client (and on the server if you specify a connection), Meteor's Minimongo is reused i.e. Minimongo instance is created. Queries (find) on these databases are served directly out of this cache, without talking to the server.When you write to the database on the client (insert, update, remove), the command is executed locally immediately, and, simultaneously, it's sent to the server and executed there too. This happens via stubs, because writes are implemented as methods.When, on the server, you write to a database which has a specified connection to another server, it sends the corresponding method to the other server and receives the changed values back from it over DDP. Unlike on the client, it does not execute the write locally first.If you pass null as the name, then you're creating a local database. It's not synchronized anywhere; it's just a local scratchpad that supports find, insert, update, and remove operations. (On both the client and the server, this scratchpad is implemented using Minimongo.)Find the documents in a database that match the selector.Argumentsselector : Selector specifier, or StringA query describing the documents to find.optionssort Sort specifierSort orderskip NumberNumber of results to skip at the beginninglimit NumberMaximum number of results to returnfields : Field specifierfields to returnfind returns a cursor. It does not immediately access the database or return documents. 
Cursors provide fetch to return all matching documents, map and forEach to iterate over all matching documents, and observe and observeChanges to register callbacks when the set of matching documents changes.Cursors are not query snapshots. Cursors are a reactive data source. Any change to the database that changes the documents in a cursor will trigger a recomputation.Finds the first document that matches the selector, as ordered by sort and skip options.Argumentsselector : Selector specifier, or StringA query describing the documents to find.optionssort Sort SpecifierSort orderskip NumberNumber of results to skip at the beginninglimit NumberMaximum number of results to returnfields : Field specifierfields to returnInsert a document in the database. Returns its unique _id.Argumentsdoc ObjectThe document to insert. May not yet have an _id attribute, in which case Meteor will generate one for you.callback FunctionOptional. If present, called with an error object as the first argument and, if no error, the _id as the second.Add a document to the database. A document is just an object, and its fields can contain any combination of compatible datatypes (arrays, objects, numbers, strings, null, true, and false).insert will generate a unique ID for the object you pass, insert it in the database, and return the ID.Replace a document in the database. Returns 1 if document updated, 0 if not.Argumentsdoc JSON document with _id fieldthe _id field in this doc specifies which document in the database is to be replaced by this document's content.optionsupsert BooleanTrue to insert a document if no matching document is found.callback FunctionOptional. If present, called with an error object as the first argument and, if no error, returns 1 as the second.Replace a document that matches the _id field. This is done on the Apache CouchDB Server via a updateHandler ignoring the _rev field (Hence behaviour is same as last-writer-wins)Returns 1 from the update call if successful and you don't pass a callback.You can use update to perform a upsert by setting the upsert option to true. You can also use the upsert method to perform an upsert that returns the _id of the document that was inserted (if there was one)Replace a document in the database, or insert one if no matching document were found. Returns an object with keys numberAffected (1 if successful, otherwise 0) and insertedId (the unique _id of the document that was inserted, if any).Argumentsdoc JSON document with _id fieldthe _id field in this doc specifies which document in the database is to be replaced by this document's content if exists. If doesnt exist document is insertedcallback FunctionOptional. If present, called with an error object as the first argument and, if no error, returns 1 as the second.Replace a document that matches the _id of the document, or insert a document if no document matched the _id. This is done on the Apache CouchDB Server via a updateHandler ignoring the _rev field (hence behaviour is same as last-writer-wins). upsert is the same as calling update with the upsert option set to true, except that the return value of upsert is an object that contain the keys numberAffected and insertedId. (update returns only 1 if successful or 0 if not)Remove a document from the database.Argumentsid_id value of the document to be removedcallback FunctionOptional. 
If present, called with an error object as the first argument and, if no error, returns 1 as the second.Delete the document whose _id matches the specified value them from the database. This is done on the Apache CouchDB Server via a updateHandler ignoring the _rev field1 will be returned when successful otherwise 0, if you don't pass a callback.Allow users to write directly to this database from client code, subject to limitations you define.optionsinsert, update, remove FunctionFunctions that look at a proposed modification to the database and return true if it should be allowed.fetch Array of StringsOptional performance enhancement. Limits the fields that will be fetched from the database for inspection by your update and remove functions.When a client calls insert, update, or remove on a database, the database's allow and deny callbacks are called on the server to determine if the write should be allowed. If at least one allow callback allows the write, and no deny callbacks deny the write, then the write is allowed to proceed.These checks are run only when a client tries to write to the database directly, for example by calling update from inside an event handler. Server code is trusted and isn't subject to allow and deny restrictions. That includes methods that are called with Meteor.call — they are expected to do their own access checking rather than relying on allow and deny.You can call allow as many times as you like, and each call can include any combination of insert, update, and remove functions. The functions should return true if they think the operation should be allowed. Otherwise they should return false, or nothing at all (undefined). In that case Meteor will continue searching through any other allow rules on the database.The available callbacks are:* insert(userId, doc)The user userId wants to insert the document doc into the database. Return true if this should be allowed. doc will contain the _id field if one was explicitly set by the client. You can use this to prevent users from specifying arbitrary _id fields.* update(userId, doc, modifiedDoc) The user userId wants to update a document doc. (doc is the current version of the document from the database, without the proposed update.) Return true to permit the change. modifiedDoc is the doc submitted by the user.* remove(userId, doc) The user userId wants to remove doc from the database. Return true to permit this.When calling update or remove Meteor will by default fetch the entire document doc from the database. If you have large documents you may wish to fetch only the fields that are actually used by your functions. Accomplish this by setting fetch to an array of field names to retrieve.If you never set up any allow rules on a database then all client writes to the database will be denied, and it will only be possible to write to the database from server-side code. In this case you will have to create a method for each possible write that clients are allowed to do. 
You'll then call these methods with Meteor.call rather than having the clients call insert, update, and remove directly on the database.Override allow rules.optionsinsert, update, remove FunctionFunctions that look at a proposed modification to the database and return true if it should be denied, even if an allow rule says otherwise.This works just like allow, except it lets you make sure that certain writes are definitely denied, even if there is an allow rule that says that they should be permitted.When a client tries to write to a database, the Meteor server first checks the database's deny rules. If none of them return true then it checks the database's allow rules. Meteor allows the write only if no deny rules return true and at least one allow rule returns true.To create a cursor, use database.find. To access the documents in a cursor, use forEach, map, or fetch.Call callback once for each matching document, sequentially and synchronously.Argumentscallback FunctionFunction to call. It will be called with three arguments: the document, a 0-based index, and cursor itself.thisArg AnyAn object which will be the value of this inside callback.When called from a reactive computation, forEach registers dependencies on the matching documents.Map callback over all matching documents. Returns an Array.Argumentscallback FunctionFunction to call. It will be called with three arguments: the document, a 0-based index, and cursor itself.thisArg AnyAn object which will be the value of this inside callback.When called from a reactive computation, map registers dependencies on the matching documents.On the server, if callback yields, other calls to callback may occur while the first call is waiting. If strict sequential execution is necessary, use forEach instead.Return all matching documents as an Array.When called from a reactive computation, fetch registers dependencies on the matching documents.Returns the number of documents that match a query.Unlike the other functions, count registers a dependency only on the number of matching documents. (Updates that just change or reorder the documents in the result set will not trigger a recomputation.)Watch a query. Receive callbacks as the result set changes.Argumentscallbacks ObjectFunctions to call to deliver the result set as it changesThis follow same behaviour of mongo-livedata driverWatch a query. Receive callbacks as the result set changes. Only the differences between the old and new documents are passed to the callbacks.Argumentscallbacks ObjectFunctions to call to deliver the result set as it changesThis follow same behaviour of mongo-livedata driverThe simplest selectors are just a string. These selectors match the document with that value in its _id field.A slightly more complex form of selector is an object containing a set of keys that must match in a document:// Matches all documents where the name and cognomen are as given{name: ""Rhialto"", cognomen: ""the Marvelous""}// Matches every documentBut they can also contain more complicated tests:// Matches documents where age is greater than 18{age: {$gt: 18}}Sorts maybe specified using the Cloudant sort syntax//Example[{""Actor_name"": ""asc""}, {""Movie_runtime"": ""desc""}]JSON array following the field syntax, described below. This parameter lets you specify which fields of an object should be returned. 
If it is omitted, the entire object is returned.// Example include only Actor_name, Movie_year and _id[""Actor_name"", ""Movie_year"", ""_id""]",Meteor database driver for CouchDB and Cloudant,cloudant/meteor-couchdb,Live,127 325,"Inside every Cloudant account is a world-class search engine based on Apache Lucene™. We've recently added some powerful features to search, and -- since we pride ourselves on making the difficult seem easy -- we've made it simple to use.In this post, I'll take you through a demo of Cloudant's new faceted search capabilities. You don't have to be a search expert. We'll take this step-by-step.Create a free account and have fun with Cloudant faceted searchAt the most basic level, a text search engine does two things:Finds results (docs, Web pages, emails, etc.) that contain the searched-for termDisplays those results in order of relevanceAs my high-school physics teacher liked to say, ""You don't need to know how to build a telephone to use a telephone."" How engines find and rank these results is outside the scope of this post. If you're interested in the ""building the telephone"" details you can find plenty of references [1], [2], [3], [4].Before setting up any search indexes, we need a rich dataset with a large number of documents similar in format but different in content to test. This allows us to take advantage of both text search and Cloudant's new faceting functionality.Cloudant has a number of open data sets in the Cloudant/examples directory including the public dataset on government lobbyists. There are a number of interesting fields worth searching. This database is world-readable, so you too can replicate it into your account to try it out.I replicated this database to my account by adding the following JSON doc to my _replicator DB (simplified instructions follow; if this isn't immediately obvious, don't worry):Note: we have to set the use_checkpoints field to false for this replication to work.Don't have a _replicator database? The first step is to create one via the Cloudant dashboard:_replicator is a special database that contains your replication jobs. Now that you have one in place, pull up the command line terminal. We'll use curl to send commands to Cloudant. The command below will POST the necessary JSON to _replicator (substitute your account permissions where appropriate):$ curl -X POST 'https://Now we have the data set in our own Cloudant database -- a large number of docs of varying size and content, all of which contain information about registered lobbyists in D.C. and their activities.Time to get my first index cooking!Let's start out very simply and index the ""Type"" field of each document. Before writing our index function, let's look at the structure of an example JSON document in the database to know what we're working with. (Keep in mind that each document has its own self-defining structure, which can vary from doc to doc.)Now, look at a couple references on Cloudant Search and defining indexes in design docs. You'll see that setting up an index like this is fairly simple, and simpler still using the indexing functionality in the new dash. We can create a new search index called ""type"" in the lobbyist DB using this JavaScript function:Functions that define indexes need to be stored in special JSON documents called design documents. Start by going into your lobbyists database via the dashboard and creating a new search index. Here, you can choose a design document to save the index function to, or create a new one. 
You'll want to do the latter and name it ""SearchTest"". Name the index ""type"" and enter the function provided above. Here's what yours should look like after you create the index and go back in to edit it:We’ve chosen the ""Standard"" analyzer from the dropdown list. Use it for this tutorial, and visit our For Developers site for a list of other generic and language-specific analyzers included in Cloudant Search, if you're curious. So, after hitting save we have a design doc that looks like this:Note: Cloudant will generate this JSON for you. You won't need to copy, paste, and modify it. To find it, navigate to ""All design docs"" in your Cloudant dashboard and edit the SearchTest design doc to see the code.Like all documents in Cloudant, this one is JSON. Don't worry about the various metadata pieces, just focus on the index definition. Here it is again, with the newline characters (\n) parsed for readability:The index definition is central to getting search working in Cloudant, so let's take it piece-by-piece.First, we declare a function that takes a single argument: a JSON document. We set up a simple if statement to make sure that only docs with the “Type” field actually get sent to the indexer.Then we call index(""type"", doc.Type), which takes at least two arguments:""indexName"", which is the index name you'll have to specify in your search query parameters in Cloudant. Here, we pass in the string ""type"".The second argument, which is the part of a JSON doc that you want to index. Generically, you can think of it as doc.key. In the case of our index definition, doc.key is the ""Type"" field in our JSON documents, hence the argument doc.Type. IMPORTANT! Only strings and numbers can be indexed in Cloudant full-text search indexes. Nested fields, objects, etc. cannot be indexed as full text; however, secondary database indexes in Cloudant can handle these structures.Save the new design doc, and Cloudant goes to work indexing every JSON document in the database. After a brief time, the index is ready for querying. (Give it a few minutes for the 1.2 GB in our example data set.) Let's search for all ""THIRD QUARTER"" reports, which looks like:Note: You can also POST your query in JSON form to the _search API endpoint using the following syntax:$ curl -u ""The response yields the total number of rows and the document IDs of the first 25 results. (The ""bookmark"" field is a value your application can use to paginate the rest of the results.) Here's some of the output:{ ""rows"": [ { ""fields"": {}, ""order"": [ 0.19044028222560883, 0 ], ""id"": ""315e4a1d10a025b62de23cd7c725bca4"" }, { ""fields"": {}, ""order"": [ 0.19044028222560883, 1 ], ""id"": ""315e4a1d10a025b62de23cd7c7249988"" }, ... { ""fields"": {}, ""order"": [ 0.19044028222560883, 55 ], ""id"": ""315e4a1d10a025b62de23cd7c74d535a"" } ], ""bookmark"": ""g2wAAAABaANkACFkYmNvcmVAZGI2Lm1vb25zaGluZS5jbG91ZGFudC5uZXRsAAAAAmEAYj____9qaAJGP8hgWOAAAABhN2o"", ""total_rows"": 83146 }And huzzah! We've done our first search query in Cloudant!Ok, now let's dig into something a little more interesting.Many of the documents in the lobbyist DB have a field called ""Amount"", indicating the amount of money the lobbyist(s) in question spent on their efforts over some time period. This is certainly more interesting than the third quarter reports we queried for earlier! 
And with Cloudant's new range facets (released to multitenant customers on April 10), we can easily find the number of transactions of a certain size.Note: Range facets can only be used on numbers, and count facets can only be used on strings. Remember this, and you'll be fine.The index function needed for this is slightly more complicated than the one above, but not much! Again, fire up your Cloudant dashboard and create a new search index as follows (I'm going to save mine as ""amountSearch""):Make sure to associate the new index with your SearchTest design doc. Here's what that looks like in the Cloudant dashboard:The new design doc for SearchTest now looks like this:The initial index we created called ""type"" is still there, but now there is a new index function called ""amountSearch"". I converted doc.Amount to an integer to ensure that I can use Cloudant's range faceting functionality (in case some documents store the value associated with the ""Amount"" field as a string and others as an integer, as is the case in our lobbyists database). Finally, I added a third argument to the index function, {""facet"":true}, to enable faceting.Now let's see who is paying whom, and how much. We can query the amounts similarly to querying the types of reports stored in the database:This query will return the first 25 docs in which the ""Amount"" field is exactly $500,000. I specified the field name in our query parameters (?q=amount:500000) because our index function explicitly names this field, passed into it as its first argument.Now let's get fancy and try out range facets. Say we want to split all records into ""cheap"" lobbyists (less than $25,000) and ""expensive"" lobbyists (greater than $25,000.) This can be done using the range query parameter:Note: Be aware that the URL below must be properly escaped before it will work via curl. Empty spaces should be replaced by ""%20"". You can tell curl to parse square brackets and curly brackets by disabling globbing via the ""-g"" flag. (See this handy post from our friend Glynn Bird (@glynn_bird) for more.) OK. We'll write this one for you, minus your credentials, of course ;-)$ curl -g -u ""Back to our example, the first query q=\*:* simply ensures that every doc in the DB is returned. Note that the range definitions use inclusive/exclusive syntax to define the range boundaries. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.The response includes IDs for the first 25 hits plus the following output:Now we can see a fairly even split between cheap and expensive lobbyists.Note: The results of your output will vary based on how quickly you're progressing through these examples. It normally takes a few minutes to replicate the 1.2 GB lobbyists database and then to build the search indexes. Your query could take a couple minutes to return if Cloudant is still building an index, unless you append stale=ok to your query parameters. The stale=ok parameter indicates that your application would rather have low-latency responses than a completely up-to-date index. Here's how that would look:curl -g -u ""Before moving on, I should point out something you may have missed upon first reading: We didn't have to declare these ranges at index time. We didn't have to declare anything at index time other than the intention to eventually use range facets in our queries (by passing the {""facet"":true} argument into our index function.) That's it! 
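For a version you can run without worrying about shell escaping, here is a rough Python sketch of the same range-facet query using the requests library; the account name and credentials are placeholders, while the database, design doc and index names come from earlier in the walkthrough, so treat it as an illustration rather than a definitive recipe.

# Hypothetical sketch of the range-facet query with Python requests;
# ACCOUNT and the password are placeholders - substitute your own.
import json
import requests

ACCOUNT = 'yourusername'
AUTH = (ACCOUNT, 'yourpassword')

url = 'https://%s.cloudant.com/lobbyists/_design/SearchTest/_search/amountSearch' % ACCOUNT
params = {
    'q': '*:*',  # match every document, as in the example above
    'ranges': json.dumps({
        'amount': {
            'cheap': '[0 TO 25000]',            # square brackets: inclusive
            'expensive': '{25000 TO Infinity}'  # curly brackets: exclusive
        }
    }),
    'stale': 'ok',  # prefer a fast response over a fully up-to-date index
}

resp = requests.get(url, params=params, auth=AUTH)
result = resp.json()
# The facet totals should come back alongside the usual rows/bookmark fields
print(result.get('total_rows'), result.get('ranges'))

Letting requests handle the URL encoding also sidesteps the globbing and %20 escaping issues mentioned in the note above.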
That's all we have to do to enable this powerful query-time faceting functionality.Count facets allow us to quickly count by category (think author counts for a bookstore). Let's take a quick look!What really interested me in this database was the ""paper trail"" of the government agencies that lobbyists have been visiting. Unfortunately, these agencies are expressed in a nested structure within each document, so the index() function will be a little more complicated. Certified Cloudant Sherpa Max Thayer (@garbados) helped me out with the JavaScript in this example, so shout-out to Max (thank you!):This function will loop over all entities in a ""GovernmentEntities"" object and index them. We can form a search query to find all the government entities visited and how many times they were visited:This search query will yield a long list of agencies and the corresponding number of visits. Here's what a portion of this list looks like:{ ""counts"": { ""entity"": { ... ""Federal Aviation Administration (FAA)"": 7172, ""Federal Bureau of Investigation (FBI)"": 1037, ""Federal Communications Commission (FCC)"": 12470, ""Federal Deposit Insurance Commission (FDIC)"": 2164, ""Federal Election Commission (FEC)"": 305, ""Federal Emergency Management Agency (FEMA)"": 3962, ""Federal Energy Regulatory Commission (FERC)"": 3993, ""Federal Highway Administration (FHA)"": 2647, ""Federal Housing Finance Board (FHFB)"": 924, ""Federal Labor Relations Authority (FLRA)"": 55, ""Federal Law Enforcement Training Center"": 9, ""Federal Management Service"": 9, ""Federal Maritime Commission"": 640, ""Federal Mediation & Conciliation Service"": 16, ""Federal Mine Safety Health Review Commission (FMSH"": 4, ""Federal Motor Carrier Safety Administration"": 459, ""Federal Railroad Administration"": 1546, ""Federal Reserve System"": 3603, ""Federal Retirement Thrift Investment Board"": 46, ""Federal Trade Commission (FTC)"": 6169, ""Federal Transit Administration (FTA)"": 2556, ""Financial Crimes Enforcement Network (FinCEN)"": 70, ""Financial Management Service (FMS)"": 35, ""Food & Drug Administration (FDA)"": 10300, ... } }, ... }And we see that the FCC is fairly popular compared to, say, the Federal Election Commission. The task of calculating how much money was spent on each entity is left as an exercise for the reader.With very few deviations (mostly trips into the URL encoding quagmire), this post tracks exactly how I first started learning about search and search facets with Cloudant. Hopefully it gives you the tools you need to make use of this feature.Speaking of tools, if you use curl regularly and you'd rather not enter your Cloudant username and password for every HTTP request, consider configuring acurl, a tool that many Cloudant engineers use. Check out the post ""Authorized curl, a.k.a acurl"" for instructions.For a more fully featured example app that archives and indexes email from an IMAP server and makes it searchable, see this GitHub repo from Cloudant Developer Advocates Benjamin Young (@bigbluehat) and Jason Smith (@_jhs).If you have questions, we at Cloudant are always happy to help. Please ping us on IRC, email support@cloudant.com, or (better yet) use our awesome new support portal in the Cloudant dashboard.","Cloudant Search is based on Apache Lucene which allows facets of your data to be aggregated and counted during the search process. 
Facets allow your customers to drill-down into the search results, filtering in an powerful and intuitive way.",Search Faceting from Scratch [Tutorial],Live,128 329,"DATALAYER: GRAPHQL - TRANSLATING BACKEND DATA TO FRONTEND NEEDS Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 28, 2016Engineers working on backend data services are often focused on operational concerns like data consistency, reliability, uptime, and storage efficiency. Because each situation calls for a specific set of tradeoffs, a single organization can end up with a diverse set of backend databases and services. For the people building the UI and frontend API layers, this diversity can quickly become an issue, especially if the same client needs to call into multiple backends or fetch related objects across different data sources. GraphQL is a language-agnostic API gateway technology designed precisely to solve this mismatch between backend and frontend requirements. It provides a highly structured, yet flexible API layer that lets the client specify all of its data requirements in one GraphQL query, without needing to know about the backend services being accessed. Better yet, because of the structured, strongly typed nature of both GraphQL queries and APIs, it's possible to quickly get critical information, such as which objects and fields are accessed by which frontends, which clients will be affected by specific changes to the backend, and more. In this talk, Sasko Stubailo of Meteor explains what GraphQL is, what data management problems it can solve in an organization, and how you can try it today. Sashko Stubailo is passionate about building technologies that help developers build great apps. Sashko graduated with a CS degree from MIT in 2014 and has worked on a declarative reactive charting library at Palantir, an interactive i18n middleware for Rails at Panjiva, front end technology and build tooling in the Meteor framework, and is now leading the new Apollo project to build a next-generation GraphQL data platform. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article? Head over to Thom Crowe’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2016 Compose","Sasko Stubailo of Meteor explains what GraphQL is, what data management problems it can solve in an organization, and how you can try it today.",DataLayer Conference: Translating Backend Data to Frontend Needs,Live,129 331,"Skip to contentData, what now? Making sense of all that mess. 
MAIN NAVIGATION Menu * Blog * About Me FEATURE IMPORTANCE AND WHY IT’S IMPORTANT Vinko Kodžoman April 20, 2017 April 20, 2017I have been doing Kaggle’s Quora Question Pairs competition for about a month now, and by reading the discussions on the forums, I’ve noticed a recurring topic that I’d like to address. People seem to be struggling with getting the performance of their models past a certain point. The usual approach is to use XGBoost, ensembles and stacking. While those can generally give good results, I’d like to talk about why it is still important to do feature importance analysis. DATA EXPLORATION As an example, I will be using the Quora Question Pairs dataset . The dataset has 404,290 pairs of questions, and 37% of them are semantically the same (“duplicates”). The goal is to find out which ones. Initial steps; loading the dataset and data exploration: # Load the dataset train = pd.read_csv('train.csv', dtype={'question1': str, 'question2': str}) print('Training dataset row number:', len(train)) # 404290 print('Duplicate question pairs ratio: %.2f' % train.is_duplicate.mean()) # 0.37 Examples of duplicate and non-duplicate question pairs are shown below. question1 question2 is_duplicate What is the step by step guide to invest in share market in india? What is the step by step guide to invest in share market? 0 How can I be a good geologist? What should I do to be a great geologist? 1 How can I increase the speed of my internet connection while using a VPN? How can Internet speed be increased by hacking through DNS? 0 How do I read and find my YouTube comments? How do I read and find my YouTube comments? 1This is the word cloud inspired by a Kaggle kernel for data exploration . The cloud shows which words are popular (most frequent). The word cloud is created from words used in both questions. As you can see, the prevalent words are ones you would expect to find in a question (e.g. “best way”, “lose weight”, “difference”, “make money”, etc.) We now have some idea about what our dataset looks like. FEATURE ENGINEERING I created 24 features, some of which are shown below. All code is written in python using the standard machine learning libraries (pandas, sklearn, numpy). You can get the full code from my github notebook . Examples of some features: * q1_word_num – number of words in question1 * q2_length – number of characters in question2 * word_share – ratio of shared words between the questions * same_first_word – 1 if both questions share the same first word, else 0 def word_share(row): q1_words = set(word_tokenize(row['question1'])) q2_words = set(word_tokenize(row['question2'])) return len(q1_words.intersection(q2_words)) / (len(q1_words.union(q2_words))) def same_first_word(row): q1_words = word_tokenize(row['question1']) q2_words = word_tokenize(row['question2']) return float(q1_words[0].lower() == q2_words[0].lower()) # A sample of the features train['word_share'] = train.apply(word_share, axis=1) train['q1_word_num'] = train.question1.apply(lambda x: len(word_tokenize(x))) train['q2_word_num'] = train.question2.apply(lambda x: len(word_tokenize(x))) train['word_num_difference'] = abs(train.q1_word_num - train.q2_word_num) train['q1_length'] = train.question1.apply(lambda x: len(x)) train['q2_length'] = train.question2.apply(lambda x: len(x)) train['length_difference'] = abs(train.q1_length - train.q2_length) train['q1_has_fullstop'] = train.question1.apply(lambda x: int('.' in x)) train['q2_has_fullstop'] = train.question2.apply(lambda x: int('.' 
in x)) train['q1_has_math_expression'] = train.question1.apply(lambda x: int('[math]' in x)) train['q2_has_math_expression'] = train.question2.apply(lambda x: int('[math]' in x)) train['same_first_word'] = train.apply(same_first_word, axis=1) BASELINE MODEL PERFORMANCE To get the model performance, we first split the dataset into train and test sets. The test set contains 20% of the total data. To evaluate the model's performance, we use the created test set (X_test and y_test). X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) The model is evaluated with the logloss function. It is the same metric used in the competition. $logloss = -\frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{j=1}^{M} y_{i,j} \log(p_{i,j})$ To test the model with all the features, we use the Random Forest classifier. It is a powerful “out of the box” ensemble classifier. No hyperparameter tuning was done – the hyperparameters can remain fixed because we are testing the model's performance against different feature sets. A simple model gives a logloss score of 0.62923, which would put us at the 1371st place out of a total of 1692 teams at the time of writing this post. Now let's see if doing feature selection could help us lower the logloss. model = RandomForestClassifier(50, n_jobs=8) model.fit(X_train, y_train) predictions_proba = model.predict_proba(X_test) predictions = model.predict(X_test) log_loss_score = log_loss(y_test, predictions_proba) acc = accuracy_score(y_test, predictions) f1 = f1_score(y_test, predictions) print('Log loss: %.5f' % log_loss_score) # 0.62923 print('Acc: %.5f' % acc) # 0.70952 print('F1: %.5f' % f1) # 0.59173 FEATURE IMPORTANCE To get the feature importance scores, we will use an algorithm that does feature selection by default – XGBoost. It is the king of Kaggle competitions. If you are not using a neural net, you probably have one of these somewhere in your pipeline. XGBoost uses gradient boosting to optimize creation of decision trees in the ensemble. Each tree contains nodes, and each node is a single feature. The number of times a feature is used in the nodes of XGBoost's decision trees gives a measure of its effect on the overall performance of the model. model = XGBClassifier(n_estimators=500) model.fit(X, y) feature_importance = model.feature_importances_ plt.figure(figsize=(16, 6)) plt.yscale('log', nonposy='clip') plt.bar(range(len(feature_importance)), feature_importance, align='center') plt.xticks(range(len(feature_importance)), features, rotation='vertical') plt.title('Feature importance') plt.ylabel('Importance') plt.xlabel('Features') plt.show() Looking at the graph below, we see that some features are not used at all, while some (word_share) impact the performance greatly. We can reduce the number of features by taking a subset of the most important features. Using the feature importance scores, we reduce the feature set. The new pruned features contain all features that have an importance score greater than a certain number. In our case, the pruned features contain a minimum importance score of 0.05. 
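One gap worth flagging before the next snippet: extract_pruned_features filters a feature_importances object by a 'weights' column, but the code so far only produces the raw array model.feature_importances_. Presumably the full notebook linked above wraps that array in a DataFrame indexed by feature name; a minimal sketch of that step, reusing the model and features names from the code above, might look like this.

# Minimal sketch (not from the original post): wrap the raw importance array
# in a DataFrame so it can be filtered by a 'weights' threshold below.
import pandas as pd

feature_importances = pd.DataFrame(
    model.feature_importances_,  # importance scores from the fitted XGBClassifier
    index=features,              # the list of engineered feature names used above
    columns=['weights']
).sort_values('weights', ascending=False)

print(feature_importances.head())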
def extract_pruned_features(feature_importances, min_score=0.05): column_slice = feature_importances[feature_importances['weights'] > min_score] return column_slice.index.values pruned_featurse = extract_pruned_features(feature_importances, min_score=0.01) X_train_reduced = X_train[pruned_featurse] X_test_reduced = X_test[pruned_featurse] def fit_and_print_metrics(X_train, y_train, X_test, y_test, model): model.fit(X_train, y_train) predictions_proba = model.predict_proba(X_test) log_loss_score = log_loss(y_test, predictions_proba) print('Log loss: %.5f' % log_loss_score) MODEL PERFORMANCE WITH FEATURE IMPORTANCE ANALYSIS As a result of using the pruned features, our previous model – Random Forest – scores better. With little effort, the algorithm gets a lower loss, and it also trains more quickly and uses less memory because the feature set is reduced. model = RandomForestClassifier(50, n_jobs=8) # LogLoss 0.59251 fit_and_print_metrics(X_train_reduced, y_train, X_test_reduced, y_test, model) # LogLoss 0.63376 fit_and_print_metrics(X_train, y_train, X_test, y_test, model) Playing a bit more with feature importance score (plotting the logloss of our classifier for a certain subset of pruned features) we can lower the loss even more. In this particular case, Random Forest actually works best with only one feature! Using only the feature “word_share” gives a logloss of 0.55305. If you are interested to see this step in detail, the full version is in the notebook . CONCLUSION As I have shown, utilising feature importance analysis has a potential to increase the model’s performance. While some models like XGBoost do feature selection for us, it is still important to be able to know the impact of a certain feature on the model’s performance because it gives you more control over the task you are trying to accomplish. The “no free lunch” theorem (there is no solution which is best for all problems) tells us that even though XGBoost usually outperforms other models, it is up to us to discern whether it is really the best solution. Using XGBoost to get a subset of important features allows us to increase the performance of models without feature selection by giving that feature subset to them. Using feature selection based on feature importance can greatly increase the performance of your models. Categories Data Science , Deep Learning , Machine Learning Tags feature engineering , feature importance , features , machine learning , pythonLEAVE A REPLY CANCEL REPLY Your email address will not be published. Required fields are marked * Comment Name * Email * Website Notify me of follow-up comments by email. Notify me of new posts by email. 
PRIMARY SIDEBAR Toggle Sidebar Search for:NEWSLETTER RECENT POSTS * Feature importance and why it’s important ARCHIVES * April 2017 Weenkus (Vinko Kodžoman) Vinko Kodžoman Weenkus Zagreb, Croatia vinko.kodzoman@yahoo.com Joined on Oct 07, 2014 8 Followers 19 Following 26 Public Repositories ansiweather Blog book_problems cards cats_vs_dogs_redux_kaggle Competition Deep-Learning-University-of-Zagreb digit_recognizer_kaggle FM-index GTEngine hello_app identicon_generator InverseMatrixCaching LabDump leaf_classification_kaggle LearnOpenGL_tutorial Machine-Learning-University-of-Washington Machine-Learning-University-of-Zagreb My-personal-webpage One-Hump-Iterator-Visualization on_power_efficient_virtual_network_function_placement_algorithm Reference-Genome-Index Rentals Search-Engine Sexual-Predator-Classification-Using-Ensemble-Classifiers toy_app 0 Public GistsData, what now? © 2017 . All Rights Reserved",Feature importance in machine learning using examples in Python with xgboost. Getting better performance from a model with feature pruning.,Feature importance and why it's important,Live,130 332,"Toggle navigation * * About * * Archives * * PRACTICAL BUSINESS PYTHON Taking care of business, one python script at a time Sun 26 October 2014SIMPLE GRAPHING WITH IPYTHON AND PANDAS Posted by Chris Moffitt in articles INTRODUCTION This article is a follow on to my previous article on analyzing data with python. I am going to build on my basic intro of IPython , notebooks and pandas to show how to visualize the data you have processed with these tools. I hope that this will demonstrate to you (once again) how powerful these tools are and how much you can get done with such little code. I ultimately hope these articles will help people stop reaching for Excel every time they need to slice and dice some files. The tools in the python environment can be so much more powerful than the manual copying and pasting most people do in excel. I will walk through how to start doing some simple graphing and plotting of data in pandas. I am using a new data file that is the same format as my previous article but includes data for only 20 customers. If you would like to follow along, the file is available here . GETTING STARTED As described in the previous article , I’m using an IPython notebook to explore my data. First we are going to import pandas, numpy and matplot lib. I am also showing the pandas version I’m using so you can make sure yours is compatible. importpandasaspdimportnumpyasnpimportmatplotlib.pyplotaspltpd.__version__ '0.14.1' Next, enable IPython to display matplotlib graphs. %matplotlibinline We will read in the file like we did in the previous article but I’m going to tell it to treat the date column as a date field (using parse_dates ) so I can do some re-sampling later. 
sales=pd.read_csv(""sample-salesv2.csv"",parse_dates=['date'])sales.head() account number name sku category quantity unit price ext price date 0 296809 Carroll PLC QN -82852 Belt 13 44.48 578.24 2014-09-27 07:13:03 1 98022 Heidenreich-Bosco MJ -21460 Shoes 19 53.62 1018.78 2014-07-29 02:10:44 2 563905 Kerluke, Reilly and Bechtelar AS -93055 Shirt 12 24.16 289.92 2014-03-01 10:51:24 3 93356 Waters-Walker AS -93055 Shirt 5 82.68 413.40 2013-11-17 20:41:11 4 659366 Waelchi-Fahey AS -93055 Shirt 18 99.64 1793.52 2014-01-03 08:14:27Now that we have read in the data, we can do some quick analysis sales.describe() account number quantity unit price ext price count 1000.000000 1000.000000 1000.000000 1000.00000 mean 535208.897000 10.328000 56.179630 579.84390 std 277589.746014 5.687597 25.331939 435.30381 min 93356.000000 1.000000 10.060000 10.38000 25% 299771.000000 5.750000 35.995000 232.60500 50% 563905.000000 10.000000 56.765000 471.72000 75% 750461.000000 15.000000 76.802500 878.13750 max 995267.000000 20.000000 99.970000 1994.80000We can actually learn some pretty helpful info from this simple command: * We can tell that customers on average purchases 10.3 items per transaction * The average cost of the transaction was $579.84 * It is also easy to see the min and max so you understand the range of the data If we want we can look at a single column as well: sales['unit price'].describe() count 1000.000000 mean 56.179630 std 25.331939 min 10.060000 25% 35.995000 50% 56.765000 75% 76.802500 max 99.970000 dtype: float64 I can see that my average price is $56.18 but it ranges from $10.06 to $99.97. I am showing the output of dtypes so that you can see that the date column is a datetime field. I also scan this to make sure that any columns that have numbers are floats or ints so that I can do additional analysis in the future. sales.dtypes account number int64 name object sku object category object quantity int64 unit price float64 ext price float64 date datetime64[ns] dtype: object PLOTTING SOME DATA We have our data read in and have completed some basic analysis. Let’s start plotting it. First remove some columns to make additional analysis easier. customers=sales[['name','ext price','date']]customers.head() name ext price date 0 Carroll PLC 578.24 2014-09-27 07:13:03 1 Heidenreich-Bosco 1018.78 2014-07-29 02:10:44 2 Kerluke, Reilly and Bechtelar 289.92 2014-03-01 10:51:24 3 Waters-Walker 413.40 2013-11-17 20:41:11 4 Waelchi-Fahey 1793.52 2014-01-03 08:14:27This representation has multiple lines for each customer. In order to understand purchasing patterns, let’s group all the customers by name. We can also look at the number of entries per customer to get an idea for the distribution. customer_group=customers.groupby('name')customer_group.size() name Berge LLC 52 Carroll PLC 57 Cole-Eichmann 51 Davis, Kshlerin and Reilly 41 Ernser, Cruickshank and Lind 47 Gorczany-Hahn 42 Hamill-Hackett 44 Hegmann and Sons 58 Heidenreich-Bosco 40 Huel-Haag 43 Kerluke, Reilly and Bechtelar 52 Kihn, McClure and Denesik 58 Kilback-Gerlach 45 Koelpin PLC 53 Kunze Inc 54 Kuphal, Zieme and Kub 52 Senger, Upton and Breitenberg 59 Volkman, Goyette and Lemke 48 Waelchi-Fahey 54 Waters-Walker 50 dtype: int64 Now that our data is in a simple format to manipulate, let’s determine how much each customer purchased during our time frame. The sum function allows us to quickly sum up all the values by customer. We can also sort the data using the sort command. 
sales_totals=customer_group.sum()sales_totals.sort(columns='ext price').head() ext price name Davis, Kshlerin and Reilly 19054.76 Huel-Haag 21087.88 Gorczany-Hahn 22207.90 Hamill-Hackett 23433.78 Heidenreich-Bosco 25428.29Now that we know what the data look like, it is very simple to create a quick bar chart plot. Using the IPython notebook, the graph will automatically display. my_plot=sales_totals.plot(kind='bar') Unfortunately this chart is a little ugly. With a few tweaks we can make it a little more impactful. Let’s try: * sorting the data in descending order * removing the legend * adding a title * labeling the axes my_plot=sales_totals.sort(columns='ext price',ascending=False).plot(kind='bar',legend=None,title=""Total Sales by Customer"")my_plot.set_xlabel(""Customers"")my_plot.set_ylabel(""Sales ($)"") This actually tells us a little about our biggest customers and how much difference there is between their sales and our smallest customers. Now, let’s try to see how the sales break down by category. customers=sales[['name','category','ext price','date']]customers.head() name category ext price date 0 Carroll PLC Belt 578.24 2014-09-27 07:13:03 1 Heidenreich-Bosco Shoes 1018.78 2014-07-29 02:10:44 2 Kerluke, Reilly and Bechtelar Shirt 289.92 2014-03-01 10:51:24 3 Waters-Walker Shirt 413.40 2013-11-17 20:41:11 4 Waelchi-Fahey Shirt 1793.52 2014-01-03 08:14:27We can use groupby to organize the data by category and name. category_group=customers.groupby(['name','category']).sum()category_group.head() ext price name category Berge LLC Belt 6033.53 Shirt 9670.24 Shoes 14361.10 Carroll PLC Belt 9359.26 Shirt 13717.61The category representation looks good but we need to break it apart to graph it as a stacked bar graph. unstack can do this for us. category_group.unstack().head() ext price category Belt Shirt Shoes name Berge LLC 6033.53 9670.24 14361.10 Carroll PLC 9359.26 13717.61 12857.44 Cole-Eichmann 8112.70 14528.01 7794.71 Davis, Kshlerin and Reilly 1604.13 7533.03 9917.60 Ernser, Cruickshank and Lind 5894.38 16944.19 5250.45Now plot it. my_plot=category_group.unstack().plot(kind='bar',stacked=True,title=""Total Sales by Customer"")my_plot.set_xlabel(""Customers"")my_plot.set_ylabel(""Sales"") In order to clean this up a little bit, we can specify the figure size and customize the legend. my_plot=category_group.unstack().plot(kind='bar',stacked=True,title=""Total Sales by Customer"",figsize=(9,7))my_plot.set_xlabel(""Customers"")my_plot.set_ylabel(""Sales"")my_plot.legend([""Total"",""Belts"",""Shirts"",""Shoes""],loc=9,ncol=4) Now that we know who the biggest customers are and how they purchase products, we might want to look at purchase patterns in more detail. Let’s take another look at the data and try to see how large the individual purchases are. A histogram allows us to group purchases together so we can see how big the customer transactions are. purchase_patterns=sales[['ext price','date']]purchase_patterns.head() ext price date 0 578.24 2014-09-27 07:13:03 1 1018.78 2014-07-29 02:10:44 2 289.92 2014-03-01 10:51:24 3 413.40 2013-11-17 20:41:11 4 1793.52 2014-01-03 08:14:27We can create a histogram with 20 bins to show the distribution of purchasing patterns. 
purchase_plot=purchase_patterns['ext price'].hist(bins=20)purchase_plot.set_title(""Purchase Patterns"")purchase_plot.set_xlabel(""Order Amount($)"")purchase_plot.set_ylabel(""Number of orders"") In looking at purchase patterns over time, we can see that most of our transactions are less than $500 and only a very few are about $1500. Another interesting way to look at the data would be by sales over time. A chart might help us understand, “Do we have certain months where we are busier than others?” Let’s get the data down to order size and date. purchase_patterns=sales[['ext price','date']]purchase_patterns.head() ext price date 0 578.24 2014-09-27 07:13:03 1 1018.78 2014-07-29 02:10:44 2 289.92 2014-03-01 10:51:24 3 413.40 2013-11-17 20:41:11 4 1793.52 2014-01-03 08:14:27If we want to analyze the data by date, we need to set the date column as the index using set_index . purchase_patterns=purchase_patterns.set_index('date')purchase_patterns.head() ext price date 2014-09-27 07:13:03 578.24 2014-07-29 02:10:44 1018.78 2014-03-01 10:51:24 289.92 2013-11-17 20:41:11 413.40 2014-01-03 08:14:27 1793.52One of the really cool things that pandas allows us to do is resample the data. If we want to look at the data by month, we can easily resample and sum it all up. You’ll notice I’m using ‘M’ as the period for resampling which means the data should be resampled on a month boundary. purchase_patterns.resample('M',how=sum) Plotting the data is now very easy purchase_plot=purchase_patterns.resample('M',how=sum).plot(title=""Total Sales by Month"",legend=None) Looking at the chart, we can easily see that December is our peak month and April is the slowest. Let’s say we really like this plot and want to save it somewhere for a presentation. fig=purchase_plot.get_figure()fig.savefig(""total-sales.png"") PULLING IT ALL TOGETHER In my typical workflow, I would follow the process above of using an IPython notebook to play with the data and determine how best to make this process repeatable. If I intend to run this analysis on a periodic basis, I will create a standalone script that will do all this with one command. 
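One more hedged version note: resample('M', how=sum) is the older resample spelling; newer pandas drops the how argument and chains the aggregation instead, roughly:

# rough modern equivalent of purchase_patterns.resample('M', how=sum)
monthly_sales = purchase_patterns.resample('M').sum()
purchase_plot = monthly_sales.plot(title='Total Sales by Month', legend=None)

The consolidated script that follows keeps the article's original calls so that it matches the output shown above.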
Here is an example of pulling all this together into a single file:

# Standard import for pandas, numpy and matplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read in the csv file and display some of the basic info
sales = pd.read_csv("sample-salesv2.csv", parse_dates=['date'])
print "Data types in the file:"
print sales.dtypes
print "Summary of the input file:"
print sales.describe()
print "Basic unit price stats:"
print sales['unit price'].describe()

# Filter the columns down to the ones we need to look at for customer sales
customers = sales[['name', 'ext price', 'date']]

# Group the customers by name and sum their sales
customer_group = customers.groupby('name')
sales_totals = customer_group.sum()

# Create a basic bar chart for the sales data and show it
bar_plot = sales_totals.sort(columns='ext price', ascending=False).plot(kind='bar', legend=None, title="Total Sales by Customer")
bar_plot.set_xlabel("Customers")
bar_plot.set_ylabel("Sales ($)")
plt.show()

# Do a similar chart but break down by category in stacked bars
# Select the appropriate columns and group by name and category
customers = sales[['name', 'category', 'ext price', 'date']]
category_group = customers.groupby(['name', 'category']).sum()

# Plot and show the stacked bar chart
stack_bar_plot = category_group.unstack().plot(kind='bar', stacked=True, title="Total Sales by Customer", figsize=(9, 7))
stack_bar_plot.set_xlabel("Customers")
stack_bar_plot.set_ylabel("Sales")
stack_bar_plot.legend(["Total", "Belts", "Shirts", "Shoes"], loc=9, ncol=4)
plt.show()

# Create a simple histogram of purchase volumes
purchase_patterns = sales[['ext price', 'date']]
purchase_plot = purchase_patterns['ext price'].hist(bins=20)
purchase_plot.set_title("Purchase Patterns")
purchase_plot.set_xlabel("Order Amount($)")
purchase_plot.set_ylabel("Number of orders")
plt.show()

# Create a line chart showing purchases by month
purchase_patterns = purchase_patterns.set_index('date')
month_plot = purchase_patterns.resample('M', how=sum).plot(title="Total Sales by Month", legend=None)
fig = month_plot.get_figure()

# Show the image, then save it
plt.show()
fig.savefig("total-sales.png")

The impressive thing about this code is that in 55 lines (including comments), I've created a very powerful yet simple to understand program to repeatedly manipulate the data and create useful output. I hope this is useful. Feel free to provide feedback in the comments and let me know if this is helpful.
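If you run the consolidated script somewhere without a display (a cron job or a remote server, say), the plt.show() calls will block or fail. A common workaround, offered here as a sketch rather than as part of the original article, is to select a non-interactive backend and save each figure to a file instead:

import matplotlib
matplotlib.use('Agg')            # must be set before pyplot is imported
import matplotlib.pyplot as plt

# ...build the plots exactly as in the script above, then instead of plt.show():
# fig = bar_plot.get_figure()
# fig.savefig('total-sales-by-customer.png')   # hypothetical output name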
* ← Simple Interactive Data Analysis with Python * Using Pandas To Create an Excel Diff → Tags pandas csv excel ipython -------------------------------------------------------------------------------- Tweet Vote on Hacker NewsCOMMENTS SOCIAL * Github * Twitter * BitBucket * Reddit * LinkedIn CATEGORIES * articles * news POPULAR * Pandas Pivot Table Explained * Common Excel Tasks Demonstrated in Pandas * Overview of Python Visualization Tools * Web Scraping - It's Your Civic Duty * Simple Graphing with IPython and Pandas TAGS sets pygal csv barnum process s3 matplotlib plotting stdlib oauth2 xlsxwriter pelican jinja python google matplot pandas ipython seaborn notebooks cases xlwings gui excel vcs ggplot beautifulsoup powerpoint bokeh plotly analyze-this pdf github FEEDS * Atom Feed -------------------------------------------------------------------------------- Site built using Pelican • Theme based on VoidyBootstrap by RKI","This article is a follow on to the previous article on analyzing data with python, building on the basic intro of IPython, notebooks and pandas to show how to visualize the data you have processed with these tools.",Simple Graphing with IPython and Pandas,Live,131 334,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science * Machine Learning * Programming * Visualization * Events * Letters * * Contribute * Karlijn Willems Blocked Unblock Follow Following Data Science Journalist @DataCamp Oct 12 -------------------------------------------------------------------------------- COLLECTING DATA SCIENCE CHEAT SHEETS As you might already know, I’ve been making Python and R cheat sheets specifically for those who are just starting out with data science or for those who need an extra help when working on data science problems. Now you can find all of them in one place on the DataCamp Community. You can find all cheat sheets here . To recap, these are the data science cheat sheets that we have already made and shared with the community up until now: Basics * Python Basics Cheat Sheet * Scipy Linear Algebra Cheat Sheet Data Manipulation * NumPy Basics Cheat Sheet * Pandas Basics Cheat Sheet * Pandas Data Wrangling Cheat Sheet * xts Cheat sheet * data.table Cheat Sheet ( updated! ) Machine Learning, Deep Learning, Big Data * Scikit-Learn Cheat Sheet * Keras Cheat Sheet * PySpark RDD Cheat Sheet * PySpark SparkSQL Cheat Sheet Data Visualization * Matplotlib Cheat Sheet * Seaborn Cheat Sheet * Bokeh Cheat Sheet ( updated! ) IDE * Jupyter Notebook Cheat Sheet Enjoy and feel free to share! PS. Did you see another data science cheat sheet that you’d like to recommend? Let us know here ! * Data Science * Data Analysis * Big Data * Data Visualization * Machine Learning Show your supportClapping shows how much you appreciated Karlijn Willems’s story. 833 1 Blocked Unblock Follow FollowingKARLIJN WILLEMS Data Science Journalist @DataCamp FollowTOWARDS DATA SCIENCE Sharing concepts, ideas, and codes. * 833 * * * Never miss a story from Towards Data Science , when you sign up for Medium. 
Learn more Never miss a story from Towards Data Science Get updates Get updates",Python and R cheat sheets specifically for those who are just starting out with data science or for those who need an extra help when working on data science problems.,Collecting Data Science Cheat Sheets,Live,132 339,"Compose The Compose logo Articles Sign in Free 30-day trialHOW TO SCRIPT PAINLESS-LY IN ELASTICSEARCH Published Aug 9, 2017 How to Script Painless-ly in Elasticsearch elasticsearch painless scripting Free 30 Day TrialWith the release of Elasticsearch 5.x came Painless, Elasticsearch's answer to safe, secure, and performant scripting. We'll introduce you to Painless and show you what it can do. With the introduction of Elasticsearch 5.x over a year ago, we got a new scripting language, Painless. Painless is a scripting language developed and maintained by Elastic and optimized for Elasticsearch. While it's still an experimental scripting language, at its core Painless is promoted as a fast, safe, easy to use, and secure. In this article, we'll give you a short introduction to Painless, and show you how to use the language when searching and updating your data. On to Painless ... A PAINLESS INTRODUCTION The objective of Painless scripting is to make writing scripts painless for the user, especially if you're coming from a Java or Groovy environment. While you might not be familiar with scripting in Elasticsearch in general, let's start with the basics. Variables and Data TypesVariables can be declared in Painless using primitive, reference, string, void (doesn't return a value), array, and dynamic typings. Painless supports the following primitive types: byte , short , char , int , long , float , double , and boolean . These are declared in a way similar to Java, for example, int i = 0; double a; boolean g = true; . Reference types in Painless are also similar to Java, except they don't support access modifiers, but support Java-like inheritance. These types can be allocated using the new keyword on initialization such as when declaring a as an ArrayList, or simply declaring a single variable b to a null Map like: ArrayList a = new ArrayList(); Map b; Map g = [:]; List q = [1, 2, 3]; Lists and Maps are similar to arrays, except they don't require the new keyword on initialization, but they are reference types, not arrays. String types can be used along with any variable with or without allocating it with the new keyword. For example: String a = ""a""; String foo = new String(""bar""); Array types in Painless support single and multidimensional arrays with null as the default value. Like reference types, arrays are allocated using the new keyword then the type and a set of brackets for each dimension. An array can be declared and initialized like the following: int[] x = new int[2]; x[0] = 3; x[1] = 4; The size of the array can be explicit, for example, int[] a = new int[2] or you can create an array with values 1 to 5 and a size of 5 using: int[] b = new int[] {1,2,3,4,5}; Like arrays in Java and Groovy, the array data type must have a primitive, string, or even a dynamic def associated with it on declaration and initialization. def is the only dynamic type supported by Painless and has the best of all worlds when declaring variables. What it does is it mimics the behavior of whatever type it's assigned at runtime. 
So, when defining a variable: def a = 1; def b = ""foo""; In the above code, Elasticsearch will always assume a is a primitive type int with a value of 1 and b as a string type with the value of ""foo"" . Arrays can also be assigned with a def , for instance, note the following: def[][] h = new def[2][2]; def[] f = new def[] {4, ""s"", 5.7, 2.8C}; With variables out of the way, let's take a look at conditionals and operators. Operators and ConditionalsIf you know Java, Groovy, or a modern programming language, then conditionals and using operators in Painless will be familiar. The Painless documentation contains an entire list of operators that are compatible with the language in addition to their order of precedence and associativity. Most of the operators on the list are compatible with Java and Groovy languages. Like most programming languages operator precedence can be overridden with parentheses (e.g. int t = 5+(5*5) ). Working with conditionals in Painless is the same using them in most programming languages. Painless supports if and else , but not else if or switch . A conditional statement will look familiar to most programmers: if (doc['foo'].value = 5) { doc['foo'].value *= 10; } else { doc['foo'].value += 10; } Painless also has the Elvis operator ?: , which is behaves more like the operator in Kotlin than Groovy. Basically, if we have the following: x ?: y the Elvis operator will evaluate the right-side expression and returns whatever the value of x is if not null . If x is null then the left-side expression is evaluated. Using primitives won't work with the Elvis operator, so def is preferred here when it's used. MethodsWhile the Java language is where Painless gets most of its power from, not every class or method from the Java standard library (Java Runtime Environment, JRE) is available. Elasticsearch has a whitelist reference of classes and methods that are available to Painless. The list doesn't only include those available from the JRE, but also Elasticsearch and Painless methods that are available to use. Painless LoopsPainless supports while , do...while , for loops, and control flow statements like break and continue which are all available in Java. An example for loop in Painless will also look familiar in most modern programming languages. In the following example, we loop over an array containing scores from our document doc['scores'] and add them to the variable total then return it: def total = 0; for (def i = 0; i Modifying that loop to the following will also work: def total = 0; for (def score : doc['scores']) { total += score; } return total; Now that we have an overview of some of the language fundamentals, let's start looking at some data and see how we can use Painless with Elasticsearch queries. LOADING THE DATA Before loading data into Elasticsearch, make sure you have a fresh index set up. You'll need to create a new index either in the Compose console, in the terminal, or use the programming language of your choice. The index that we'll create is called ""sat"". Once you've set up the index, let's gather the data. The data we're going to use is a list of average SAT scores by school for the year 2015/16 compiled by the California Department of Education. The data from the California Department of Education comes in a Microsoft Excel file. We converted the data into JSON which can be downloaded from the Github repository here . After downloading the JSON file, using Elasticsearch's Bulk API we can insert the data into the ""sat"" index we created. 
curl -XPOST -u username:password 'https://portal333-5.compose-elasticsearch.compose-44.composedb.com:44444/_bulk' --data-binary @sat_scores.json Remember to substitute the username, password, and deployment URL with your own and add _bulk to the end of the URL to start importing data. SEARCHING ELASTICSEARCH USING PAINLESS Now that we have the SAT scores loaded into the ""sat"" index, we can start using Painless in our SAT queries. In the following examples, all variables will use def to demonstrate Painless's dynamic typing support. The format of scripts in Elasticsearch looks similar to the following: GET sat/_search { ""script_fields"": { ""some_scores"": { ""script"": { ""lang"": ""painless"", ""inline"": ""def scores = 0; scores = doc['AvgScrRead'].value + doc['AvgScrWrit'].value; return scores;"" } } } } Within a script you can define the scripting language lang , where Painless is the default. In addition, we can specify the source of the script. For example, we're using inline scripts or those that are run when making a query. We also have the option of using stored , which are scripts that are stored in the cluster. Also, we have file scripts that are scripts stored in a file and referenced within Elasticsearch's configuration directory. Let's look at the above script in a little more detail. In the above script, we're using the _search API and the script_fields command. This command will allow us to create a new field that will hold the scores that we write in the script . Here, we've called it some_scores just as an example. Within this new script field, use the script field to define the scripting language painless (Painless is already the default language) and use the field inline which will include our Painless script: def scores = 0; scores = doc['AvgScrRead'].value + doc['AvgScrWrit'].value; return scores; You'll notice immediately that the Painless script that we just wrote doesn't have any line breaks. That's because scripts in Elasticseach must be written out as a single-line string. Running this simple query doesn't require Painless scripting. In fact, it could be done with Lucene Expressions, but it serves just as an example. Let's look at the results: { ""_index"": ""sat"", ""_type"": ""scores"", ""_id"": ""AV3CYR8JFgEfgdUCQSON"", ""_score"": 1, ""_source"": { ""cds"": 1611760130062, ""rtype"": ""S"", ""sname"": ""American High"", ""dname"": ""Fremont Unified"", ""cname"": ""Alameda"", ""enroll12"": 444, ""NumTstTakr"": 298, ""AvgScrRead"": 576, ""AvgScrMath"": 610, ""AvgScrWrit"": 576, ""NumGE1500"": 229, ""PctGE1500"": 76.85, ""year"": 1516 }, ""fields"": { ""some_scores"": [ 1152 ] } } The script is run on each document in the index. The above result shows that a new field called fields has been created with another field containing the name of the new field some_scores that we created with the script_fields command. Let's write another query that will search for schools that have a SAT reading score of less than 350 and a math score of more than 350. The script for that would look like: doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value > 350 And the query: GET sat/_search { ""query"": { ""script"": { ""script"": { ""inline"": ""doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value > 350"", ""lang"": ""painless"" } } } } This will give us four schools. 
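If you are not working in a query console, the same filtering query can be sent with curl, reusing the placeholder deployment URL and credentials from the bulk-load step above. One way to do it, sketched here with an assumed file name of script_query.json for the request body:

{
  "query": {
    "script": {
      "script": {
        "inline": "doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value > 350",
        "lang": "painless"
      }
    }
  }
}

and then POST it to the _search endpoint:

curl -XPOST -u username:password 'https://portal333-5.compose-elasticsearch.compose-44.composedb.com:44444/sat/_search?pretty' -H 'Content-Type: application/json' --data-binary @script_query.json

which should return the same four hits.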
Of those four schools, we can then use Painless to create an array containing four values: the SAT scores from our data and a total SAT score, or the sum of all the SAT scores: def sat_scores = []; def score_names = ['AvgScrRead', 'AvgScrWrit', 'AvgScrMath']; for (int i = 0; i We'll create a sat_scores array to hold the SAT scores ( AvgScrRead , AvgScrWrit , and AvgScrMath ) and the total score that we'll calculate. We'll create another array called scores_names to hold the names of the document fields that contain SAT scores. If in the future our field names change, all we'd have to do is update the names in the array. Using a for loop, we'll loop through the document fields using the score_names array, and put their corresponding values in the sat_scores array. Next, we'll loop over our sat_scores array and add the values of the three SAT scores together and place that score in a temporary variable temp . Then, we add the temp value to our sat_scores array giving us the three individual SAT scores plus their total score. The entire query to get the four schools and the script looks like: GET sat/_search { ""query"": { ""script"": { ""script"": { ""inline"": ""doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value i "", ""lang"": ""painless"" } } } } Each document returned by the query will look similar to: ""hits"": { ""total"": 4, ""max_score"": 1, ""hits"": [ { ""_index"": ""sat"", ""_type"": ""scores"", ""_id"": ""AV3CYR8PFgEfgdUCQSpM"", ""_score"": 1, ""fields"": { ""scores"": [ 326, 311, 368, 1005 ] } } ... One drawback of using the _search API is that the results aren't stored. To do that, we'd have to use the _update or _update_by_query API to update individual documents or all the documents in the index. So, let's update our index with the query results we've just used. UPDATING ELASTICSEARCH USING PAINLESS Before we move further, let's create another field in our data that will hold an array of the SAT scores. To do that, we'll use Elasticsearch's _update_by_query API to add a new field called All_Scores which will initially start out as an empty array: POST sat/_update_by_query { ""script"": { ""inline"": ""ctx._source.All_Scores = []"", ""lang"": ""painless"" } } This will update the index to include the new field where we can start adding our scores to. To do that, we'll use a script to update the All_Scores field: def scores = ['AvgScrRead', 'AvgScrWrit', 'AvgScrMath']; for (int i = 0; i Using _update or the _update_by_query API, we won't have access to the doc value. Instead, Elasticsearch exposes the ctx variable and the _source document that allows us to access the each document's fields. From there we can update the All_Scores array for each document with each SAT score and the total average SAT score for the school. The entire query looks like this: POST sat/_update_by_query { ""script"": { ""inline"": ""def scores = ['AvgScrRead', 'AvgScrWrit', 'AvgScrMath']; for (int i = 0; i "", ""lang"": ""painless"" } } If we want to update only a single document, we can do that, too, using a similar script. All we'll need to indicate is the document's _id in the POST URL. In the following update, we're simply adding 10 points to the AvgScrMath score for the document with id ""AV2mluV4aqbKx_m2Ul0m"". POST sat/scores/AV2mluV4aqbKx_m2Ul0m/_update { ""script"": { ""inline"": ""ctx._source.AvgScrMath += 10"", ""lang"": ""painless"" } } SUMMING UP We've gone over the basics of Elasticsearch's Painless scripting language and have given some examples of how it works. 
Also, using some of the Painless API methods like HashMap and loops, we've given you a taste of what you could do with the language when updating your documents, or just modifying your data prior to getting your search results back. Nonetheless, this is just the tip of the iceberg for what's possible with Painless. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Leeroy Agency Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA. Love this article? Head over to Abdullah Alger ’s author page to keep reading.CONQUER THE DATA LAYER Spend your time developing apps, not managing databases. Try Compose for Free for 30 DaysRELATED ARTICLES Aug 4, 2017NEWSBITS - SUMMER READING WITH SCYLLA, ELASTICSEARCH, CASSANDRA AND POSTGRESQL These are the Compose NewsBits for the week ending August 4th... Using Scylla and Elasticsearch together. Cassandra, partitio… Dj Walker-Morgan Jul 28, 2017NEWSBITS - SCYLLA PREVIEWS MATERIALIZED VIEWS These are the database, cloud and developer News bits for the week ending July 28th: A preview of Scylla's materialized view… Dj Walker-Morgan Jul 12, 2017INTEGRATION TESTING AGAINST REAL DATABASES Integration testing can be challenging, and adding a database to the mix makes it even more so. In this Write Stuff contribu… Guest Author Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","With the release of Elasticsearch 5.x came Painless, Elasticsearch's answer to safe, secure, and performant scripting. We'll introduce you to Painless and show you what it can do.",How to Script Painless-ly in Elasticsearch,Live,133 340,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * BLOG Welcome to the Big Data University Blog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (November 01, 2016) * This Week in Data Science (October 25, 2016) * This Week in Data Science (October 18, 2016) * How to run a successful Data Science meetup * This Week in Data Science (October 11, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (NOVEMBER 01, 2016) Posted on November 3, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * Democracy in the age of the Internet of Things – With the release of Swipe the Vote in spring 2016, Tinder, the ultimate hook-up app, broke new ground in the United States by claiming to be able to match young voters with their dream-perfect presidential candidate. * 5 Simple Math Problems No One Can Solve – Easy to understand, supremely difficult to prove. 
* Building an efficient neural language model over a billion words – New tools help researchers train state-of-the-art language models. * These are the 10 hottest data jobs – With recruitment in data on the rise and big businesses investing heavily in the data sector, Hays are looking at the top jobs in data. * Scholars use Big Data to show Marlowe co-wrote three Shakespeare plays – A new edition of William Shakespeare’s complete works will name Christopher Marlowe as co-author of three plays, shedding new light on the links between the two great playwrights after centuries of speculation and conspiracy theories. * Predicting the Presidential Election – With the presidential election less than a week out, Greg shares how he uses data to predict the results of the race. * What Happens When You Merge Virtual Reality with Big Data – Researchers at Cal Tech University are working on platforms that would allow scientists to use immersive virtual reality for multidimensional data visualization. * Pokemon Go Increased U.S. Activity Levels by 144 Billion Steps in Just 30 Days – The latest gaming craze increases activity levels for players, regardless of their age, sex, or weight. * Watch IBM Watson Suggest Treatments for a Cancer Patient – An IBM exec showed off a demo at Fortune’s inaugural Brainstorm Health conference. * Once Again: Prefer Confidence Intervals to Point Estimates – Today I saw a claim being made on Twitter that 17% of Jill Stein supporters in Louisiana are also David Duke supporters. For anyone familiar with US politics, this claim is a priori implausible, although certainly not impossible. * Data science and Big Data: Definitions and Common Myths – There are many ways to define what big data is, and this is why probably it still remains a really difficult concept to grasp. * Accelerated Computing and Deep Learning – This is truly an extraordinary time. In my three decades in the computer industry, none has held more potential, or been more fun. The era of AI has begun. * What to Know Before You Get In a Self-driving Car – Uber thinks its self-driving taxis could change the way millions of people get around. But autonomous vehicles aren’t any­where near to being ready for the roads. * Education’s Response to the Big Data Skills Demand – What are universities and colleges doing to make Big Data skills easier to obtain, and how are they speeding up the educational process to get these people into the workforce faster? UPCOMING DATA SCIENCE EVENTS * Introduction to Python for Data Science – Learn how to use Python for data science on November 10th. * IBM Event: Analytics Strategies in the Cloud – Join IBM and 2-time Canadian Olympic gold-medalist Alexandre Bilodeau on November 7th for a complimentary event in Montreal where you’ll network, eat, drink and engage in an inspiring discussion on making business analytics easier and more available for all departments throughout your company. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.","Our thirty eighth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (November 01, 2016)",Live,134 349,"How do you back up a CouchDB or Cloudant database? One solution is to useCouchDB’s built-in replication API. Let’s say we have a Cloudant database called mydata that we need to back up.In CouchDB 1.x, backing up an entire database was as simple as locating thedatabase’s .couch file and copying it somewhere else. With its 2.x release, CouchDB and theCloudant database shard the data, splitting a single database into pieces anddistributing the data across multiple servers. So backing up a database is nolonger as simple as copying a single file.Then how do you back up? This blog post presents 3 options: * back up to a text file * replicate via the command-line * replicate via the Cloudant dashboardBACK UP TO A TEXT FILECloudant has a RESTful HTTP API, so it is easy to create your own tools tointeract with the service. I created a command-line tool called couchbackup , which you can use to spool an entire database (either CouchDB or Cloudant) toa text file.N.B. couchbackup does not do CouchDB replication, it simply pages throught the/_all_docs endpoint. Conflicts, deletions and revision history are discarded.Only the winning revisions (without the _rev) survive.To install the tool:You must have Node.js installed, together with its “npm” package manager. Then follow these steps: 1. Run: npm install -g couchbackup 2. Define an environment variable which holds the path of either: * your remote Cloudant database: export COUCH_URL=""https://myusername:mypassword@myhost.cloudant.com"" * or local CouchDB instance: export COUCH_URL=""http://localhost:5984"" 3. Back up individual databases to their own text files: couchbackup --db mydb mydb.txt 4. If you want to restore data from a backup into an empty database, then use the tool couchrestore which was also installed with couchbackup : cat mydb.txt | couchrestore --db mydb 5. To increase the speed of the restore operation you can perform multiple write operations in parallel: cat mydb.txt | couchrestore --db mydb --parallelism 5 REPLICATION VIA THE COMMAND-LINEAnother option is to replicate the database to another Cloudant account or to another CouchDB service byissuing an API call to set off a replication task that copies data from thesource database to the target database.Start replication by adding a document into the _replicator database; a document that lists the source and target database, includingauthentication credentials. You can achieve all of this from the command-lineusing a single curl command: export SOURCE=""https://myusername:mypassword@myhost.cloudant.com"" export TARGET=""https://myotherusername:myotherpassword@myotherhost.cloudant.com"" export JSON=""{\""source\"":\""$SOURCE/mydata\"",\""target\"":\""$TARGET/mydata\""}"" curl -X PUT -H ""Content-Type: application/json"" -d ""$JSON"" ""$SOURCE/_replicator""{""id"":""0b05156eefc1feca97e48cd6bd000380"",""_rev"":""1-a301b0fbfa8840f3ca936876729e37cc""} The API returns with a JSON object containing the id of a document, which youcan fetch to monitor the status of the replication job: curl ""$SOURCE/_replicator/0b05156eefc1feca97e48cd6bd000380""If you have Apache CouchDB installed locally and you intend to back up data froma Cloudant cluster, then instruct your local CouchDB installation to perform thereplication. 
Why your local machine? Because it has visibility to the Cloudantservice, but not vice-versa. export SOURCE=""https://myusername:mypassword@myhost.cloudant.com"" export TARGET=""https://localhost:5984"" export JSON=""{\""source\"":\""$SOURCE/mydata\"",\""target\"":\""$TARGET/mydata\""}"" curl -X PUT -H ""Content-Type: application/json"" -d ""$JSON"" ""$TARGET/_replicator""{""id"":""0b05156eefc1feca97e48cd6bd001976"",""_rev"":""1-ac15e7843682715ccb712fac41169cf5""} REPLICATION VIA THE CLOUDANT DASHBOARDYou can also start and monitor a replication using the web-based user interfaceof the Cloudant dashboard. 1. On the left, choose the Replication tab, 2. Click New Replication 3. Complete the form and click Replicate .You can monitor running replications from this screen.In the above example, we are replicating a database that lives in the currentuser’s Cloudant account (the My Databases tab in the Source Database section) to another Cloudant account (the Remote Database tab in the Target Database section). Use the same form to perform replicationsbetween all combinations of local and remote sources and targets.THE DIFFERENCE BETWEEN REPLICATION AND COUCHBACKUPCouchDB/Cloudant replication is a sophisticated sync protocol that ensures alldata from the source database is transferred to the target. If the targetdatabase already contains some documents, then clashing revisions are stored as document conflicts . In addition, deleted documents from the source database are also transferredto the target database.couchbackup simply iterates through the /db/_all_docs endpoint fetching the “winning revisions” no conflicting revisions are created.The result of a couchrestore operation is a collection of “first revisions” that matches the winningrevisions of the source database.BACK UP BEFORE TRYING CLOUDANT’S COUCHDB 2.0 SANDBOXNow that you have the tools you need to do backups, run one now before moving toCloudant’s new sandbox001 cluster. It’s a test cluster that’s running an alpha release of Apache CouchDB2.0. (Backups are important here, as all data will be deleted from the clusterat the end of the sandbox program!)Cloudant will soon run its clusters on the CouchDB 2.0 code base. It’s all partof a larger effort to realign Cloudant’s code base with that of the Apacheproject. For more information, read Stefan Kruger’s article, “Cloudant <3 Apache CouchDB™ 2.0″ , which includes details on accessing the sandbox cluster.LINKS * Cloudant Replication documentation * couchbackup© “Apache”, “CouchDB”, “Apache CouchDB”, and the CouchDB logo are trademarks orregistered trademarks of The Apache Software Foundation. All other brands andtrademarks are the property of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: backup / cloudant / couchbackup / CouchDB / NoSQL / replication Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
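Since everything in the sandbox will eventually be wiped, it can be handy to script a backup of every database in the account before experimenting. A rough sketch, assuming the couchbackup CLI described above (in its write-to-stdout form) plus the jq JSON parser, neither of which is mandated by the original post:

# back up every non-system database reachable through COUCH_URL
export COUCH_URL="https://myusername:mypassword@myhost.cloudant.com"
for db in $(curl -s "$COUCH_URL/_all_dbs" | jq -r '.[]'); do
  case "$db" in
    _*) continue ;;                 # skip _replicator and other system databases
  esac
  echo "backing up $db"
  couchbackup --db "$db" > "$db.txt"
done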
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Following CouchDB's latest release, how do you back up a CouchDB or Cloudant database?",Simple CouchDB and Cloudant Backup,Live,135 354,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (February 7, 2017) * This Week in Data Science (January 31, 2017) * This Week in Data Science (January 24, 2017) * This Week in Data Science (January 17, 2017) * This Week in Data Science (January 10, 2017) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (FEBRUARY 7, 2017) Posted on February 7, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * IBM and United Airlines collaborate on enterprise iOS apps – United Airlines partners with IBM to develop iOS apps in an effort more efficient customer service. * Capturing IoT data from network’s edge to the cloud – Improving customer service through combining untapped IoT data and traditional consumer data. * Becoming a Data Scientist – The skills and tools needed to become an effective Data Scientist. * IBM’s Watson wants to help you do your taxes at H&R Block – IBM Watson partners with H&R Block to improve customer service and identify credits and deductions. * Essentials of working with Python cloud (Ubuntu) – A summary of functionalities that may assist in running Python scripts on the Ubuntu cloud. * First IBM France Sparkathon a winning success – Top Apache Spark enthusiasts participated in the first IBM Sparkathon aimed at improving banking customer services. * Now over 10,000 packages in R – The official R package repository has surpassed the 10,000 mark. * IBM calls healthcare industry a ‘leaky vessel in a stormy sea’ – How the healthcare industry is more at risk for cyberattacks. * A Computer Just Clobbered Four Pros At Poker – Program making use of A.I. algorithm defeats poker professionals. * Internet of Things Tutorial: IoT Devices and the Semantic Sensor Web – How IoT applications utilize multiple sensors and Internet connected devices. * The 5 deadly Data Management sins – 5 practices to avoid Data Management pitfalls. * Data Scientist – best job in America, again – Glassdoor has again ranked the Data Scientist position as the best job in USA. * Internet of Things: Setting business vision on speed and agility – The importance of an agile data platform in a competitive atmosphere. * Stream processing and the IBM Open Platform – Choosing the right engine for real-time data processing with Hadoop. * R Packages worth a look – A roundup of some interesting R packages. 
UPCOMING DATA SCIENCE EVENTS * IBM Event: Big Data and Analytics Summit – February 14, 2017 @ 7:15 am – 4:45 pm, Toronto Marriott Downtown Eaton Centre Hotel 525 Bay St. Toronto Ontario. COOL DATA SCIENCE VIDEOS * Deep Learning with Tensorflow – Applying Recurrent Networks to Language Modelling – Explanation of Applying Recurrent Networks to Language Modelling * Deep Learning with Tensorflow – Introduction to Unsupervised Learning – Overview of the basic concepts of Unsupervised Learning. * Deep Learning with Tensorflow – RBMs and Autoencoders – An overview of RBMs and Autoencoders. * SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * * RELATED Tags: analytics , Big Data , data science , events -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (February 7, 2017)",Live,136 355,"This video shows you how to execute some common HTTP API commands to create, read, update, and delete data in a Cloudant database. Sign up for a Cloudant account here: https://cloudant.com/sign-up/. Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center","This video shows you how to execute some common HTTP API commands to create, read, update, and delete data in a Cloudant database. ",Execute Common HTTP API Commands,Live,137 362,"Learn R programming for data science * Home * About Us * Archives * Contribute * Free Account * We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization! Introduction Getting Data Data Management Visualizing Data Basic Statistics Regression Models Advanced Modeling Programming Best R Packages Tips & Tricks Data ManagementBEST PACKAGES FOR DATA MANIPULATION IN R by Fisseha Berhane on May 17, 2016 2 Commentsdplyr and data.table are amazing packages that make data manipulation in R fun. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct and we can do a lot with data.table in just a single line. Further, data.table is, in some cases, faster (see benchmark here ) and it may be a go-to package when performance and memory are constraints. You can read comparison of dplyr and data.table from Stack Overflow and Quora . You can get reference manual and vignettes for data.table here and for dplyr here . You can read other tutorial about dplyr published at DataScience+ BACKGROUND I am a long time dplyr and data.table user for my data manipulation tasks. For someone who knows one of these packages, I thought it could help to show codes that perform the same tasks in both packages to help them quickly study the other. If you know either package and have interest to study the other, this post is for you. DPLYR dplyr has 5 verbs which make up the majority of the data manipulation tasks we perform. 
Select: used to select one or more columns; Filter: used to select some rows based on specific criteria; Arrange: used to sort data based on one or more columns in ascending or descending order; Mutate: used to add new columns to our data; Summarise: used to create chunks from our data. DATA.TABLE data.table has a very succinct general format: DT[ i, j, by ], which is interpreted as: Take DT, subset rows using i , then calculate j grouped by by . DATA MANIPULATION First we will install some packages for our project. library(dplyr) library(data.table) library(lubridate) library(jsonlite) library(tidyr) library(ggplot2) library(compare) The data we will use here is from DATA.GOV . It is Medicare Hospital Spending by Claim and it can be downloaded from here . Let’s download the data in JSON format using the fromJSON function from the jsonlite package. Since JSON is a very common data format used for asynchronous browser/server communication, it is good if you understand the lines of code below used to get the data. You can get an introductory tutorial on how to use the jsonlite package to work with JSON data here and here . However, if you want to focus only on the data.table and dplyr commands, you can safely just run the codes in the two cells below and ignore the details. spending=fromJSON(""https://data.medicare.gov/api/views/nrth-mfg3/rows.json?accessType=DOWNLOAD"") names(spending) ""meta"" ""data"" meta=spending$meta hospital_spending=data.frame(spending$data) colnames(hospital_spending)=make.names(meta$view$columns$name) hospital_spending=select(hospital_spending,-c(sid:meta)) glimpse(hospital_spending) Observations: 70598 Variables: $ Hospital.Name (fctr) SOUTHEAST ALABAMA MEDICAL CENT... $ Provider.Number. (fctr) 010001, 010001, 010001, 010001... $ State (fctr) AL, AL, AL, AL, AL, AL, AL, AL... $ Period (fctr) 1 to 3 days Prior to Index Hos... $ Claim.Type (fctr) Home Health Agency, Hospice, I... $ Avg.Spending.Per.Episode..Hospital. (fctr) 12, 1, 6, 160, 1, 6, 462, 0, 0... $ Avg.Spending.Per.Episode..State. (fctr) 14, 1, 6, 85, 2, 9, 492, 0, 0,... $ Avg.Spending.Per.Episode..Nation. (fctr) 13, 1, 5, 117, 2, 9, 532, 0, 0... $ Percent.of.Spending..Hospital. (fctr) 0.06, 0.01, 0.03, 0.84, 0.01, ... $ Percent.of.Spending..State. (fctr) 0.07, 0.01, 0.03, 0.46, 0.01, ... $ Percent.of.Spending..Nation. (fctr) 0.07, 0.00, 0.03, 0.58, 0.01, ... $ Measure.Start.Date (fctr) 2014-01-01T00:00:00, 2014-01-0... $ Measure.End.Date (fctr) 2014-12-31T00:00:00, 2014-12-3... As shown above, all columns are imported as factors and let’s change the columns that contain numeric values to numeric. cols = 6:11; # These are the columns to be changed to numeric. hospital_spending[,cols] <- lapply(hospital_spending[,cols],as.character) hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric) The last two columns are measure start date and measure end date. So, let’s use the lubridate package to correct the classes of these columns. cols = 12:13; # These are the columns to be changed to dates. hospital_spending[,cols] <- lapply(hospital_spending[,cols], ymd_hms) Now, let’s check if the columns have the classes we want. sapply(hospital_spending, class) $Hospital.Name ""factor"" $Provider.Number. ""factor"" $State ""factor"" $Period ""factor"" $Claim.Type ""factor"" $Avg.Spending.Per.Episode..Hospital. ""numeric"" $Avg.Spending.Per.Episode..State. ""numeric"" $Avg.Spending.Per.Episode..Nation. ""numeric"" $Percent.of.Spending..Hospital. ""numeric"" $Percent.of.Spending..State. 
""numeric"" $Percent.of.Spending..Nation. ""numeric"" $Measure.Start.Date ""POSIXct"" ""POSIXt"" $Measure.End.Date ""POSIXct"" ""POSIXt"" CREATE DATA TABLE We can create a data.table using the data.table() function. hospital_spending_DT = data.table(hospital_spending) class(hospital_spending_DT) ""data.table"" ""data.frame"" SELECT CERTAIN COLUMNS OF DATA To select columns, we use the verb select in dplyr . In data.table , on the other hand, we can specify the column names. SELECTING ONE VARIABLE Let’s selet the “Hospital Name” variable from_dplyr = select(hospital_spending, Hospital.Name) from_data_table = hospital_spending_DT[,.(Hospital.Name)] Now, let’s compare if the results from dplyr and data.table are the same. compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes REMOVING ONE VARIABLE from_dplyr = select(hospital_spending, -Hospital.Name) from_data_table = hospital_spending_DT[,!c(""Hospital.Name""),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes we can also use := function which modifies the input data.table by reference. We will use the copy() function, which deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object. DT=copy(hospital_spending_DT) DT=DT[,Hospital.Name:=NULL] ""Hospital.Name""%in%names(DT)FALSE We can also remove many variables at once similarly: DT=copy(hospital_spending_DT) DT=DT[,c(""Hospital.Name"",""State"",""Measure.Start.Date"",""Measure.End.Date""):=NULL] c(""Hospital.Name"",""State"",""Measure.Start.Date"",""Measure.End.Date"")%in%names(DT) FALSE FALSE FALSE FALSE SELECTING MULTIPLE VARIABLES Let’s select the variables: Hospital.Name,State,Measure.Start.Date,and Measure.End.Date. from_dplyr = select(hospital_spending, Hospital.Name,State,Measure.Start.Date,Measure.End.Date) from_data_table = hospital_spending_DT[,.(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes DROPPING MULTIPLE VARIABLES Now, let’s remove the variables Hospital.Name,State,Measure.Start.Date,and Measure.End.Date from the original data frame hospital_spending and the data.table hospital_spending_DT. from_dplyr = select(hospital_spending, -c(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)) from_data_table = hospital_spending_DT[,!c(""Hospital.Name"",""State"",""Measure.Start.Date"",""Measure.End.Date""),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes dplyr has functions contains() , starts_with() and, ends_with() which we can use with the verb select. In data.table , we can use regular expressions. Let’s select columns that contain the word Date to demonstrate by example. 
from_dplyr = select(hospital_spending,contains(""Date"")) from_data_table = subset(hospital_spending_DT,select=grep(""Date"",names(hospital_spending_DT))) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes names(from_dplyr) ""Measure.Start.Date"" ""Measure.End.Date"" RENAME COLUMNS setnames(hospital_spending_DT,c(""Hospital.Name"", ""Measure.Start.Date"",""Measure.End.Date""), c(""Hospital"",""Start_Date"",""End_Date"")) names(hospital_spending_DT) ""Hospital"" ""Provider.Number."" ""State"" ""Period"" ""Claim.Type"" ""Avg.Spending.Per.Episode..Hospital."" ""Avg.Spending.Per.Episode..State."" ""Avg.Spending.Per.Episode..Nation."" ""Percent.of.Spending..Hospital."" ""Percent.of.Spending..State."" ""Percent.of.Spending..Nation."" ""Start_Date"" ""End_Date"" hospital_spending = rename(hospital_spending,Hospital= Hospital.Name, Start_Date=Measure.Start.Date,End_Date=Measure.End.Date) compare(hospital_spending,hospital_spending_DT, allowAll=TRUE) TRUE dropped attributes FILTERING DATA TO SELECT CERTAIN ROWS To filter data to select specific rows, we use the verb filter from dplyr with logical statements that could include regular expressions. In data.table , we need the logical statements only. FILTER BASED ON ONE VARIABLE from_dplyr = filter(hospital_spending,State=='CA') # selecting rows for California from_data_table = hospital_spending_DT[State=='CA'] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes FILTER BASED ON MULTIPLE VARIABLES from_dplyr = filter(hospital_spending,State=='CA' & Claim.Type!=""Hospice"") from_data_table = hospital_spending_DT[State=='CA' & Claim.Type!=""Hospice""] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes from_dplyr = filter(hospital_spending,State %in% c('CA','MA',""TX"")) from_data_table = hospital_spending_DT[State %in% c('CA','MA',""TX"")] unique(from_dplyr$State) CA MA TX compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes ORDER DATA We use the verb arrange in dplyr to order the rows of data. We can order the rows by one or more variables. If we want descending, we have to use desc() as shown in the examples.The examples are self-explanatory on how to sort in ascending and descending order. Let’s sort using one variable. ASCENDING from_dplyr = arrange(hospital_spending, State) from_data_table = setorder(hospital_spending_DT, State) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes DESCENDING from_dplyr = arrange(hospital_spending, desc(State)) from_data_table = setorder(hospital_spending_DT, -State) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes SORTING WITH MULTIPLE VARIABLES Let’s sort with State in ascending order and End_Date in descending order. from_dplyr = arrange(hospital_spending, State,desc(End_Date)) from_data_table = setorder(hospital_spending_DT, State,-End_Date) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes ADDING/UPDATING COLUMN(S) In dplyr we use the function mutate() to add columns. In data.table , we can Add/update a column by reference using := in one line. from_dplyr = mutate(hospital_spending, diff=Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.) from_data_table = copy(hospital_spending_DT) from_data_table = from_data_table[,diff := Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.] 
compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes from_dplyr = mutate(hospital_spending, diff1=Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.,diff2=End_Date-Start_Date) from_data_table = copy(hospital_spending_DT) from_data_table = from_data_table[,c(""diff1"",""diff2"") := list(Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.,diff2=End_Date-Start_Date)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes SUMMARIZING COLUMNS We can use the summarize() function from dplyr to create summary statistics. summarize(hospital_spending,mean=mean(Avg.Spending.Per.Episode..Nation.)) mean 1820.409 hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Nation.))] mean 1820.409 summarize(hospital_spending,mean=mean(Avg.Spending.Per.Episode..Nation.), maximum=max(Avg.Spending.Per.Episode..Nation.), minimum=min(Avg.Spending.Per.Episode..Nation.), median=median(Avg.Spending.Per.Episode..Nation.)) mean maximum minimum median 1820.409 20025 0 109 hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Nation.), maximum=max(Avg.Spending.Per.Episode..Nation.), minimum=min(Avg.Spending.Per.Episode..Nation.), median=median(Avg.Spending.Per.Episode..Nation.))] mean maximum minimum median 1820.409 20025 0 109 We can calculate our summary statistics for some chunks separately. We use the function group_by() in dplyr and in data.table , we simply provide by . head(hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)),by=.(Hospital)]) mygroup= group_by(hospital_spending,Hospital) from_dplyr = summarize(mygroup,mean=mean(Avg.Spending.Per.Episode..Hospital.)) from_data_table=hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes We can also provide more than one grouping condition. head(hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital,State)]) mygroup= group_by(hospital_spending,Hospital,State) from_dplyr = summarize(mygroup,mean=mean(Avg.Spending.Per.Episode..Hospital.)) from_data_table=hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital,State)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes CHAINING With both dplyr and data.table , we can chain functions in succession. In dplyr , we use pipes from the magrittr package with %>% which is really cool. %>% takes the output from one function and feeds it to the first argument of the next function. In data.table , we can use %>% or [ for chaining. 
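As a short aside before the chained examples, the grouping and summarising ideas combine naturally, so several statistics can be computed per group at once; a sketch using the same columns as above:

# dplyr: several summary statistics per State
mygroup = group_by(hospital_spending, State)
summarize(mygroup,
          mean = mean(Avg.Spending.Per.Episode..Hospital.),
          maximum = max(Avg.Spending.Per.Episode..Hospital.),
          n = n())

# data.table: the same result in one line
hospital_spending_DT[, .(mean = mean(Avg.Spending.Per.Episode..Hospital.),
                         maximum = max(Avg.Spending.Per.Episode..Hospital.),
                         n = .N), by = .(State)]

The article's own chained examples follow.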
from_dplyr=hospital_spending%>%group_by(Hospital,State)%>%summarize(mean=mean(Avg.Spending.Per.Episode..Hospital.)) from_data_table=hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital,State)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes hospital_spending%>%group_by(State)%>%summarize(mean=mean(Avg.Spending.Per.Episode..Hospital.))%>% arrange(desc(mean))%>%head(10)%>% mutate(State = factor(State,levels = State[order(mean,decreasing =TRUE)]))%>% ggplot(aes(x=State,y=mean))+geom_bar(stat='identity',color='darkred',fill='skyblue')+ xlab("""")+ggtitle('Average Spending Per Episode by State')+ ylab('Average')+ coord_cartesian(ylim = c(3800, 4000)) hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(State)][order(-mean)][1:10]%>% mutate(State = factor(State,levels = State[order(mean,decreasing =TRUE)]))%>% ggplot(aes(x=State,y=mean))+geom_bar(stat='identity',color='darkred',fill='skyblue')+ xlab("""")+ggtitle('Average Spending Per Episode by State')+ ylab('Average')+ coord_cartesian(ylim = c(3800, 4000)) SUMMARY In this blog post, we saw how we can perform the same tasks using data.table and dplyr packages. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct and we can do a lot with data.table in just a single line. Further, data.table is, in some cases, faster and it may be a go-to package when performance and memory are the constraints. You can get the code for this blog post at my GitHub account. This is enough for this post. If you have any questions or feedback, feel free to leave a comment. Tags Best R Packages Data Manipulation dplyr The Author Fisseha is a writer for DataScience+, a data scientist at Aurotech and works for the FDA. He enjoys challenging and complex data analysis, data mining, machine learning and data visualization tasks. Fisseha holds a PhD in atmospheric Physics. LinkedIn WebsiteDISCLOSURE * Fisseha Berhane does not work or receive funding from any company or organization that would benefit from this article. 0 Shares Like this article? Give it a share: Facebook Twitter Google+ Linkedin Email this * Andrej OskinThank you for this interesting article, but one thing is wrong. These lines will mess original data: “` cols = 6:11; # These are the columns to be changed to numeric. hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric) “` It is discussed for example in this SO question: http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information You can not apply ""as.numeric"" to factor, because you will get labels instead of levels. I think it would be better to define options(stringAsFactors = F) somewhere at the beginning of your script, or in .Rprofile * Fisseha BerhaneThank you. 
Changed it to character first by adding hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.character). Also updated the results that were affected.","dplyr and data.table are amazing packages that make data manipulation in R fun. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct",Best packages for data manipulation in R,Live,138 364,"DESIGNING THE UFC MONEYBALL Published Jan 18, 2017 Using big data analysis on sports? Gigi Sayfan takes us through doing just that with Cassandra/Scylla, MySQL, and Redis. In this Compose Write Stuff article, he shows us how he constructs his own Moneyball. Sports and big data analysis are a great match. In any sporting event, so much is happening at every moment. There are many trends that evolve over various time scales - momentum within the same match, within a season, and over the career of an athlete. You can collect a lot of data, analyze it, and use it for many purposes. The movie Moneyball (based on a book by Michael Lewis) made this notion popular. In this article, we will take this concept to the world of MMA (mixed martial arts) and design a data collection and analytics platform for it. The main focus will be on data collection, storage, and access. The design of the actual analytics will be left as the dreaded exercise to the reader! The weapons of choice will be Cassandra/Scylla, MySQL, and Redis. UFC MONEYBALL Before we jump ahead and start talking databases, let's understand the domain a little and what we want to accomplish. I always find this a critical first step that provides structure and a framework to operate within. The use cases and conceptual model usually stabilize pretty quickly and provide a lot of clarity. Additional use cases and concepts are often added later as ""yet another"" and fit into the existing framework. QUICK INTRO TO MMA MMA is a sport where two competitors with backgrounds in multiple martial arts fight each other in a cage using versatile techniques that involve striking (punching and kicking) and grappling (throws, takedowns, joint locks, and chokes). A fighter wins if his opponent submits, is knocked out, or is unable to defend himself intelligently as determined by a referee present in the cage with them. The UFC is the most popular and successful organization, and there is a lot of money involved. USE CASES When you start looking at MMA from a data analytics point of view, many use cases come to mind.
Here is an arbitrary list: * Understand the style of a particular fighter * Find effective attacks against a particular fighter * Find effective defense against the attack of a particular fighter * Understand the energy level of a fighter along a fight * Understand how a fighter changes his style in the presence of an injury * Understand the game plan of a fighter against a particular fighter or type of fighter * Find tactics that can throw a fighter off his game plan * Adjust game plan during a round or between rounds For example, the fighter Demian Maia is an elite Jiu Jitsu practitioner. He is famous for his effective ground game where he drags opponents to the floor, mounts them or takes their back and chokes them. Maia is at one end of the scale because his game plan is very simple and everybody knows what he's going to do, but he is so good at it that he is very difficult to stop. The fighter Yair Rodriguez is at the other end of the scale exhibiting an extravagant style full of jumping, spinning kicks and somersaults mixed with surprise take-downs. It doesn't appear that he himself knows what he's going to do from one second to the next. CONCEPTUAL MODEL A basic conceptual model for this domain may include the following entities: Fighter, Match, and Event. The Fighter entity has a lot of data: physical attributes, age, weight class, fighting stance, stamina, ranks in various martial arts disciplines, match history, injury history, favorite techniques, etc. The Fight entity has a lot of data, too: venue, opponents, referee, number of rounds, and a collection of events. The FightEvent entity represents anything relevant that happens during a match: fighter A advances, takedown/throw attempt, jab thrown, jab lands, uppercut thrown, knock down, front kick to the face, eye poke (illegal), stance switch, guard pass to side control, arm-bar attempt, etc. The EventCategory classifies events to various categories. For example, movement, punch, kick, judo throw, position change on the ground, submission. The interesting aspect of fight events is that they represent a time-series and the order and timing of event sequences contain a lot of information that help our use cases. Note that the UFC organizes and promotes events that contain multiple matches in one night. The events we consider here are fight events happening during the match. MMA ANALYTICS Machine learning or more traditional statistical analysis and visualization can take all the data about a fighter and their opponent, both historically and in real-time during a match. It can provide a lot of insights that will help the well-informed fighter and their coaches prepare the perfect game plan for a particular opponent in a particular match and adjust it intelligently based on how the match evolves. CASSANDRA, MYSQL AND REDIS The UFC moneyball data is diverse and will be used in different ways. Storing it all in one database is not ideal. In this section, I'll describe briefly the databases we will use in our design. CASSANDRA/SCYLLA The open source Apache Cassandra is a great database for time-series data. It was designed for distributed, large-scale workloads. It is fast, stable and battle-tested. It is a decentralized, highly available and has no single point of failure. Cassandra is also idempotent and provides an interesting mix of consistency levels on a query by query level. Cassandra succeeds in doing all that by a careful selection of its feature set and even more careful selection of the features it doesn't implement. 
For example, efficient ad-hoc queries are not supported. With Cassandra you better know the shape of your queries when you model your data and design your schema. Scylla a high-performance drop-in replacement for Cassandra, which is already plenty fast. It claims to have 10X better throughput and super low latency. Conveniently Compose provides Hosted Scylla (in beta right now). This is cool because you get to benefit from the extensive Cassandra documentation, experience, tooling and community and yet run a streamlined and highly optimized Scylla engine. MYSQL MySQL needs no introduction. I'll just mention that Compose now has a hosted MySQL service in beta. If you prefer PostgreSQL, which is also hosted by Compose, or any other relational database that's fine. I will not be using any MySQL-specific capabilities here and the concepts transfer. REDIS Redis is top of the class when it comes to fast in-memory key-value stores. But, it is much more than that and defines itself as a data structure server. We'll see this capability in action later. Of course, Compose can host Redis for you. A HYBRID POLY-STORE STORAGE SCHEME In this section, we'll model our domain and conceptual model. The basic idea is to utilize the strength of each store and divide each type of data or metadata into the most appropriate data store. Then the application can combine data from multiple stores. STORING FIGHT EVENTS IN CASSANDRA Cassandra is a columnar database. This means that column data is stored sequentially in memory (and on disk). But, unlike relational databases, you can query arbitrary data in a single query. Cassandra organizes the data in wide rows. Each such wide row has a key and can contain a lot of data (e.g. 100MB) and you can query a single wide row at the time. If you try to think of it in relational terms, then a wide row is the analog of a SQL table in a DB that doesn't support joins. This can get really confusing because CQL (Cassandra Query Language) is very similar syntactically to SQL, but the same terms mean different things. For example, a Cassandra table is made of multiple wide rows. Each row in the table shares the same schema, but since you can query on a single row at a time it is better to think of a Cassandra table as a collection of SQL tables with similar schema in a sharded relational database. This is pretty accurate because different wide rows may be split across machines. Another limitation of Cassandra's design is that you can efficiently query only consecutive data from a single wide row. That means that it is very important to design your schema in a way that matches your queries. If you need to query data in different orders, Cassandra says disks are cheap and you just need to store the data multiple times in different orders (a.k.a materialized views). Let's see how all this affects our modeling of fight events. We're interested in querying fight events at the match level and then at the round level. This way we can analyze the meaningful time series. We may be interested also in doing longitudinal studies on a particular fighter, how they evolved over their career, what are their strengths and weaknesses etc. 
Here is a Cassandra table schema that addresses these concerns: CREATE KEYSPACE fightdb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }; use fightdb; DROP TABLE fight_events; CREATE TABLE fight_events ( fight_id int, round int, ts int, fighter_id int, event_id int, PRIMARY KEY (fight_id, round, ts) ) WITH CLUSTERING ORDER BY (round ASC, ts ASC); Let's break it down. The first line creates a keyspace called fightdb , which is like a separate DB with its own policies. Normally, replication factor will be at least 3 to gain redundancy. Then we tell Cassandra to use that it, so there is no need to qualify names with the DB name. Next, we drop the fight_events table in case we're re-creating the DB from scratch. Don't do this in production because you'll destroy all your data. You can ALTER TABLE to modify the schema. Finally, we get to create the table fight_Events . It looks like regular SQL. The columns are defined using Cassandra data types. The primary key is where things get interesting. The primary key is composed of a partition key and a clustering key. The partition key is fight_id and it defines the wide row. Every entry with the same fight id will go into the same wide row. The clustering key is round and ts . The ts column represents seconds into the current 5 minutes round (values will be 0 through 299). Inside the wide row, each record is called a compound column, which is the analog of a SQL row. Then we have the clustering order, which says that the order will be by round first and ts second both ascending. It looks pretty harmless so far. Let's insert some data. Inserts look just like SQL inserts, but you have to provide the primary key. No auto incrementing ID (which you want to avoid in a distributed system anyway) INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 10, 2, 1); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 11, 2, 4); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 12, 1, 3); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 2, 7, 1, 4); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 2, 8, 2, 1); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (2, 1, 3, 2, 2); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (2, 1, 4, 2, 1); Cassandra will also overwrite records with the same primary key. No uniques and no duplicates in Cassandra. The reason is that Cassandra is idempotent. You can perform the same operation multiple times with modifying the state. So, insert is also update in Cassandra. OK, let's run some queries. This is where things get interesting. Starting with selecting all records: select * from fight_events; fight_id | round | ts | event_id | fighter_id ----------+-------+----+----------+------------ 1 | 1 | 10 | 1 | 2 1 | 1 | 11 | 4 | 2 1 | 1 | 12 | 3 | 1 1 | 2 | 7 | 4 | 1 1 | 2 | 8 | 1 | 2 2 | 1 | 3 | 2 | 2 2 | 1 | 4 | 1 | 2 (7 rows) So far, so good. Note that the timestamp seems different. The order is indeed by round and ts . Let's verify that by inserting a record with the same primary key replacing the existing one. INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 10, 2, 5); We replaced the first record. 
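The same upsert behavior holds when the inserts come from application code instead of cqlsh. Here is a minimal sketch using the DataStax Python driver; it is only an illustration, not part of the original design. The contact point is made up, and the keyspace is the fightdb keyspace defined above.

from cassandra.cluster import Cluster

# Assumed contact point; substitute your own Scylla/Cassandra connection details.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('fightdb')

insert = session.prepare(
    'INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) '
    'VALUES (?, ?, ?, ?, ?)'
)

# Writing the same primary key (fight_id=1, round=1, ts=10) twice does not
# create a duplicate row; the second write simply overwrites the first.
session.execute(insert, (1, 1, 10, 2, 1))
session.execute(insert, (1, 1, 10, 2, 5))

count = session.execute(
    'SELECT count(*) FROM fight_events WHERE fight_id = 1 AND round = 1 AND ts = 10'
).one()[0]
print(count)  # prints 1 -- still a single row, now carrying event_id = 5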
Selecting all the records shows the following, and the order remains the same (the first record now carries event_id 5):

select * from fight_events;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
 1 | 1 | 10 | 5 | 2
 1 | 1 | 11 | 4 | 2
 1 | 1 | 12 | 3 | 1
 1 | 2 | 7 | 4 | 1
 1 | 2 | 8 | 1 | 2
 2 | 1 | 3 | 2 | 2
 2 | 1 | 4 | 1 | 2

(7 rows)

Let's try getting just the events with an id greater than 3:

select * from fight_events where event_id > 3;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING""

We can't do that. The ALLOW FILTERING option is a table scan, so not much help there. Maybe we can at least select events with a particular event_id (e.g. 4):

select * from fight_events where event_id = 4;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING""

We get exactly the same result. You must specify at least the partition key. Maybe we're asking too much. The event_id is not part of the primary key, so it's understandable why you can't efficiently query by it. Let's go for something simpler and just get all the records from round 1:

select * from fight_events where round = 1;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING""

You can't do that either. Cassandra has the IN operator that allows you to provide multiple keys in one query, but you will have to list each and every partition key:

select * from fight_events where fight_id IN (1, 2) AND round = 1;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
 1 | 1 | 10 | 5 | 2
 1 | 1 | 11 | 4 | 2
 1 | 1 | 12 | 3 | 1
 2 | 1 | 3 | 2 | 2
 2 | 1 | 4 | 1 | 2

(5 rows)

You must use equality tests on all the components of your where clause (which should only use elements from your clustering key, left to right) except the last component, where you can use inequalities or ranges. For example, to get events that occurred after the first 5 seconds of round 1:

select * from fight_events where fight_id IN (1, 2) AND round = 1 AND ts > 5;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
 1 | 1 | 10 | 5 | 2
 1 | 1 | 11 | 4 | 2
 1 | 1 | 12 | 3 | 1

Here are a couple of other queries that are invalid in CQL:

select * from fight_events where fight_id IN (1, 2) AND ts > 5;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""PRIMARY KEY column ""ts"" cannot be restricted as preceding column ""round"" is not restricted""

select * from fight_events where fight_id IN (1, 2) AND round < 3 AND ts > 5;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Clustering column ""ts"" cannot be restricted (preceding column ""round"" is restricted by a non-EQ relation)""

What about indices? Cassandra supports secondary indexes, but they come with so many restrictions and caveats that the consensus is that you should rarely use them.
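For completeness, the key-restricted query that Cassandra does accept translates directly into application code. This is a rough sketch with the DataStax Python driver, reusing the session from the earlier snippet; it is an illustration, not part of the original article. It loops over the partition keys instead of using IN, which keeps each request pinned to a single wide row.

# Equality on the partition key and the leading clustering column, and a
# range only on the last clustering column -- the shape CQL allows.
after_5s = session.prepare(
    'SELECT round, ts, fighter_id, event_id FROM fight_events '
    'WHERE fight_id = ? AND round = ? AND ts > ?'
)

for fight_id in (1, 2):
    for row in session.execute(after_5s, (fight_id, 1, 5)):
        print(fight_id, row.round, row.ts, row.fighter_id, row.event_id)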
Check out this article for all the nitty-gritty details about secondary indices: Cassandra Native Secondary Index Deep Dive . The official position of Cassandra is that disks are cheap and if you want to access your data in different ways, you should simply duplicate the data. In general, for every query type you should have a dedicated table where they data is already organized sequentially such that you can pull out the answer as is. For example, if we want to pull all the data by rounds then our partition key should be round . Consider this primary key for the same fight_events table: PRIMARY KEY (round, fight_id, ts) WITH CLUSTERING ORDER BY (fight_id ASC, ts ASC); We merely switched fight_id and round , but this changes everything (and not for the better). Remember that the partition key defines a wide row that must be fully present on the same node. Since there are only 3 rounds (ignoring championship fights that last 5 rounds) then we can't distribute our data on more than 3 nodes. This is ridiculous of course. Given enough time and data a single machine won't even be able to hold all the fight events that occurred in round 1. The solution is a compound partition key. For example, we can add the month to the partition key (the parentheses group tells Cassandra that round and month are a compound partition key): PRIMARY KEY ((round, month), fight_id, round, ts) ) WITH CLUSTERING ORDER BY (round ASC, ts ASC); Now, each wide row will contain all the events that occurred in a round in a particular month. This is better for data distribution and you don't have to worry about the data in a single wide row growing over time beyond the capacity of a single machine. But, now if you want to query all the events in round 1 over the entire year of 2016, you'll need to run 12 queries for each combination of round 1 and a month: select * from fight_events where round = 1 and month = 1; select * from fight_events where round = 1 and month = 2; select * from fight_events where round = 1 and month = 3; ... select * from fight_events where round = 1 and month = 12; In general, we prefer to avoid data duplication. Cassandra's assertion that disks are cheap doesn't hold up for web-scale systems. In a previous company, I ran a Cassandra cluster that accumulated half a billion events per day. Over more 3 years that system collected many terabytes of data. Various analytics jobs required fast access to the entire dataset. The thing with Cassandra is that even if disk space is relatively cheap, network traffic isn't. Cassandra replicates data as part of its robust design. Cassandra is also constantly compacting and re-shuffling data across the cluster. The more data you have the more you pay for maintenance operations that might even be invisible to you. That's the reason to store as little as possible in Cassandra. You'll note also that the schema contains just integer ids. Where is the actual data? Again, the idea is to save storage. Why store repeatedly the same values? Even with Cassandra's compression, there is a price to pay (mostly in big result sets). This is especially true if you need to update some value stored ubiquitously across the cluster. Enter the relational DB. STORING METADATA IN MYSQL The idea is that all these ids like fight_id , event_id and fighter_id are identifiers of rows in a corresponding relational metadata DB. 
Let's look at a simple schema: CREATE TABLE fight_event ( id INTEGER, name VARCHAR(255), PRIMARY KEY(id) ) ENGINE=INNODB; CREATE TABLE fighter ( id INTEGER, name VARCHAR(255), age INTEGER, weight INTEGER, PRIMARY KEY(id) ) ENGINE=INNODB; CREATE INDEX fighter_age ON fighter(age); CREATE INDEX fighter_weight ON fighter(weight); CREATE TABLE fight ( id INTEGER, fighter1_id INTEGER, fighter2_id INTEGER, title VARCHAR(255), PRIMARY KEY(id), FOREIGN KEY (fighter1_id) REFERENCES fighter(id), FOREIGN KEY (fighter2_id) REFERENCES fighter(id) ) ENGINE=INNODB; CREATE INDEX fight_fighter1 ON fight(fighter1_id); CREATE INDEX fight_fighter2 ON fight(fighter2_id); MySQL can manage the metadata that will be indexed extensively. The metadata is very read-heavy. A lot of indices to update on insert don't present a problem. But, you can slice and dice it very efficiently to arrive at important ids that are stored in Cassandra. The hybrid query pattern is that you query MySQL using convoluted ad-hoc query to your heart's content. You end up with fight ids and fighter ids that you use to construct and filter Cassandra queries and then when you get back from Cassandra a result set with a bunch event ids, you can look them up in the fight_event table or more likely in a in-memory dictionary you loaded at the beginning of your program. STORING BLAZING HOT DATA IN REDIS It sounds like we're all set with the hybrid Cassandra + MySQL hybrid query system. But, sometimes it's not enough. Consider a live UFC championship fight, millions of viewers watching the fight via our custom app that adds live stats and displays various visualizations of real-time fight events and slow-motion. The typical web solution to deal with the massive demand of popular content is a CDN (content delivery network). CDNs are great, but they are mostly optimized for static, large content. Here we're talking about live streams of relatively small data. You may try to service each request dynamically as it comes from the hybrid Cassandra + MySQL, but the reality is that it is very difficult to try and fine-tune the caching behavior. Instead, we can use Redis. Redis is a super-fast, in-memory (yet can be durable), data structure server. That means it's a fancy key-value store that excels at retrieving data for its users. It can be distributed via a Redis Cluster, so you don't have to worry about being limited to a single machine. When there is a massive demand for a lot of data, Redis can be a great solution to improve the responsiveness of the system, as well as providing additional capacity quickly (à la elastic horizontal scaling). In comparison adding a new node to a Cassandra cluster is a long and tedious process. The replication will impact the entire cluster because Cassandra will try to evenly distribute the data between all nodes, even if you just want to add a node temporarily to handle a spike in requests. Redis can also be great for distributed locks and counters . Overall, Redis gives you a lot of options for high-performance flexible operations on data that is not suitable for either Cassandra or MySQL. Cassandra has distributed counters, but they suffer from high latency due to some design limitations . For example, let's say we want to keep track on the significant strikes (very important statistic) of every fighter in every fight this evening. A good way to model it in Redis is to use its HASH data structure. The HASH is a dictionary or a map. Let's create a HASH called significant_strikes . 
The HASH will map the pair fight_id:fighter_id to the number of significant strikes they delivered to the opponent. Note that in some tournaments the same fighter may participate in multiple fights. Here we initialize the significant_strikes HASH by setting two keys ( fight_id:fighter_id ) to 0. In this case, the fighters 44 and 55 fight each other in fight 123. HSET significant_strikes 123:44 0 HSET significant_strikes 123:55 0 Let's say 44 delivered a significant strike. We need to increment its counter: HINCRBY significant_strikes 123:44 1 Now, suppose 55 countered with a 3 strike combo (Wow!): HINCRBY significant_strikes 123:55 3 At each point you can get the entire significant_strikes HASH: HGETALL significant_strikes 1) ""123:44"" 2) ""1"" 3) ""123:55"" 4) ""3"" Or just specific keys: HGETALL significant_strikes 123:55 ""3"" ARCHITECTING THE UFC MONEYBALL Let's go big and think about the overall system architecture. The working assumption is that a large number of users will access the data concurrently. During a live event there will be peak demand for data related to the matches and the participating fighters. In addition, various jobs will run in the background and some long running machine learning processes will digest and crunch numbers constantly. There will be publicly facing REST APIs. Stateless API servers (e.g. nginx) will delegate queries and requests to internal services via fast protocols (e.g. grpc). The services will fetch data from all the stores, merge them, massage the data and return it to the users via the APIs. The users will consume the data via various clients: mobile, web, custom tools, etc. In addition to Cassandra, MySQL and Redis, the system may also use some cloud storage for AWS S3 for archiving cold data and for backups. The system will run on one of the public cloud providers: AWS, GCE or Azure. The stateless microservices will be deployed as Docker containers. The data stores will be deployed directly, and the containers will be orchestrated as a Kubernetes cluster. CONCLUSION Large-scale systems require multiple types of data stores to manage their data properly. When you deal with time-series data, Cassandra is a solid option. ScyllaDB is a promising high-performance drop-in replacement for Cassandra. But, Cassandra data modeling is not trivial and querying it efficiently can be assisted by storing metadata in a relational DB like MySQL. Redis is a great option for caching frequently used data in memory to offload pressure from Cassandra and MySQL. One of the most challenging aspects when designing a large-scale system that has to handle a lot of data is figuring out what kinds of data you need to handle, their cardinality, and the operations that you need to perform on each. Of course, very often you will not have a full grasp at the outset of your problem domain and even if you do, things will change. That means that you also have to build a flexible enough system that will allow you to move data between stores (and possibly add more data stores) as you learn more. Gigi Sayfan is the chief platform architect of VRVIU, a start-up developing cutting-edge hardware and software technology in the virtual reality space. 
Gigi has been developing software professionally for 21 years in domains as diverse as instant messaging, morphing, chip fabrication process control, embedded multi-media application for game consoles, brain-inspired machine learning, custom browser development, web services for 3D distributed game platform, IoT/sensors and most recently virtual reality.This article is licensed with CC-BY-NC-SA 4.0 by Compose. Image via Skitterphoto Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2017 Compose","Gigi Sayfan takes us through doing just that with Cassandra/Scylla, MySQL, and Redis. In this Compose's Write Stuff article, he shows us how he constructs his own Moneyball.",Designing the UFC Moneyball,Live,139 366,"Compose The Compose logo Articles Sign in Free 30-day trialDATALAYER EXPOSED: EMILE BAIZEL & BUILDING A FINTECH BOT ON MONGODB AND ELASTICSEARCH Published Jul 24, 2017 datalayer DataLayer Exposed: Emile Baizel & Building a Fintech Bot on MongoDB and ElasticsearchIt's Monday morning, which means we'll kick this week off with another video from our DataLayer Conference earlier this year. This week, we're featuring Emile Baizel from Digit . Emile Baizel took the stage at DataLayer as our eighth speaker. Emile is a full-stack developer Digit where he works to make the Digit bot smarter. And that's where Emile's inspiration for his talk came from. Digit users engage with Digit through their bot who answers questions about their account like what's my checking balance and is my money safe and more humorous ones as tell me a joke . They tried a few different approaches before going with their current bot that learns from past user questions to help answer future ones. Emile talks about how they built their bot, why they chose MongoDB and Elasticsearch and how they use Node's event emitters. If you'd like to follow along with Emile's slide deck, you can download it here . Previous DataLayer 2017 talks: * Charity Majors' presentation on observability * Ross Kukulinski's presentation on the state of containers * Antonio Chavez's presentation on the why he left MongoDB * Jonas Helfer's presentation on Joins across databases with GraphQL * Joshua Drake's presentation on PostgreSQL as the center of your data universe * Lorna Jane Mitchell's presentation on surviving failure with RabbitMQ * Amy Unrah's presentation on Scaling out SQL Databases with Spanner Be sure to tell us what you think using hashtag #DataLayerConf and check back next Monday for the next talk at DataLayerConf. -------------------------------------------------------------------------------- We're in the planning stages for DataLayer 2018 right now so, if you have an idea for a talk, start fleshing that out. We'll have a CFP, followed by a blind submission review, and then select our speakers, who we'll fly to DataLayer to present. Sounds fun, right? Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article? 
Head over to Thom Crowe ’s author page and keep reading.RELATED ARTICLES Jul 17, 2017DATALAYER EXPOSED: AMY UNRUH & SCALING OUT SQL DATABASES WITH SPANNER Let's start the week off with another video from DataLayer Conf, the Compose sponsored Conference held in Austin this past ma… Thom Crowe Jul 10, 2017DATALAYER EXPOSED: LORNA JANE MITCHELL & SURVIVING FAILURE WITH RABBITMQ It's Monday which means it's time for our next DataLayer Conf video installment. This week, we'll hear about surviving failur… Thom Crowe Jul 3, 2017DATALAYER EXPOSED: JOSHUA DRAKE & POSTGRESQL: THE CENTER OF YOUR DATA UNIVERSE Start your Monday on a high note and catch up on videos from this year's DataLayer Conference. This week we're highlighting J… Thom Crowe Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","Another video from our DataLayer Conference, featuring Emile Baizel from Digit.",Emile Baizel & Building a Fintech Bot on MongoDB and Elasticsearch,Live,140 369,"REDDIT SENTIMENT ANALYSIS IN SPARKR AND COUCHDB Chetna Warade / August 15, 2016A few months ago, I published Sentiment Analysis of Reddit AMAs , which explained how to grab a reddit Ask Me Anything (AMA) conversation and export its data for analysis using our Simple Data Pipe app. From there, I used the Spark-Cloudant Connector, and Watson Tone Analyzer to get insights into writer sentiment. Recently, I followed up with a similar exercise, but this time performing analysis using dashDB data warehouse and R . I’m back at it again, to share my excitement over SparkR , an R API for Apache Spark. Analysis using SparkR lets you create a full working notebook really fast and iterate with ease. In this tutorial, I connect to our Cloudant database using a handy new CouchDB R package, fetch all json documents, create a SparkR dataframe, analyze with SQL and SparkR, then plot results with R. Here’s the flow: BEFORE YOU BEGIN If you haven’t already, read my earlier Sentiment Analysis of Reddit AMAs blog post , so you understand what we’re up to here. You’ll get the background you need, and we can dive right in to this alternate analysis approach. (You don’t need to follow that earlier tutorial, nor the follow-up on dashDB + R in order to implement this SparkR solution. All the steps you need are here in this blog post.) DEPLOY SIMPLE DATA PIPE The fastest way to deploy this app to Bluemix (IBM’s Cloud platform) is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too. (Bluemix offers a free trial, which means you can try this tutorial out for free.) If you would rather deploy manually , or have any issues, refer to the readme . When deployment is done, click the EDIT CODE button. INSTALL REDDIT CONNECTOR Since we’re importing data from reddit, you need to establish a connection between reddit and Simple Data Pipe. Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry . 1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button. 2. Click the package.json file to open it. 3. Edit the package.json file to add the following line to the dependencies list: ""simple-data-pipe-connector-reddit"": ""^0.1.2"" Tip: be sure to end the line above with a comma and follow proper JSON syntax. 4. From the menu, choose File Save . 5. 
Press the Deploy app button and wait for the app to deploy again. ADD SERVICES IN BLUEMIX To work its magic, the reddit connector needs help from a couple of additional services. In Bluemix, we’re going analyze our data using the Apache Spark and Watson Tone Analyzer services. So add them now by following these steps: PROVISION IBM ANALYTICS FOR APACHE SPARK SERVICE 1. On your Bluemix dashboard, click Work with Data . Click New Service . Find and click Apache Spark then click Choose Apache Spark Click Create . PROVISION WATSON TONE ANALYZER SERVICE 1. In Bluemix, go to the top menu, and click Catalog . 2. In the Search box, type Tone Analyzer , then click the Tone Analyzer tile. 3. Under app , click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. 4. In Service name enter only tone analyzer (delete any extra characters) 5. Click Create . 6. If you’re prompted to restage your app, do so by clicking Restage . LOAD REDDIT DATA 1. Launch simple data pipe in one of the following ways: * If you just restaged, click the URL for your simple data pipe app. * Or, in Bluemix, go to the top menu and click Dashboard , then on your Simple Data Pipe app tile, click the Open URL button. 2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe . 3. Click the Type dropdown list, and choose Reddit AMA .When you added a reddit connector earlier, you added the Reddit option you’re choosing now. 4. In Name , enter ibmama , or whatever you wish. 5. If you want, enter a Description . 6. Click Save and continue . 7. Enter the URL for the reddit conversation you want to analyze. You’re not limited to using an AMA conversation here. You can enter the URL of any reddit conversation, including the IBM-hosted AMA we used in earlier tutorials: https://www.reddit.com/r/IAmA/comments/3ilzey/were_a_bunch_of_developers_from_ibm_ask_us 8. Click Connect to AMA . You see a You’re connected confirmation message. 9. Click Save and continue . 10. On the Filter Data screen, make the following 2 choices: * under Comments to Load , select Top comments only . * under Output format , choose JSON flattened . Then click Save and continue . Why flattened JSON? Flat JSON format is much easier for Apache Spark to process, so for this tutorial, the flattened option is the best choice. If you decide to use the Simple Data Pipe to process reddit data with something other than Spark, you probably want to choose JSON to get the output in its purest form. 11. Click Skip , to bypass scheduling. 12. Click Run now . When the data’s done loading, you see a Pipe Run complete! message. 13. Click View details . ANALYZE REDDIT DATA CREATE NEW R NOTEBOOK 1. In Bluemix, open your Apache Spark service. Go to your dashboard and, under Services , click the Apache Spark tile and click Open . 2. Open an existing instance or create a new one. 3. Click New Notebook . 4. Click the From URL tab. 5. Enter any name, and under Notebook URL enter https://github.com/ibm-cds-labs/reddit-sentiment-analysis/raw/master/couchDB-R/Preview-R-couchDB.ipynb 6. Click Create Notebook 7. Copy and enter your Cloudant credentials.In a new browser tab or window, open your Bluemix dashboard and click your Cloudant service to open it. From the menu on the left, click Service Credentials . If prompted, click Add Credentials . Copy your Cloudant host , username , and password into the corresponding places in cell 4 of the notebook (replacing XXXX’s). RUN THE CODE AND GENERATE REPORTS 1. 
Install CouchDB R package Run cells 1 and 2 to install the CouchDB package and library. You need to run these only once. Read more about the package . 2. Define a variable sqlContext to use existing Spark ( sc ) and SparkRSQL Context that is already initialized with IBM Analytics for Apache Spark as Service. In [3]: sqlContext 3. Run cell 4 to connect to Cloudant. 4. Run cell 5 to get a list of Cloudant databases. In [5]: couch_list_databases(myconn) Out [5]: 'pipe_db' 'reddit_sparkr_top_comments_only' 5. Then read this connection by running the next cell:In [6]: print(myconn) CREATE A SPARKR DATAFRAME FROM A CLOUDANT DATABASE There is no magic function that gets desired documents into a ready-to-use SparkR dataframe. Instead, the function couch_fetch() retrieves a document object with value based on a key. At this point in the code, I don't have keys in hand. Thanks to the primary index _all_docs that comes with Cloudant databases, there's no need to write extra code. Simply add a forward slash / and _all_docs to the database name. (To learn more, read https://cloudant.com/for-developers/all_docs/ .) 1. Use _all_docs to fetch all documents from the Cloudant database (mine is named reddit_regularreddit_top_comments_and_replies ) and create a data frame by running the following command:In[7]: results Note: Insert your database name in this cell. You can find it in results from running couch_list_databases(myconn) 2 cells before: About SparkR and R Dataframes ""SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON."" -Spark 1.4 Announcement You can create a SparkR dataframe from R data by calling function createDataFrame() or as.DataFrame() both will do the job. In this case, I used createDataFrame(sqlContext, data) where data is R dataframe or a list, and it returns a DataFrame. Alternatively you can download the content of a SparkDataFrame into an R's data.frame, by calling function as.data.frame() (all lowercase). So, as.data.frame(x) where x is DataFrame, returns a data.frame. Tip: Learn more about R data type by calling function typeof(x) where x is a R data type, either a matrix or list or vector or data.frame. For detailed API http://spark.apache.org/docs/latest/api/R/index.html 2. Print the schema that you just created by running the next cell:In [8]: printSchema(df) which returns: root |-- total_rows: integer (nullable = true) |-- offset: integer (nullable = true) |-- rows_id: string (nullable = true) |-- rows_key: string (nullable = true) |-- rows_rev: string (nullable = true) |-- rows_id_1: string (nullable = true) |-- rows_key_1: string (nullable = true) |-- rows_rev_1: string (nullable = true) |-- rows_id_2: string (nullable = true) |-- rows_key_2: string (nullable = true) |-- rows_rev_2: string (nullable = true) |-- rows_id_3: string (nullable = true) |-- rows_key_3: string (nullable = true) |-- rows_rev_3: string (nullable = true) The first row is the _design_ document so ignore. All that follows is reddit data. 3. Run the typeof(results) command, which returns 'list' 4. Print the list results returned by couch_fetch() . 
In [10]: print(results) $total_rows [1] 4 $offset [1] 0 $rows $rows[[1]] $rows[[1]]$id [1] ""_design/Top comments and replies"" $rows[[1]]$key [1] ""_design/Top comments and replies"" $rows[[1]]$value $rows[[1]]$value$rev [1] ""1-edc6f6bb0062260ecf1160c81872efdd"" $rows[[2]] $rows[[2]]$id [1] ""f4f7cfa487898608fff6eb639fe6ed26"" $rows[[2]]$key [1] ""f4f7cfa487898608fff6eb639fe6ed26"" $rows[[2]]$value $rows[[2]]$value$rev [1] ""1-c0be345c89577577cdeb301328d9e4f5"" ..... 5. Next, iterate over the list of keys returned, fetch individual documents, create a R dataframe, add each document as a row to the dataframe and create a new SparkR dataframe.In [11]: keys_list Output looks like root |-- X_id: string (nullable = true) |-- X_rev: string (nullable = true) |-- author: string (nullable = true) |-- created: integer (nullable = true) |-- edited: integer (nullable = true) |-- id: string (nullable = true) |-- title: string (nullable = true) |-- text: string (nullable = true) |-- Anger: string (nullable = true) |-- Disgust: string (nullable = true) |-- Fear: string (nullable = true) |-- Joy: string (nullable = true) |-- Sadness: string (nullable = true) |-- Analytical: string (nullable = true) |-- Confident: string (nullable = true) |-- Tentative: string (nullable = true) |-- Openness: string (nullable = true) |-- Conscientiousness: string (nullable = true) |-- Extraversion: string (nullable = true) |-- Agreeableness: string (nullable = true) |-- Emotional_Range: string (nullable = true) |-- pt_type: string (nullable = true) +--------------------+--------------------+----------+----------+----------+-------+-----+--------------------+-----+-------+-----+-----+-------+----------+---------+---------+--------+-----------------+------------+-------------+---------------+--------------------+ | X_id| X_rev| author| created| edited| id|title| text|Anger|Disgust| Fear| Joy|Sadness|Analytical|Confident|Tentative|Openness|Conscientiousness|Extraversion|Agreeableness|Emotional_Range| pt_type| +--------------------+--------------------+----------+----------+----------+-------+-----+--------------------+-----+-------+-----+-----+-------+----------+---------+---------+--------+-----------------+------------+-------------+---------------+--------------------+ |f4f7cfa487898608f...|1-c0be345c8957757...| delfinom|1467130823| 0|d4rcp8s| |our strategy ...|18.72| 46.65|19.80|14.65| 27.61| 92.20| 0.00| 64.70| 42.10| 59.10| 81.20| 69.30| 15.80|Top comments and ...| |f4f7cfa487898608f...|1-6b0c4d5588c127c...|BlackOdder|1467127251|1467129666|d4ra2ax| |This is good. Hop...|33.49| 54.14|29.26| 3.34| 29.87| 51.60| 0.00| 88.90| 1.20| 7.90| 96.60| 98.90| 94.20|Top comments and ...| |f4f7cfa487898608f...|1-aa11fd3a2efdfd7...|grauenwolf|1467127117|1467127784|d4r9yse| |I don't see how t...|98.15| 54.75|13.05| 2.03| 5.21| 70.70| 0.00| 96.40| 50.20| 3.80| 47.30| 39.00| 84.30|Top comments and ...| +--------------------+--------------------+----------+----------+----------+-------+-----+--------------------+-----+-------+-----+-----+-------+----------+---------+---------+--------+-----------------+------------+-------------+---------------+--------------------+ ``` ANALYZE REDDIT DATA WITH SPARKR SQL AND PLOT WITH R Now we'll create a bar chart showing comment count by sentiment (for comments scoring higher than 70%). 
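The notebook does this with SparkR in the next cell. If you ever want to sanity-check the same aggregation outside the R kernel, an equivalent in PySpark looks roughly like this. It is a sketch only: the DataFrame df and the tone column names come from the schema printed above, the 70% threshold matches the text, and the cast to double is needed because the schema stores the scores as strings.

# Count comments whose tone score exceeds 70, one count per tone column.
# Assumes a DataFrame `df` shaped like the schema shown above.
tone_columns = ['Anger', 'Disgust', 'Fear', 'Joy', 'Sadness',
                'Analytical', 'Confident', 'Tentative', 'Openness',
                'Conscientiousness', 'Extraversion', 'Agreeableness',
                'Emotional_Range']

counts = {col: df.filter(df[col].cast('double') > 70).count()
          for col in tone_columns}

for col, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(col, n)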
In [12]: registerTempTable(df2,""reddit"") sentimentDistribution 70') df3 70% in IBM Reddit AMA"",col=139, ylim=c(0,130),cex.axis=0.5,cex.names=0.5,ylab=""Reddit comment count"") FILTER REDDIT DATA WITH SPARKR AND PRINT REPORT TO NOTEBOOK You don't need to run SQL queries to work with Spark Dataframes. Dataframes have functions like filter, select, grouping, and aggregation. filter() returns rows and select() returns columns that meet the condition passed in the input. In the following code, filter() returns a Dataframe containing rows that have an emotion score higher than 70%. select() returns author (redditor - reddit user) and text (comments by redditors) from the Dataframe returned earlier. We could use SparkR Dataframe functions head() and showDF() to show a quick data overview. But since we want a full list of comments by sentiment with that high emotional score ( 70%), we call R print() function. In [13] for(i in 1:length(columns)){ columnset 70') ) if(count(columnset) 0){ print('----------------------------------------------------------------') print(columns[i]) print('----------------------------------------------------------------') comments Results show comments grouped by sentiment. Some comments appear under multiple sentiment categories. For example, the question Are you ashamed of Lotus Notes? appears both under Disgust and Extraversion . You can scroll through the list. TRY LOADING A DIFFERENT REDDIT CONVERSATION Launch your Simple Data Pipe app again and return to the Load reddit Data section. In step 7, swap in a different URL, run the notebook again, and check out the results. CONCLUSION If you're an R fan, you'll appreciate that SparkR provides a handy R frontend for Spark. What's great about moving data from JSON document to a SparkR or R dataframe, is that the data structure pretty much remains the same. That offers flexibility to write SparkR notebooks fast and makes it easy to move data in and out of SparkR and R dataframes. Both offer powerful operations that produce informative analytics with high performance. R and JSON: made for each other. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","More analysis options for Simple Data Pipe output. 
Once data is in Cloudant, connect via CouchDB then analyze the JSON with SparkR in a Jupyter notebook.",Reddit sentiment analysis in SparkR and CouchDB,Live,141 372,"Homepage Stats and Bots Follow Sign in Get started Homepage * Home * DATA SCIENCE * ANALYTICS * STARTUPS * BOTS * DESIGN * Subscribe * * 🤖 TRY STATSBOT FREE * Jay Shah Blocked Unblock Follow Following Machine Learning Enthusiast Nov 16 -------------------------------------------------------------------------------- NEURAL NETWORKS FOR BEGINNERS: POPULAR TYPES AND APPLICATIONS AN INTRODUCTION TO NEURAL NETWORKS LEARNING Today, neural networks are used for solving many business problems such as sales forecasting, customer research, data validation, and risk management. For example, at Statsbot we apply neural networks for time series predictions, anomaly detection in data, and natural language understanding. In this post, we’ll explain what neural networks are, the main challenges for beginners of working on them, popular types of neural networks, and their applications. We’ll also describe how you can apply neural networks in different industries and departments. THE IDEA OF HOW NEURAL NETWORKS WORK Recently there has been a great buzz around the words “neural network” in the field of computer science and it has attracted a great deal of attention from many people. But what is this all about, how do they work, and are these things really beneficial? Essentially, neural networks are composed of layers of computational units called neurons, with connections in different layers. These networks transform data until they can classify it as an output. Each neuron multiplies an initial value by some weight, sums results with other values coming into the same neuron, adjusts the resulting number by the neuron’s bias, and then normalizes the output with an activation function. ITERATIVE LEARNING PROCESS A key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process is often repeated. During this learning phase, the network trains by adjusting the weights to predict the correct class label of input samples. Advantages of neural networks include their high tolerance to noisy data, as well as their ability to classify patterns on which they have not been trained. The most popular neural network algorithm is the backpropagation algorithm . Once a network has been structured for a particular application, that network is ready to be trained. To start this process, the initial weights (described in the next section) are chosen randomly. Then the training (learning) begins. The network processes the records in the “training set” one at a time, using the weights and functions in the hidden layers, then compares the resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights for application to the next record. This process occurs repeatedly as the weights are tweaked. During the training of a network, the same set of data is processed many times as the connection weights are continually refined. SO WHAT’S SO HARD ABOUT THAT? One of the challenges for beginners in learning neural networks is understanding what exactly goes on at each layer. 
We know that after training, each layer extracts higher and higher-level features of the dataset (input), until the final layer essentially makes a decision on what the input features refer to. How can it be done? Instead of exactly prescribing which feature we want the network to amplify, we can let the network make that decision. Let’s say we simply feed the network an arbitrary image or photo and let the network analyze the picture. We then pick a layer and ask the network to enhance whatever it detected. Each layer of the network deals with features at a different level of abstraction, so the complexity of features we generate depends on which layer we choose to enhance. POPULAR TYPES OF NEURAL NETWORKS AND THEIR USAGE In this post on neural networks for beginners, we’ll look at autoencoders, convolutional neural networks, and recurrent neural networks. AUTOENCODERS This approach is based on the observation that random initialization is a bad idea and that pre-training each layer with an unsupervised learning algorithm can allow for better initial weights. Examples of such unsupervised algorithms are Deep Belief Networks. There are a few recent research attempts to revive this area, for example, using variational methods for probabilistic autoencoders. They are rarely used in practical applications. Recently, batch normalization started allowing for even deeper networks, we could train arbitrarily deep networks from scratch using residual learning. With appropriate dimensionality and sparsity constraints, autoencoders can learn data projections that are more interesting than PCA or other basic techniques. Let’s look at the two interesting practical applications of autoencoders: • In data denoising a denoising autoencoder constructed using convolutional layers is used for efficient denoising of medical images. A stochastic corruption process randomly sets some of the inputs to zero, forcing the denoising autoencoder to predict missing (corrupted) values for randomly selected subsets of missing patterns. • Dimensionality reduction for data visualization attempts dimensional reduction using methods such as Principle Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). They were utilized in conjunction with neural network training to increase model prediction accuracy. Also, MLP neural network prediction accuracy depended greatly on neural network architecture, pre-processing of data, and the type of problem for which the network was developed. CONVOLUTIONAL NEURAL NETWORKS ConvNets derive their name from the “convolution” operator. The primary purpose of convolution in the case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. ConvNets have been successful in such fields as: * Identifying faces In the identifying faces work, they have used a CNN cascade for fast face detection. The detector evaluates the input image at low resolution to quickly reject non-face regions and carefully process the challenging regions at higher resolution for accurate detection. Illustration sourceCalibration nets were also introduced in the cascade to accelerate detection and improve bounding box quality. Illustration source * Self driving cars In the self driving cars project, depth estimation is an important consideration in autonomous driving as it ensures the safety of the passengers and of other vehicles. 
Such aspects of CNN usage have been applied in projects like NVIDIA’s autonomous car. CNN’s layers allow them to be extremely versatile because they can process inputs through multiple parameters. Subtypes of these networks also include deep belief networks (DBNs). Convolutional neural networks are traditionally used for image analysis and object recognition. Illustration sourceAnd for fun, a link to use CNNs to d rive a car in a game simulator and predict steering angle . RECURRENT NEURAL NETWORKS RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Here is the guide on how to implement such a model . Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words, by making the network treat its inventions as if they were real, much like a person dreaming. • Language-driven image generation Can we learn to generate handwriting for a given text? To meet this challenge a soft window is convolved with the text string and fed as an extra input to the prediction network. The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next. • Predictions A neural network can be trained to produce outputs that are expected, given a particular input. If we have a network that fits well in modeling a known sequence of values, one can use it to predict future results. An obvious example is Stock Market Prediction. APPLYING NEURAL NETWORKS TO DIFFERENT INDUSTRIES Neural networks are broadly used for real world business problems such as sales forecasting, customer research, data validation, and risk management. MARKETING Target marketing involves market segmentation, where we divide the market into distinct groups of customers with different consumer behavior. Neural networks are well-equipped to carry this out by segmenting customers according to basic characteristics including demographics, economic status, location, purchase patterns, and attitude towards a product. Unsupervised neural networks can be used to automatically group and segment customers based on the similarity of their characteristics, while supervised neural networks can be trained to learn the boundaries between customer segments based on a group of customers. RETAIL & SALES Neural networks have the ability to simultaneously consider multiple variables such as market demand for a product, a customer’s income, population, and product price. Forecasting of sales in supermarkets can be of great advantage here. If there is a relationship between two products over time, say within 3–4 months of buying a printer the customer returns to buy a new cartridge, then retailers can use this information to contact the customer, decreasing the chance that the customer will purchase the product from a competitor. BANKING & FINANCE Neural networks have been applied successfully to problems like derivative securities pricing and hedging, futures price forecasting, exchange rate forecasting, and stock performance. Traditionally, statistical techniques have driven the software. These days, however, neural networks are the underlying technique driving the decision making. 
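Forecasting is also the easiest place to make the iterative learning process described earlier concrete. The sketch below is not from the original article: it trains a tiny one-hidden-layer network in plain NumPy on a made-up series, following exactly the loop outlined above - random initial weights, a forward pass, an error propagated back through the layers, and a small adjustment to every weight on each pass. The layer sizes, learning rate, and data are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy 'time series': predict the next value from the previous three.
series = np.sin(np.linspace(0, 8 * np.pi, 200))
window = 3
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:].reshape(-1, 1)

# Random initial weights for a 3 -> 8 -> 1 network.
W1 = rng.normal(0, 0.5, (window, 8))
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1))
b2 = np.zeros(1)
lr = 0.05

for epoch in range(2000):
    # Forward pass: weighted sums, bias, activation.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2

    # Error between the network's output and the desired output.
    err = pred - y

    # Backpropagation: push the error back and nudge every weight.
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

print('final training MSE:', float((err ** 2).mean()))

After a couple of thousand passes the training error is small, which is all 'training' means here; a real forecasting model would add held-out validation, feature scaling, and of course real data.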
MEDICINE Neural networks are a trending research area in medicine, and it is believed that they will receive extensive application to biomedical systems in the next few years. At the moment, the research is mostly on modelling parts of the human body and recognising diseases from various scans. CONCLUSION Perhaps NNs can, though, give us some insight into the “easy problems” of consciousness: how does the brain process environmental stimulation? How does it integrate information? But the real question is: why and how is all of this processing, in humans, accompanied by an experienced inner life, and can a machine achieve such self-awareness? It makes us wonder whether neural networks could become a tool for artists — a new way to remix visual concepts — or perhaps even shed a little light on the roots of the creative process in general. All in all, neural networks have made computer systems more useful by making them more human. So next time you think you might like your brain to be as reliable as a computer, think again — and be grateful you have such a superb neural network already installed in your head! I hope that this introduction to neural networks for beginners will help you build your first project with NNs. RECOMMENDED SOURCES FOR BEGINNERS DEEP NEURAL NETWORKS * What is the difference between deep learning and usual machine learning? * What is the difference between a neural network and a deep neural network? * How is deep learning different from multilayer perceptron? NEURAL NETWORKS PROJECTS * A classic example of Mapping Input to Output Image * Trying out Face Recognition on your own * Convolutional Neural Networks for Visual Recognition * Online Stanford Course on CNNs * Machine Learning * Neural Networks * Artificial Neural Network * Data Science * Recurrent Neural Network JAY SHAH Machine Learning Enthusiast STATS AND BOTS Data stories on machine learning and analytics. From Statsbot's makers.","An introduction to neural networks for beginners: the main challenges of working on neural networks, their popular types and applications.",Neural networks for beginners: popular types and applications,Live,142 374,"APACHE SPARK 0 TO LIFE-CHANGING APP: SCALA FIRST STEPS AND AN INTERVIEW WITH JAKOB ODERSKY SCALA! THE LANGUAGE THAT EVOKES EXTREME DIFFERENCES IN OPINION. Being new to Silicon Valley, I have only recently come across the very strong opinions of developers. Whether it be spaces versus tabs or Scala versus Python, people definitely feel strongly one way or the other.
So whether you love Scala for its brevity and concise nature or whether you hate it for being different, the fact is, Scala is very important for Spark, and after all, this is the Spark Technology Center. That is why this week I am giving you some context around Scala and a means to get you started, you know, before we move forward with our life-changing app and generally saving the world. Not sure what I'm talking about or what I'm doing? Look here and here and here . For all of those people out there who are new to Spark or Scala, what you might not know is that although Spark has a shell available in Scala and Python and supports Scala, Java, Python, Clojure, and R, Scala has an advantage . Spark is written in the Scala Programming Language and runs on the Java Virtual Machine (JVM). This means that Scala has more capabilities on Spark than the PySpark alternative. (Depending on who you ask, this difference is varying--again, lot's of opinions!) Not only this, but Scala inherently allows you to have more succinct code, which is great for working with big data. TO UNDERSTAND SCALA EVEN BETTER, I SAT DOWN WITH JAKOB ODERSKY, A REAL-LIFE, BONAFIDE SCALA EXPERT, TO ASK HIM A FEW SERIOUSLY SCALA QUESTIONS. WHY IS SCALA IMPORTANT FOR SPARK? Spark's core APIs are implemented in Scala; it is the lingua franca of the engine. I would also suggest that Scala's features, specifically its conciseness combined with typesafety, make it ideal for implementing any kind of collection framework, which, if you think about it, Spark really is at its highest level of abstraction. DOES IT PERFORM DIFFERENTLY THAN PYTHON? Python is generally an interpreted language and therefore runs slower than Scala, which is compiled to java bytecode and can run on the heavily optimized Java Virtual Machine. In the original Spark APIs, where you write a sequence of operations on RDDs, a difference in performance is quite noticable. However, The newer Spark APIs (Datasets, Dataframes, etc) are opaque, in that they hide operation details and let you specify ""what"" you want rather than ""how"" you want it. This enables them to apply further optimization and expose a uniform entry-point to all languages, thus making performance differences negligible (if you require only the functionality provided by the newer APIs). WHAT DO YOU LIKE MOST ABOUT SCALA? There a couple of things I like about the language. Its type system is incredibly complete, yet it doesn't get in your way of writing elegant and concise code. I would say that my favorite feature is its simplicity compared to expressivity: the language itself offers few, yet extremely powerful constructs, allowing you to build libraries that feel ""native"" or ""built-in"", yet are just implemented with regular features offered by Scala to anyone. WHY IS IT RELEVANT TO BIG DATA AND SYSTEMML? Making so called ""big data"" accessible from easy-to-use abstractions is essential for fast and productive analysis. Scala makes it very simple to write domain specific languages that can leverage analytics engines such as SystemML but offer a low-barrier entry point to anyone. Furthermore, it is also possible to use Scala in an interpreter, making it a natural choice to integrate into data science notebooks [like Jupyter and Zeppelin]. This in turn makes it possible to rapidly explore data, and with all the benefits of the language's safety and expressivity, also make it a fun experience! DO YOU HAVE ANY RESOURCES YOU WOULD RECOMMEND FOR NEW DEVELOPERS AND DATA SCIENTISTS? 
My recommendation would be to check out the first weeks of some online courses, just to get a basic understanding of the language. As a beginner you are extremely susceptible to either liking or hating a topic, depending on the way you learn it, so a good source is essential. There is no need to follow the whole program, however; just a few hours should give you a solid foundation to continue on your own. If you already have some knowledge of Java, I would also recommend reading Cay Horstmann's book ""Scala for the Impatient"". NOW THAT YOU HAVE THE CONTEXT, BELOW IS A BASIC TUTORIAL ON HOW TO GET GOING WITH SCALA. Quick Note: going beyond this cheat sheet is essential. I definitely recommend reading the book 'Atomic Scala' by Bruce Eckel and Dianne Marsh to understand the basics of Scala syntax once you have your shell or REPL up and running. ASSUMING YOU FOLLOWED MY FIRST BLOG, YOU SHOULD HAVE ALREADY DOWNLOADED SPARK AND SET SPARK HOME IN YOUR BASH PROFILE. IF YOU HAVEN'T, THEN DO THIS BEFORE YOU TRY TO ENTER THE SPARK SHELL IN THE STEP BELOW. MAKE SURE TO ALSO SET YOUR PATH! MY SCALA AND SPARK ARE TOGETHER IN THE FOLLOWING EXAMPLE. FIRST, MAKE SURE JAVA IS INSTALLED. //In your terminal type: java -version //Update if needed //Or install if needed brew tap caskroom/cask brew install Caskroom/cask/java UPDATE OR INSTALL SCALA. //check what version of scala you have installed scala -version //If you want to switch versions type this: brew switch scala 2.9.2 brew switch scala 2.10.0 //If you need to install scala brew install scala SET SCALA HOME AND PUT SCALA IN YOUR PATH. //Pay attention to where you saved Scala! //Go to your bash profile. vi ~/.bash_profile //Type i for insert. i //Now set Scala Home and put it in your path. export SCALA_HOME=/Users/stc/scala /*Notice my Scala Home and Spark Home are on the same line of code for my path.*/ export PATH=$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH //Now write and quit the changes :wq LOAD THE CHANGES YOU MADE IN YOUR BASH PROFILE. source ~/.bash_profile NOW YOU CAN LOAD THE REPL (READ-EVALUATE-PRINT-LOOP) OR THE SPARK SHELL TO WORK IN SCALA. //To load the REPL just type while in your terminal: scala /*If you saved Scala Home and put it in your path it should work */ //For the spark-shell, type: spark-shell //The scala> prompt should now be showing. //If it's not, double check your .bash_profile YOU'RE READY TO START EXPERIMENTING! //Try setting some variables and running simple math. scala> val a = 15 scala> val b = 15.15 scala> a * b //should return: res0: Double = 227.25 //Double means a fractional number. //An Int means a whole number. //Knowing this, you could rewrite the above code as: scala> val a:Int = 15 scala> val b:Double = 15.15 /*Just remember that val is immutable and var is mutable. Immutable means that if you change the value, you create a new value. Mutable means you can change the value at the source. Be careful using mutable values if you're working with others. This can make it very difficult for everyone to be on the same page at the same time.*/ //You can also print your first line. scala> println(""What up Scala coder?"") //If you're ready to exit, type: :quit Now you are ready to use Scala in the Spark shell! Before we move forward with our life-changing app, I'd recommend viewing some tutorials or reading one of the recommended books. Knowledge of Scala will be super helpful as we move forward with saving the world! Stay tuned for our next step! By Madison J.
Myers DATE 25 July 2016 TAGS apache spark, systemml, Life-changing","To understand Scala even better, I sat down with Jakob Odersky, a real-life, bonafide Scala expert, to ask him a few seriously Scala questions.",0 to Life-Changing App: Scala First Steps and an Interview with Jakob Odersky,Live,143 375,"THIS WEEK IN DATA SCIENCE (JULY 12, 2016) Posted on July 12, 2016 by Coralie Phanord Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * Musicmap – Learn about the Genealogy and history of popular music genres through an interactive data visualization. * IBM is making a music app that can create entirely new songs just for you – IBM Watson will soon be able to work as a creative assistant with humans to create entirely new music on an app. * Data Mining Reveals the Six Basic Emotional Arcs of Storytelling – Researchers at the Computational Story Lab at the University of Vermont used sentiment analysis to map the emotional arcs of over 1,700 stories and then used data-mining techniques to reveal the most common arcs. * Predictive Analytics, Big Data, and How to Make Them Work for You – Learn how data mining, regression analysis, machine learning, and data visualization tools can help change the way you do business. * Internet Of Things On Pace To Replace Mobile Phones As Most Connected Device In 2018 – Internet of Things (IoT) devices are expected to grow at a 23% compound annual growth rate from 2015 to 2021. They are expected to exceed mobile phones as the largest category of connected devices. * Is it brunch time? – Ben Jacobson uses data analysis and visualization to study the best and most popular time for brunch. * Google's robot cars recognize cyclists' hand signals – better than most cyclists – Google's self-driving car is friendly to cyclists. It will err on the cautious side and surrender the lane to cyclists. * Weather Visualization is Powered by Big Data – High-performance computing (HPC) developers are using Big Data to eliminate guesswork involved in accurate weather forecasts. * Introducing OpenCellular: An open source wireless access platform – Facebook has designed a cost-effective open source wireless access platform aimed to improve connectivity in remote areas of the world. * Improving City Living with Smart Lighting Data – Hackathon platform Devpost and GE are launching a Hackathon that challenges civic hackers to develop smart city applications using the data from Internet-connected lighting systems. * Big data jobs are in high demand – As Big Data is becoming a part of everyday life, organizations in all fields can use big data to improve. * Google's DeepMind AI to use 1 million NHS eye scans to spot diseases earlier – Google partnered with NHS's Moorfields Eye Hospital to apply machine learning in order to spot eye diseases earlier.
* Hadoop vs Spark: Which is right for your business? Pros and cons, vendors, customers and use cases – What are the pros and cons of each open source big data framework, and which is best for your enterprise. * Privacy Shield – Houston, We Still Have a Problem! – The European Commission (EC) has been working on an agreement with the U.S., called the Privacy Shield. How is it different from Safe Harbour, the previous agreement? * Mapping the Computer Science Skills Gap – The App Association has created an interactive map showing the areas of the United States with the highest demand for people with computer science skills. * Why Python is Slow: Looking Under the Hood – Take a look at Python's standard library and dive into the details to understand why Python is so slow. UPCOMING DATA SCIENCE EVENTS * Data Science Bootcamp – A summer of data, analytics and insight – Join Ryerson University and Big Data University's bootcamp this summer in Toronto. * The Big Data Channel – Join leaders in Big Data at the IoT, Big Data, and Visualization summits on September 8 & 9 in Boston. * Data for Development: Powering Evidence-Based International Aid with Mobile Technology – Join the Center for Data Innovation for a panel discussion on how policymakers and international development organizations can take advantage of data to improve effectiveness, on August 3rd in Washington D.C. Tags: analytics, Big Data, data science, events, weekly roundup","Our twenty second release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (July 12, 2016)",Live,144 377,"Susanna Tai, Offering Manager, Watson Data Platform | Data Catalog Aug 15 -------------------------------------------------------------------------------- DON'T THROW MORE DATA AT THE PROBLEM! HERE'S HOW TO UNLOCK TRUE VALUE FROM YOUR DATA LAKE August 15, 2017 | Written by: Jay Limburn Just recently in the UK, we've seen the dangers of making decisions based on incomplete or poor data play out on the world stage. The Prime Minister called a general election three years earlier than she needed to, basing her decision on data that showed that it would allow her to win a bigger majority in parliament. Evidently, the data her team used was lacking: her party lost its overall majority, and the UK ended up with a hung parliament.
So, what had the Prime Minister’s team missed? The election saw a higher turnout of voters under age 35 than previous elections[1] — a demographic that her policies had failed to win over. The result was a bad decision based on incomplete data. We may not all have the fates of nations in our hands, but the lesson is one from which we can all learn. Companies grapple with a version of this same challenge every day, when they try to make important strategic decisions based on data that may be incomplete, inconsistent, inaccurate, or out-of-date. To lessen the likelihood of bad decisions, many companies have invested in extending their data lakes: the idea is that the more data you have, the less likely you are to miss something important. But throwing more data at the problem isn’t always enough to protect you from poor choices. Having too much information can prevent you from seeing the forest for the trees — particularly if that information is poorly organized or difficult to find. DISILLUSIONED WITH BIG DATA? YOU’RE NOT THE ONLY ONE It’s a familiar story: companies respond to the hype around big data by building huge data lakes, but then find they don’t deliver the expected value. The data is there, but knowledge workers can’t easily access it, and therefore can’t work effectively. Moreover, the company is now paying for new systems to house all this data, and needs to find highly skilled data scientists and engineers to maintain them. What’s gone wrong? One common issue is cultural: despite having the technical infrastructure in place, different departments are often reluctant to share their data. We discussed this challenge in my recent blog post, “ Data governance — You could be looking at it all wrong ” , but essentially, data owners need to have confidence that the data they share will be accessed, used and protected appropriately. A lack of effective data governance within data lakes prevents users from trusting the system, so they hoard their data instead. As a result, its value is lost to the rest of the company. Even if users are persuaded to share their data, it can be difficult to decide (a) how to share it, and, (b) what kind of data cleansing needs to happen before it is safe for others to use. Answering these questions may require yet another large IT investment. The other major challenge is findability of data. This issue is often exacerbated when companies treat their data lake as a dumping ground for assets, rather than a well-organized and actively managed archive. In these circumstances, it is difficult for users to find or understand assets within the data lake, and when they do, they are of questionable quality and unknown provenance. Again, this discourages data sharing and reuse. The problem is widespread: it was recently reported that data scientists, business analysts and other knowledge workers estimate that they spend 80 percent of their time searching for, cleaning and organizing data, and only 20 percent actually analyzing it.[2] But what if there was a way to resolve the challenges around both data governance and findability of data in a single move? ENTER IBM DATA CATALOG Built on Watson Data Platform , IBM Data Catalog is IBM’s next-generation, cloud-based enterprise data catalog. It promises to provide a central solution where users can catalog, govern and discover information assets, and it is designed to slash the time spent searching for and hesitating over sharing data, so that you can focus on extracting business value from your data assets. 
With Data Catalog, you will be able to index the assets already in your data lake, and then extend your strategy to include data from other sources too. For example, you can take advantage of the built-in governance and control functions to safely ingest enterprise assets that you were previously unable to move to the data lake due to complexity or ownership issues. Data hosted by shadow IT teams or SaaS providers, open datasets, data from social media or sensor feeds, local spreadsheets and other dark data, and so on — Data Catalog will help you liberate the value from all of these sources. Beyond the advantages of uniting all your assets in a single, governed catalog, Data Catalog will also offer: Self-service capabilities: With its intelligent catalog capabilities, Data Catalog will provide users with true self-service access to all the assets they are authorized to see. Its advanced search features will also help users zone in on the data that is most relevant to them, contributing to productivity. Driving culture change: With Data Catalog, every user becomes a data custodian. By making the process of cataloging simple, and automating the enforcement of governance policies, it will encourage users to share data. They can also curate and comment on assets, which makes the data easier for other users to find in the future. These factors drive a culture change towards data-centricity, creating a virtuous circle that continuously improves data governance over time. Uncovering insights: By providing a space where users can bring different datasets together and work with them in new ways, Data Catalog will help knowledge workers get deeper, more accurate and more nuanced answers to their questions, sooner. Integration with other solutions: Data Catalog will integrate with IBM Data Connect through the fabric of Watson Data Platform, making it easy for users to access physical data and move it into shared sandboxes or other workspaces for further manipulation or analysis. It is also integrated with IBM Data Science Experience , giving users access to a set of powerful data science tools they can use to explore new datasets and enhance their analysis. THE LURE OF THE CLOUD A few years ago, it was common to hear people say they would never move data outside their company’s firewalls. However, times are changing. Recent high-profile cyber attacks have demonstrated that keeping data on-premises may be no safer that storing it in the cloud. In fact, there’s even an argument that specialized cloud service providers may be able to take advantage of economies of scale to invest in better security capabilities than most traditional companies can afford in-house. As a result, many organizations are now considering moving at least some of their data into the cloud. For these organizations, creating a metadata index of your data with Data Catalog will be an ideal starting point. You won’t actually have to move your data to the cloud — only your metadata. In the process, you can get comfortable with cloud solutions, and start to foster support within your organization. As you gain confidence, Data Catalog will also help you assess which of your data assets naturally gravitate towards cloud platforms, and how best to prioritize the next steps in your cloud strategy. If we’ve piqued your interest, learn more about Data Catalog today. 
[1] Source: How Britain voted at the 2017 general election (YouGov) [2] Source: 2016 Data Science Report (CrowdFlower) -------------------------------------------------------------------------------- Originally published at www.ibm.com on August 15, 2017. * Data Governance * Data Management * Big Data * Data Lake * Data Catalog","Just recently in the UK, we've seen the dangers of making decisions based on incomplete or poor data play out on the world stage. The Prime Minister called a general election three years earlier than…",Don't throw more data at the problem! Here's how to unlock true value from your data lake,Live,145 378,"Greg Filla, Product manager & Data scientist — Data Science Experience and Watson Machine Learning Apr 14 -------------------------------------------------------------------------------- HOW TO USE DB2 WAREHOUSE ON CLOUD IN DATA SCIENCE EXPERIENCE NOTEBOOKS We have heard from many of you that Db2 Warehouse on Cloud is the relational database of choice for use in DSX. Today, I'm happy to announce a new feature that makes it even easier to use your Db2 Warehouse on Cloud data in DSX notebooks. We have added the same “Insert to code” functionality for Db2 Warehouse on Cloud that we have available for CSV and JSON files. You can insert a Db2 Warehouse on Cloud table into your code by creating a Connection in DSX. Connections allow you to manage database connections that can be added to different projects in DSX. This helps to encapsulate the data access for only members of a project. Let's see this feature in action: SETTING UP A PROJECT TO USE THIS FEATURE 1. Now, create a connection for this service in DSX. You can use this documentation to help with this step. 2. Once the connection is created, add it to a project by going to `Connections` in the 1001 tab of a project, checking the box for your connection and clicking `Apply`. 3. With the connection in the project, it's ready for use in a notebook. The connection can be used for existing notebooks or new ones. USING THIS FEATURE INSIDE A NOTEBOOK 1. This feature is very similar to other insert to code functionality in DSX. Check out this post showing how it is used for files in object storage. 2. Use this documentation to see how insert to code for Db2 Warehouse on Cloud works. 3. See the sections below to see what file formats can be selected for Db2 Warehouse on Cloud tables.
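To give a feel for the first Python option listed below, here is a rough, hand-written sketch of the kind of ibmdbpy code the “Insert to code” feature produces. The DSN, credentials, and table name are placeholders rather than values from this article, and the exact generated code will differ:

# Hedged sketch only: every connection detail below is a placeholder.
from ibmdbpy import IdaDataBase, IdaDataFrame

dsn = 'DASHDB;Database=BLUDB;Hostname=<your-host>;Port=50000;PROTOCOL=TCPIP;UID=<user>;PWD=<password>'
idadb = IdaDataBase(dsn=dsn)                      # open the Db2 Warehouse on Cloud connection
idadf = IdaDataFrame(idadb, '<SCHEMA>.<TABLE>')   # lazy handle; work is pushed to the database

print(idadf.head(5))   # only the first few rows are pulled back into pandas
idadb.close()

The appeal of the IdaDataFrame option over a plain pandas DataFrame is that filtering and aggregation run inside Db2 Warehouse on Cloud, so large tables never have to fit into the notebook's memory.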
Python Notebook * ibmdbpy IdaDataFrame (awesome library from IBM — pushes operations to Db2 Warehouse on Cloud rather than pulling into memory/cluster) * pandas DataFrame * SQL Context (Spark 1.6)/ SparkSession (Spark 2.0) * Insert Credentials R Notebook * ibmdbr ida.data.frame (awesome package — similar to ibmdbpy) * R DataFrame * SQL Context (Spark 1.6)/ SparkSession (Spark 2.0) * Insert Credentials Scala Notebook * SQL Context (Spark 1.6)/ SparkSession (Spark 2.0) * Insert Credentials Watch this video to see how to set up a connection to Db2 Warehouse on Cloud and a simple example of loading and analyzing data in a Scala notebook. We hope this feature makes your future work with DSX and Db2 Warehouse on Cloud a breeze. You can add any feedback or product suggestions to the DSX ideas page. -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on April 14, 2017. * Data Science * Dsx * Db2","We have heard from many of you that Db2 Warehouse on Cloud is the relational database of choice for use in DSX. Today, I'm happy to announce a new feature that makes it even easier to use your Db2…",How to use Db2 Warehouse on Cloud in Data Science Experience notebooks,Live,146 382,"OFFLINE-FIRST QR-CODE BADGE SCANNER Glynn Bird / May 5, 2016 Offline-first web applications are websites with a twist; they instruct the browser to cache all of the assets they need to render themselves, such as images, css, and JavaScript files. Once loaded, the websites continue to function even when there is a flaky or non-existent network connection. The killer feature of such apps is that they can use in-browser storage to read and write dynamic data without relying on the presence of a cloud server. PouchDB lets the web application store data in the browser using a variety of local storage mechanisms, while presenting a simple API. Furthermore, when it does find a network connection, a PouchDB database can sync with a remote Apache® CouchDB™ or Cloudant database, and changes flow seamlessly in both directions without loss of data. Last year I made a simple offline-first data collection app that lets you design an HTML form and then use it to capture structured data, which is stored in PouchDB. My developer advocate colleagues used the app to collect submissions for a competition at a tech conference where the wifi was so poor that offline-first was the only option. When they returned home, they synced the PouchDB in their iPads to a shared Cloudant database. I thought I'd revisit this app to allow it to scan conference badges that contain a QR code. In fact, I ended up writing a whole new app.
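Before digging into payload details in the next section, a quick aside: if you want a scannable test badge to point the scanner at, the snippet below generates one. It uses the third-party Python qrcode package with Pillow (pip install qrcode[pil]), which is not part of this app's JavaScript stack and is shown only as a convenience; the contact details are made up.

# Optional helper, separate from the badge-scanner app itself: generate a test
# QR code containing a small vCard-style payload of fake contact details.
import qrcode

payload = '\n'.join([
    'BEGIN:VCARD',
    'VERSION:3.0',
    'FN:Test Attendee',
    'ORG:Example Org',
    'EMAIL;WORK;INTERNET:test@example.com',
    'END:VCARD',
])
img = qrcode.make(payload)    # returns a Pillow image of the QR code
img.save('test-badge.png')    # print it or display it, then scan it with the app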
Themore data the QR code has to store, the more detailed the blocks on the QR codeimage have to be:url vCardThe URL example above contains only http://www.glynnbird.com . The vCard example contains:BEGIN:VCARDVERSION:3.0N:Bird;GlynnFN:Glynn BirdORG:IBMTITLE:Developer AdvocateADR;work:;;1 The Square;Bristol;;BS1 6DG;UKTEL;WORK;VOICE:01179295012EMAIL;WORK;INTERNET:glynn.bird@uk.ibm.comURL:www.glynnbird.comEND:VCARDLEVERAGING OPEN-SOURCEAt first, I thought I would have to create a native iPhone or Android app tocapture images from a camera, decode QR codes, and store data in a database.Fortunately other open-source heroes have solved the hard problems for me: * W3C MediaStream API – to capture the host’s video camera feed * JavaScript QR code parsing library – to find QR codes in images * vCard parsing snippet – to parse vCard text * PouchDB – to store data in a local, in-browser database * Picnic CSS – to make the front-end presentable * AppCache API – to cache page assetsStill, it’s not quite that simple. The MediaStream API is new and notuniversally supported, so my code has to fall back on the older but also notuniversally supported getUserMedia API . I was also unable to get the media streaming code to work properly on mobiledevices. Conversely, the AppCache API is deprecated but its replacement, Service Workers , is not widely supported. Developers have to make such compromises every day;weighing established but deprecated functions against the latest bleeding edgecode that doesn’t have wide browser support.The finished demo app uses all of the above technologies in a single-page webapp that can be deployed to IBM Bluemix . Once you visit the page, it should be cached by your browser – try turningoff your wifi and revisiting the page:HOW DOES IT WORK?The web page contains a video tag, which the JavaScript uses to render areal-time feed of your machine’s webcam. The first time you open the website,you should be asked for permission for the app to access your webcam’s feed.There is also an invisible “canvas” control in the HTML markup, which takes asnapshot of the image every 0.5s. The data in the canvas goes to the QR-codeparsing library, which returns some data if it finds a QR code on the canvasimage.The QR-code is parsed and turned into a JSON object:{ ""version"": ""3.0"", ""fn"": ""Glynn Bird"", ""org"": ""IBM"", ""title"": ""Developer Advocate"", ""adr"": "",,1 The Square,Bristol,,BS1 6DG,UK"", ""tel"": ""01179295012"", ""email"": ""glynn.bird@uk.ibm.com"", ""url"": ""www.glynnbird.com"", ""ts"": 1461074275541, ""date"": ""2016-04-19T13:57:55.541Z""}which is saved to a PouchDB database using the db.post function call.Below the real-time video feed, is a table of previously saved cards presentedin “newest-first” order. This is achieved by querying the PouchDB database using a Map/Reduce index ordered on the ts (timestamp) value we created inside the JSON object.SYNCING TO CLOUDANTIn PouchDB, syncing to a remote CouchDB or Cloudant database is a simple ascalling the replicateTo function: db.replicate.to(remoteDB) .on(""change"", function(info) { // something changed }) .on(""complete"", function(info) { // all done }) .on(""error"", function(err) { // something went wrong });The variable remoteDB contains a URL of the remote database in the form: https://username:password@myhostname.cloudant.com/mydatabaseCONCLUSIONSCreating offline-first applications is in some ways easier than creatingtraditional client-server applications. 
Your database is always available because it resides on the same device as the browser, making for fast performance and 100% uptime. The hard part—getting data from the client to the server and vice versa—is handled for you by PouchDB/CouchDB/Cloudant replication, which requires only a single function call to initiate the process. Allowing webpages to render and function without a network lets web apps go places they couldn't normally go to: * capturing health data in developing countries * recording IoT data from remote sites * collecting information when the network is down or unusably slow Combining PouchDB with Cloudant makes it easy to create such applications, without getting into native application development. LINKS * Source code – https://github.com/ibm-cds-labs/badgescanner * Demo – https://badgescanner.mybluemix.net/ Tagged: cloudant / Offline First / PouchDB","Build a data collection app that captures and stores QR Code data, even when your network is unavailable.",Offline-first QR-code Badge Scanner,Live,147 383,"SEARCH SLACK WITH IBM GRAPH ptitzler / August 22, 2016 If you use Slack, you know it can be hard to find information in the torrent of messages that flow through your account. Our developer advocacy team is part of an enormous Slack account. When I have a question, it's hard to identify appropriate channels or members to ask. Fun fact: We currently have about 3,000 public channels, discussing a variety of topics, including cats! THE PROBLEM Let's say I went to our team's Slack account looking for information about the Cloudant schema discovery process, which is used to build and populate a dashDB data warehouse from IBM's Cloudant NoSQL database. I could: * Find channels that contain one or more relevant keywords in the name or purpose. Browsing and joining channels that include the key words I seek, like Cloudant (20+ hits), schema (2), and discovery (3), is time-consuming and may not help. * Use the built-in search features (global search, in-channel search, by-user search) to find messages that include Cloudant schema discovery. Results usually vary between no hits and a gazillion, depending on the quality of your exact search term(s) and the number of Slack messages in the system.
This never seems to work well for me.Ask people where or whom to ask. Not very efficient. THE SOLUTION USE A GRAPH DATABASE TO EXPLORE RELATIONSHIPS With hundreds of people exchanging thousands of messages daily, chances are good that the information (or contacts) you need can be automatically derived from the messages that were exchanged between users. A graph database is the perfect place to load and analyze this data. A graph is comprised of vertices (nodes) and edges (relationships). In our scenario, Slack users, channels, and keywords are vertices . Relationship between vertices, like user-to-channel, user-to-user, and user-to-keyword are Edges . I built a graph database prototype solution that analyzes these relationships to find answers to common questions. The solution uses a custom slash command as the “public” interface in Slack, a service to process the request and IBM Graph as the back-end database. HOW IT WORKS If you want to find info in Slack using my solution, you first enter the custom slash command /about followed by the search term. So to find info on Cloudant , you’d enter: /about cloudant . The service queries the graph database and returns the results to Slack for display. Immediately you see the people and channels containing that term. Retrieve information about channels or users by entering /about #nosql and /about @claudia , respectively. BUILDING A SLACK TEAM GRAPH To create a graph for a team representing users, channels, and keywords we: 1. Generate social and keyword statistics from the Slack messages. Batch scripts collect the data, operating on exported team message archives. We use Watson’s AlchemyAPI to extract keywords and user and channel references (like @betty and #cloudant-sdp ) to collect social stats. We’ve really just scratched the surface … Additional information could be used to improve result quality. For example, channels frequented primarily by bots (like #cloudant-devops ) might be ranked lower than channels with heavy user activity ( #cloudant-help ). 2. Build a graph model based on these statistics. The model is a logical representation of the Slack team graph, representing users, channels, keywords, and their relationships. The sample messages shown in the beginning of the blog post, might be represented in the model as follows: Once all relevant information has been added to the graph model, we can load it into IBM Graph. A graph model can be translated on the fly to Gremlin or input via bulk input APIs , so we can create many vertices and edges in the database with a relatively small number of requests. 3. Load the graph model into IBM Graph. We translate the graph model to Gremlin scripts and run those to create the vertices and edges. Once all objects are created we can use the IBM Graph web console in Bluemix to explore the Slack team graph by running traversal steps.For example, to inspect the Slack team graph, open the Query tab and enter Gremlin queries, like: def g=graph.traversal(); g.V().has(""isUser"", true).count(); def g=graph.traversal(); g.V().has(""isChannel"", true).count(); def g=graph.traversal(); g.V().has(""iskeyword"", true).count(); to count users, channels, or keywords: Here’s the big picture of how we create the graph: HOW SLACK USERS ACCESS THE GRAPH To provide users easy access to the graph (within Slack) we’ve created a simple service called about , implemented in NodeJS. 
This service extracts the query details (channel name, user name, or keyword) from the Slack request, connects to IBM Graph and runs predefined graph traversals using the IBM Graph client library (hat tip to Mike Elsmore). The results are visible only to the user that invoked the slash command. Sound interesting? Ready to explore your Slack Graph? Start here. Tagged: AlchemyAPI / Bluemix / graph / slack","Search your Slack account using an IBM Graph database and Watson's AlchemyAPI.",Search Slack with IBM Graph,Live,148 385,"See how easy it is to unlock your data for use in mobile and web applications, or for more flexible analysis and reporting. Bluemix Secure Gateway service lets you move data from on-premises to the cloud in a secure manner. This is a multi-part tutorial which shows how to set up a gateway and then build an app on top of it. Here, in Part 1, we'll cover: Lots of enterprises have valuable data they need to protect. To keep sensitive data secure, databases are often stored on-premises within an organization's physical location, where staff can protect it more easily. But more and more, organizations also want to host data in the cloud for easy availability and integration with analytics and mobile or web apps. They're looking to take data out of their system of record and open it to one or more systems of engagement. Secure Gateway lets you safely connect to an on-premises database. It works by creating a secure tunnel through which you can access protected data. The gateway encrypts and authenticates user connections, to prohibit unauthorized access. It's a way to open your on-premises data to the cloud and enjoy the flexibility, security, and scalability that it offers. One gateway can connect to many on-premises data sources. In this tutorial, we're using Bluemix, IBM's cloud platform, to create the gateway. Here's a simplified version of what we're doing here in Part 1: We'll create a new Secure Gateway on Bluemix, which generates a gateway ID. We'll use that ID to start the gateway client in our on-premises network. Optional, but smart: You can add additional security by enforcing the use of a security token when starting the client. Create one or more destinations (data sources) to your on-prem database servers.
Each destination will have its own port on the Bluemix server.Test the connection by accessing your data from your a browser or a bluemix app, through the url given for each destination.Here's how the different pieces connect together.In this tutorial, we’ll set up a secure gateway for access a sample Apache CouchDBTM database. The point of using CouchDB is to verify that the Secure Gateway instance works. You can replace it with any database of your choice to achieve the same results.Docker Engine is a lightweight runtime and packaging tool for apps. Docker works best on Linux OS. If you want to use Docker on Mac or Windows, just install the helper app, Boot2Docker. You’ll find all the details and instructions at https://docs.docker.com/installation/#installation. Just choose your operating system and follow the instructions.Now we’re ready to set up the gateway.Go to the Bluemix site: https://console.ng.bluemix.net/If you’re new to Bluemix, you can sign up for a free trial.Scroll down to Integration and click Secure Gateway.Tip: Most Bluemix services run entirely on the cloud. Secure Gateway is the rare exception to this rule, since its very purpose is to securely connect to on-prem data sources. So, it requires both cloud-platform-side and on-premises processes.On the upper right of the screen, click the APP dropdown and choose Leave unbound.Note: If you haven’t yet installed the Docker client, you must go do so now (see previous section).Enter any name you want for the gateway.Under How would you like to connect this gateway? choose Docker.Copy the text and, if you’re on Mac or Windows, add additional text:If you’re on Linux, this command works fine as-is. But for Mac and Windows, you need to insert the following additional text, right after docker runInsert spaces on either side. The beginning of the line should look like this:Go to your computer’s command line, paste in the text, and press Enter.Your gateway client is now connected to Bluemix.Connected! If you go back and open the gateway in Bluemix,status in the upper right corner shows as Connected.Leave your terminal command line window open. You’ll return to it in a few minutes.Next, we must set the data source endpoint. This will be the on-premises source database we want to share out to the cloud. For the purposes of this tutorial, we’ll use a simple CouchDB database.On your on-prem laptop or computer, install CouchDB.Return to Bluemix and open your open the gateway. Under Create Destinations Enter a name for the connection. Then enter the IP address and port of the on-prem machine where your couchDB database resides and click the +plus button on the far right of the line (use 127.0.0.1 if CouchDB is installed on the current laptop)If you're on Windows or Mac, configure Boot2Docker to provide access to the data.On Windows and Mac, you must allow access through multiple containers. To do so, open a new instance of Boot2Docker and run the following command--inserting your own IP and port information. (If couchDB is running on your local laptop, you can use 127.0.0.1 for the host and 5984 for the port, which are the default settings.)Now you'll see some results. Follow these steps to view your local couchDB data from outside your network.On a laptop or machine outside your on-premise network, open a browser and sign in to Bluemix.Locate the secure gateway connection you created and click its i information button.Open another browser window and paste the string into the address bar. 
At the end of the string, type /_utils so the address looks like this:You'll see your couchDB dashboard (Futon app) appear. That's it! Your database is now accessible from outside your on-premises network!You saw it happen, and so did Bluemix. In Bluemix, return to or open the gateway. The chart shows a spike in traffic.Now you know how create a secure gateway that opens your on-prem data to the cloud. You can try these same steps with MYSQL, DB2, MongoDB, or any other databases you use on-premises.There are 2 types of security to consider. You can:* Require a security token when starting the gateway client. This is useful if you want to control who can start the gateway client. To do so, when you add the gateway, turn on the Enforce Security Token on Client checkbox.Once you do, you see the security token in Gateway details (beside the key icon) for use when starting the gateway on the client:* (Advanced) Extend TLS encryption between the gateway client and your on-prem data source. To implement, click the Enable client TLS checkbox located in the Advanced section of the destination configuration. Optionally, you can upload a certificate file (.pem extension). Note: You do not have to do this step if the certificate is self-signed....for additional parts of this tutorial which will show you how to build an app that leverages the secure gateway. After that, we'll learn how to include data sets from multiple sources (cloud-based and local) for combination and analysis.© ""Apache"", ""CouchDB"", ""Apache CouchDB"" and the CouchDB logo are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.","See how easy it is to unlock your data for use in mobile and web applications, or for more flexible analysis and reporting. Bluemix Secure Gateway service lets you move data from on-premises to the cloud in a secure manner. ",ibm-cds-labs/hybrid-cloud-tutorial,Live,149 392,"Compose The Compose logo Articles Sign in Free 30-day trialCAMPUS DISCOUNTS - MAKING THE MOST OF COMPOSE Published May 3, 2017 case study mongodb elasticsearch Campus Discounts - Making the Most of ComposeCampus Discounts uses several Compose-hosted databases including MySQL, MongoDB, Redis, Elasticsearch and RabbitMQ to power their social media platform. Recently they started exploring IBM Watson to add cognitive features to the app. We sat down with founder and CTO Don Omondi to hear their story. As a student in Kenya, Don Omondi had difficulty getting information when he was looking to buy a cell phone. “I had to travel more than 22 miles to the nearest town to start window shopping for a phone.” He knew he wasn’t the only one struggling with buying the necessities. “Maybe I could create a platform which will make it easy for students like me to find and buy things easily and from sellers nearby,” Don told us. The opportunity came when IBM held their SmartCamp competition in Nairobi, where Don was among the finalists. Shortly thereafter, he founded Campus Discounts to realize his dream. Campus Discounts is a social network where students find and recommend discounts posted by vendors near their campuses. Businesses create pages and post discounts on the campus site. Students can then view their campus page and find discounts nearby. After a free signup, students can select product categories of interest and also connect to fellow students through the buddy system. 
Students can also flag bargains and notify their friends easily via recommendations which make up a users’ news feed and timelines. Businesses who are interested in listing their offerings can tag up to 3 locations which will make the discount show in all campuses within a default 10 km radius or they can target a wider geography for an extra fee. They can also get analytics like traffic flow, behavioral trends etc., on the same platform. Localization of the platform (language, currency) is done automatically. At the moment, Campus Discounts doesn’t have any peer-to-peer sales model but that’s on their roadmap, including transaction processing and other e-commerce features. Powering the data layer of the platform are six databases. MySQL is used for ‘primary’ data such as users, discounts, business pages, apps, and sessions. Redis is used to cache this data for redundancy. MongoDB is used for storing ‘secondary’ data. This data is derived from actions on primary data such as likes, comments, follows, ratings, reviews, friendships, etc. Don likes MongoDB because “We can store and retrieve all these little pieces of data easily - they don’t have to be related with one another.” The majority of the site’s user-centric features are handled by Elasticsearch. “The reason we use Elasticsearch is for its power of geographic qualities, scoring and sorting of data and flexible search capabilities,” said Don. Discounts have a geo shape field mapping while campuses have a geo point field mapping which, for example, allows them to do a query in Elasticsesarch for any discounts belonging to specific categories, with at least 5 likes, that have the word ‘Samsung’ in their description and within a given radius (e.g., 10 kilometers). “Elasticsearch makes these kinds of queries very easy to implement.” The fifth database Campus Discounts uses is JanusGraph (currently not hosted on Compose). It’s a highly scalable graph database which is originally a fork of the popular open source project Titan. Don uses this for graphing relationships between registered and non-registered users for social invites as well as to suggest new friendships on the platform based on their interests, what businesses they are following, and so forth. This also makes it easier for businesses to provide targeted discounts to student segments. Finally, Campus Discounts uses Compose for RabbitMQ, a popular message broker to synchronize, track, route, and queue tasks that need to be processed later. “All our secondary data is persisted asynchronously, developers who tap into our API can activate webhooks to know when it’s done” Running on top of the databases is PHP Symfony (a collection of reusable PHP components) for the backend and Ember.js plus Node.js on the front end. Why PHP? According to Don, “A lot of people hate on PHP, but it has an unrivaled community and library support which can be priceless for certain use cases. For example, Symfony comes with Doctrine, a mature data persistence library for ORM, ODM and Cache as well as ways to integrate them all.” Don has shared his experiences working with multiple databases and application stack in several articles under Compose’s WriteStuff banner. You can find them here . Recently, Don started experimenting with IBM Watson to embed its cognitive abilities into the Campus Discounts platform. One idea that he loves and has already implemented is matching real world items with discounts posted on the platform. 
As he explains, “Wouldn't it be cool if you see this dress or bike that you really like, you take a picture, upload it to Campus Discounts, and we find nearby offers that match that image?” With Watson, he can now do that. He has also expanded this into a chatbot and voice-command-like feature. As he explains, “Using the HTML5 audio API, you can talk to our Watson Bot to find specific offers, or even to log out.” Don reckons that since Compose databases are already available on IBM Bluemix and Watson Data Platform, it's secure, easy and performant to blend the two: “Watson makes sense of it then Elasticsearch finds it”. So, why Compose? “First and foremost, as CTO of a growing startup, I have a lot on my plate right now. Compose really comes in and takes the weight off my shoulders. I can focus on developing my code; I don't need to worry about installing databases, keeping them up to date, keeping my platform live and keeping it secure. And when I do need help, I can rely on your support team for a prompt response.” Don likes the fact that as a cloud platform, Compose allows him to host databases in many locations and with different service providers like Google Cloud Platform, Amazon AWS, and IBM SoftLayer. He also likes to play around and try out new things: “Compose recently introduced ScyllaDB, which is a faster, Cassandra replacement database. So, if I wanted to test it out, I just need to spin it up with a click and try it in minutes.” With hard work and Compose's help, Campus Discounts has seen rapid growth since its inception in 2015. It's now available in over 36,500 campuses worldwide. To learn more about Campus Discounts, visit: https://campus-discounts.com/ . Arick Disilva works in Product Marketing at Compose.","Campus Discounts uses several Compose-hosted databases including MySQL, MongoDB, Redis, Elasticsearch and RabbitMQ to power their social media platform. Recently they started exploring IBM Watson to add cognitive features to the app. 
We sat down with founder and CTO Don Omondi to hear their story.",Campus Discounts - Making the Most of Compose (customer),Live,150 395,"Karlijn Willems, Data Science Journalist @DataCamp, Nov 16 -------------------------------------------------------------------------------- JUPYTER NOTEBOOK TUTORIAL: THE DEFINITIVE GUIDE Originally published at https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook Data science is about learning by doing. One of the ways you can learn how to do data science is by building your own portfolio: elaborating your own pet project, doing a quick data exploration task, participating in a data challenge, reporting on your research or advancements you have made in learning data science, creating an Extract, Transform, and Load (ETL) flow of data, … This way, you exercise the practical skills you will need when you work as a data scientist. As a web application in which you can create and share documents that contain live code, equations, visualizations as well as text, the Jupyter Notebook is one of the ideal tools to help you to gain the data science skills you need. This tutorial will cover the following topics: * A basic overview of the Jupyter Notebook App and its components, * The history of the Jupyter Project to show how it’s connected to IPython, * An overview of the three most popular ways to run your notebooks: with the help of a Python distribution, with pip or in a Docker container, * A practical introduction to the components that were covered in the first section, complete with an explanation on how to make your notebook documents magical and answers to frequently asked questions, such as “How to toggle between Python 2 and 3?”, and * The best practices and tips that will help you to make your notebook an added value to any data science project! The Jupyter Notebook: an interactive data science environment -------------------------------------------------------------------------------- WHAT IS A JUPYTER NOTEBOOK? In this case, “notebook” or “notebook documents” denote documents that contain both code and rich text elements, such as figures, links, equations, … Because of the mix of code and text elements, these documents are the ideal place to bring together an analysis description and its results, and they can be executed to perform the data analysis in real time. These documents are produced by the Jupyter Notebook App. We’ll talk about this in a bit. For now, you should just know that “Jupyter” is a loose acronym meaning Julia, Python, and R. These programming languages were the first target languages of the Jupyter application, but nowadays, the notebook technology also supports many other languages. And there you have it: the Jupyter Notebook. As you just saw, the main components of the whole environment are, on the one hand, the notebooks themselves and the application. On the other hand, you also have a notebook kernel and a notebook dashboard. Let’s look at these components in more detail. WHAT IS THE JUPYTER NOTEBOOK APP? As a client-server application, the Jupyter Notebook App allows you to edit and run your notebooks via a web browser. The application can be executed on a PC without Internet access or it can be installed on a remote server, where you can access it through the Internet.
Its two main components are the kernels and a dashboard. A kernel is a program that runs and introspects the user’s code. The Jupyter Notebook App has a kernel for Python code, but there are also kernels available for other programming languages. The dashboard of the application not only shows you the notebook documents that you have made and can reopen but can also be used to manage the kernels: you can check which ones are running and shut them down if necessary. THE HISTORY OF IPYTHON AND JUPYTER NOTEBOOKS To fully understand what the Jupyter Notebook is and what functionality it has to offer, you need to know how it originated. Let’s back up briefly to the late 1980s. Guido van Rossum begins to work on Python at the National Research Institute for Mathematics and Computer Science in the Netherlands. Wait, maybe that’s too far. Let’s go to late 2001, twenty years later. Fernando Pérez starts developing IPython. In 2005, both Robert Kern and Fernando Pérez attempted building a notebook system. Unfortunately, the prototype never became fully usable. Fast forward two years: the IPython team had kept on working, and in 2007, they made another attempt at implementing a notebook-type system. By October 2010, there was a prototype of a web notebook, and in the summer of 2011 this prototype was incorporated into IPython and released with version 0.12 on December 21, 2011. In subsequent years, the team received awards, such as the Award for the Advancement of Free Software for Fernando Pérez on March 23, 2013 and the Jolt Productivity Award, and funding from the Alfred P. Sloan Foundation, among others. Lastly, in 2014, Project Jupyter started as a spin-off project from IPython. IPython is now the name of the Python backend, which is also known as the kernel. Recently, the next generation of Jupyter Notebooks has been introduced to the community. It’s called JupyterLab. Read more about it here . After all this, you might wonder where this idea of notebooks originated or how it came to its creators. Go here to find out more. HOW TO INSTALL JUPYTER NOTEBOOK RUNNING JUPYTER NOTEBOOKS WITH THE ANACONDA PYTHON DISTRIBUTION One of the requirements here is Python, either Python 3.3 or greater or Python 2.7. The general recommendation is that you use the Anaconda distribution to install both Python and the notebook application. The advantage of Anaconda is that you have access to over 720 packages that can easily be installed with Anaconda’s conda, a package, dependency, and environment manager. You can download and follow the instructions for the installation of Anaconda here . Is something not clear? You can always read up on the Jupyter installation instructions here . RUNNING JUPYTER NOTEBOOK THE PYTHONIC WAY: PIP If you don’t want to install Anaconda, you just have to make sure that you have the latest version of pip. If you have installed Python, you will normally already have it. What you do need to do is upgrade pip, and once you have an up-to-date pip, you can get started on installing Jupyter. Go to the original article for the commands to install Jupyter via pip. RUNNING JUPYTER NOTEBOOKS IN DOCKER CONTAINERS Docker is an excellent platform to run software in containers. These containers are self-contained and isolated processes. This sounds a bit like a virtual machine, right? Not really. Go here to read an explanation on why they are different, complete with a fantastic house metaphor.
Running Jupyter in Docker ContainersYou can easily get started with Docker: turn to the original article to get started with Jupyter on Docker. HOW TO USE JUPYTER NOTEBOOKS Now that you know what you’ll be working with and you have installed it, it’s time to get started for real! GETTING STARTED WITH JUPYTER NOTEBOOKS Run the following command to open up the application: jupyter notebook Then you’ll see the application opening in the web browser on the following address: http://localhost:8888. For a complete overview of all the components of the Jupyter Notebook, complete with gifs, go to the original article . If you want to start on your notebook, go back to the main menu and click the “Python 3” option in the “Notebook” category. You will immediately see the notebook name, a menu bar, a toolbar and an empty code cell. You can immediately start with importing the necessary libraries for your code. This is one of the best practices that we will discuss in more detail later on. After, you can add, remove or edit the cells according to your needs. And don’t forget to insert explanatory text or titles and subtitles to clarify your code! That’s what makes a notebook a notebook in the end. For more tips, go here . Are you not sure what a whole notebook looks like? Hop over to the last section to discover the best ones out there! TOGGLING BETWEEN PYTHON 2 AND 3 IN JUPYTER NOTEBOOKS Up until now, working with notebooks has been quite straightforward. But what if you don’t just want to use Python 3 or 2? What if you want to change between the two? Luckily, the kernels can solve this problem for you! You can easily create a new conda environment to use different notebook kernels. Then you restart the application and the two kernels should be available to you. Very important: don’t forget to (de)activate the kernel you (don’t) need. Go to the original article to see how this works and how you can manually register your kernels. RUNNING R IN YOUR JUPYTER NOTEBOOK As the explanation of the kernels in the first section already suggested, you can also run other languages besides Python in your notebook! If you want to use R with Jupyter Notebooks but without running it inside a Docker container, you can run the following command to install the R essentials in your current environment. These “essentials” include the packages dplyr , shiny , ggplot2 , tidyr , caret and nnet . If you don't want to install the essentials in your current environment, you can use the following command to create a new environment just for the R essentials. Next, open up the notebook application to start working with R with the usual command. If you want to know about the commands to execute or extra tips to run R successfully in your Jupyter Notebook, go here . If you now want to install additional R packages to elaborate your data science project, you can either build a Conda R package or you can install the package from inside of R via install.packages or devtools::install_github (from GitHub). You just have to make sure to add new package to the correct R library used by Jupyter. Note that you can also install the IRKernel, a kernel for R, to work with R in your notebook. You can follow the installation instructions here . Note that you also have kernels to run languages such as Julia, SAS, … in your notebook. Go here for a complete list of the kernels that are available. This list also contains links to the respective pages that have installation instructions to get you started. 
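To make the kernel setup above a little more concrete, here is a small sketch (not taken from the original tutorial) of registering an extra kernel and then confirming what the notebook server can see. The environment name py27 and the display name are only examples, and the commented commands assume conda and ipykernel are available.

# In a terminal (or prefixed with ! in a notebook cell), create an environment and
# register it as a named kernel. Names here are examples only:
#   conda create -n py27 python=2.7 ipykernel
#   source activate py27
#   python -m ipykernel install --user --name py27 --display-name "Python 2.7"

# Back in Python, list the kernel specs Jupyter currently knows about,
# the programmatic equivalent of `jupyter kernelspec list`:
from jupyter_client.kernelspec import KernelSpecManager

for name, resource_dir in KernelSpecManager().find_kernel_specs().items():
    print(name, "->", resource_dir)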
Making your Jupyter Notebook Magical With Magic Commands MAKING YOUR JUPYTER NOTEBOOK MAGICAL If you want to get the most out of this, you should consider learning about the so-called “magic commands”. Also consider adding even more interactivity to your notebook so that it becomes an interactive dashboard for others. The Notebook’s Built-In Commands There are some predefined ‘magic functions’ that will make your work a lot more interactive. To see which magic commands you have available in your interpreter, you can simply run the following: %lsmagic And you’ll see a whole bunch of them appearing. You’ll probably see some magic commands that you’ll grasp, such as %save , %clear or %debug , but others will be less straightforward. If you’re looking for more information on the magic commands or on functions, you can always use the ? operator. Note that there is a difference between using % (line magics) and %% (cell magics). To know more about this and other useful magic commands that you can use, go here . You can also use magics to mix languages in your notebook without setting up extra kernels: there is rmagic to run R code, SQL for RDBMS or Relational Database Management System access and cythonmagic for interactive work with Cython, among others. But there is so much more here ! Interactive Notebooks As Dashboards: Widgets The magic commands already do a lot to make your workflow with notebooks agreeable, but you can also take additional steps to make your notebook an interactive place for others by adding widgets to it! This example was taken from a wonderful tutorial on building interactive dashboards in Jupyter, which you can find on this page . SHARE YOUR JUPYTER NOTEBOOKS In practice, you might want to share your notebooks with colleagues or friends to show them what you have been up to or as a data science portfolio for future employers. However, the notebook documents are JSON documents that contain text, source code, rich media output, and metadata. Each segment of the document is stored in a cell. Ideally, you don’t want to go around and share JSON files. That’s why you want to find and use other ways to share your notebook documents with others. When you create a notebook, you will see a button in the menu bar that says “File”. When you click this, you see that Jupyter gives you the option to download your notebook as an HTML, PDF, Markdown or reStructuredText, or a Python script or a Notebook file. You can use the nbconvert command to convert your notebook document file to another static format, such as HTML, PDF, LaTeX, Markdown, reStructuredText, ... But don't forget to install nbconvert first if you don't have it yet! Then, you can run something like the following command to convert your notebooks: jupyter nbconvert --to html Untitled4.ipynb With nbconvert , you can execute an entire notebook non-interactively, saving it in place or to a variety of other formats. The fact that you can do this makes notebooks a powerful tool for ETL and for reporting. For reporting, you just make sure to schedule a run of the notebook every so many days, weeks or months; for an ETL pipeline, you can make use of the magic commands in your notebook in combination with some type of scheduling. Besides these options, you could also consider the following options . JUPYTER NOTEBOOKS IN PRACTICE This all is very interesting when you’re working alone on a data science project. But most times, you’re not alone.
You might have some friends look at your code or you’ll need your colleagues to contribute to your notebook. How should you actually use these notebooks in practice when you’re working in a team? The following tips will help you to effectively and efficiently use notebooks on your data science project. TIPS TO EFFECTIVELY AND EFFICIENTLY USE YOUR JUPYTER NOTEBOOKS Using these notebooks doesn’t mean that you don’t need to follow the coding practices that you would usually apply. You probably already know the drill, but these principles include the following: * Try to provide comments and documentation to your code. They might be a great help to others! * Also consider a consistent naming scheme, code grouping, limit your line length, … * Don’t be afraid to refactor when or if necessary In addition to these general best practices for programming, you could also consider the following tips to make your notebooks the best source for other users to learn: * Don’t forget to name your notebook documents! * Try to keep the cells of your notebook simple: don’t exceed the width of your cell and make sure that you don’t put too many related functions in one cell. * If possible, import your packages in the first code cell of your notebook, and * [More tips here ] JUPYTER NOTEBOOKS FOR DATA SCIENCE TEAMS: BEST PRACTICES Jonathan Whitmore wrote in his article some practices for using notebooks for data science and specifically addresses the fact that working with the notebook on data science problems in a team can prove to be quite a challenge. That is why Jonathan suggests some best practices: * Use two types of notebooks for a data science project, namely, a lab notebook and a deliverable notebook. The difference between the two (besides the obvious that you can infer from the names that are given to the notebooks) is the fact that individuals control the lab notebook, while the deliverable notebook is controlled by the whole data science team, * Use some type of versioning control (Git, Github, …). Don’t forget to commit also the HTML file if your version control system lacks rendering capabilities, and * Use explicit rules on the naming of your documents. LEARN FROM THE BEST NOTEBOOKS This section is meant to give you a short list with some of the best notebooks that are out there so that you can get started on learning from these examples . You will find that many people regularly compose and have composed lists with interesting notebooks. Don’t miss this gallery of interesting IPython notebooks or this KD Nuggets article. -------------------------------------------------------------------------------- Originally published at www.datacamp.com . Data Science Python Data Mining Machine Learning R 3 Blocked Unblock Follow FollowingKARLIJN WILLEMS Data Science Journalist @DataCamp","Data science is about learning by doing. One of the ways you can learn how to do data science is by building your own portfolio: elaborating your own pet project, doing a quick data exploration task…",Jupyter Notebook Tutorial,Live,151 396,"Homepage Follow Sign in Get started * Home * About Insight * Data Science * Data Engineering * Health Data * AI * Emmanuel Ameisen Blocked Unblock Follow Following Program Director at Insight AI @EmmanuelAmeisen Jan 24 -------------------------------------------------------------------------------- HOW TO SOLVE 90% OF NLP PROBLEMS: A STEP-BY-STEP GUIDE USING MACHINE LEARNING TO UNDERSTAND AND LEVERAGE TEXT. 
How you can apply the 5 W’s and H to Text Data!TEXT DATA IS EVERYWHERE Whether you are an established company or working to launch a new service, you can always leverage text data to validate, improve, and expand the functionalities of your product. The science of extracting meaning and learning from text data is an active topic of research called Natural Language Processing (NLP). NLP produces new and exciting results on a daily basis, and is a very large field. However, having worked with hundreds of companies, the Insight team has seen a few key practical applications come up much more frequently than any other: * Identifying different cohorts of users/customers (e.g. predicting churn, lifetime value, product preferences) * Accurately detecting and extracting different categories of feedback (positive and negative reviews/opinions, mentions of particular attributes such as clothing size/fit…) * Classifying text according to intent (e.g. request for basic help, urgent problem) While many NLP papers and tutorials exist online, we have found it hard to find guidelines and tips on how to approach these problems efficiently from the ground up. HOW THIS ARTICLE CAN HELP After leading hundreds of projects a year and gaining advice from top teams all over the United States, we wrote this post to explain how to build Machine Learning solutions to solve problems like the ones mentioned above. We’ll begin with the simplest method that could work, and then move on to more nuanced solutions, such as feature engineering, word vectors, and deep learning. After reading this article, you’ll know how to: * Gather, prepare and inspect data * Build simple models to start, and transition to deep learning if necessary * Interpret and understand your models, to make sure you are actually capturing information and not noise We wrote this post as a step-by-step guide; it can also serve as a high level overview of highly effective standard approaches. -------------------------------------------------------------------------------- This post is accompanied by an interactive notebook demonstrating and applying all these techniques. Feel free to run the code and follow along! STEP 1: GATHER YOUR DATA EXAMPLE DATA SOURCES Every Machine Learning problem starts with data, such as a list of emails, posts, or tweets. Common sources of textual information include: * Product reviews (on Amazon, Yelp, and various App Stores) * User-generated content (Tweets, Facebook posts, StackOverflow questions) * Troubleshooting (customer requests, support tickets, chat logs) “Disasters on Social Media” dataset For this post, we will use a dataset generously provided by CrowdFlower , called “Disasters on Social Media”, where: Contributors looked at over 10,000 tweets culled with a variety of searches like “ablaze”, “quarantine”, and “pandemonium”, then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous).Our task will be to detect which tweets are about a disastrous event as opposed to an irrelevant topic such as a movie. Why? A potential application would be to exclusively notify law enforcement officials about urgent emergencies while ignoring reviews of the most recent Adam Sandler film. A particular challenge with this task is that both classes contain the same search terms used to find the tweets, so we will have to use subtler differences to distinguish between them. 
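As a concrete starting point, loading and labeling the dataset might look roughly like the sketch below. This is an illustration rather than the post's companion notebook, and the file name and column names are assumptions about the CrowdFlower export.

import pandas as pd

# File and column names below are assumptions for illustration.
tweets = pd.read_csv("socialmedia_disaster_tweets.csv", encoding="latin-1")

# Keep the tweet text and the human label, and map the label to 0/1.
tweets = tweets[["text", "choose_one"]].dropna()
tweets["disaster"] = (tweets["choose_one"] == "Relevant").astype(int)

print(tweets["disaster"].value_counts())   # roughly how many tweets fall in each class
print(tweets.sample(3))                    # eyeball a few raw tweets before cleaning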
In the rest of this post, we will refer to tweets that are about disasters as “ disaster ”, and tweets about anything else as “ irrelevant ”. LABELS We have labeled data and so we know which tweets belong to which categories. As Richard Socher outlines below, it is usually faster, simpler, and cheaper to find and label enough data to train a model on, rather than trying to optimize a complex unsupervised method. Richard Socher’s pro-tipSTEP 2: CLEAN YOUR DATA The number one rule we follow is: “Your model will only ever be as good as your data.”One of the key skills of a data scientist is knowing whether the next step should be working on the model or the data. A good rule of thumb is to look at the data first and then clean it up. A clean dataset will allow a model to learn meaningful features and not overfit on irrelevant noise. Here is a checklist to use to clean your data: (see the code for more details): 1. Remove all irrelevant characters such as any non alphanumeric characters 2. Tokenize your text by separating it into individual words 3. Remove words that are not relevant, such as “@” twitter mentions or urls 4. Convert all characters to lowercase, in order to treat words such as “hello”, “Hello”, and “HELLO” the same 5. Consider combining misspelled or alternately spelled words to a single representation (e.g. “cool”/”kewl”/”cooool”) 6. Consider lemmatization (reduce words such as “am”, “are”, and “is” to a common form such as “be”) After following these steps and checking for additional errors, we can start using the clean, labelled data to train models! STEP 3: FIND A GOOD DATA REPRESENTATION Machine Learning models take numerical values as input. Models working on images, for example, take in a matrix representing the intensity of each pixel in each color channel. A smiling face represented as a matrix of numbers.Our dataset is a list of sentences, so in order for our algorithm to extract patterns from the data, we first need to find a way to represent it in a way that our algorithm can understand, i.e. as a list of numbers. ONE-HOT ENCODING (BAG OF WORDS) A natural way to represent text for computers is to encode each character individually as a number ( ASCII for example). If we were to feed this simple representation into a classifier, it would have to learn the structure of words from scratch based only on our data, which is impossible for most datasets. We need to use a higher level approach. For example, we can build a vocabulary of all the unique words in our dataset, and associate a unique index to each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary. At each index in this list, we mark how many times the given word appears in our sentence. This is called a Bag of Words model , since it is a representation that completely ignores the order of words in our sentence. This is illustrated below. Representing sentences as a Bag of Words. Sentences on the left, representation on the right. Each index in the vectors represent one particular word.VISUALIZING THE EMBEDDINGS We have around 20,000 words in our vocabulary in the “Disasters of Social Media” example, which means that every sentence will be represented as a vector of length 20,000. The vector will contain mostly 0s because each sentence contains only a very small subset of our vocabulary. In order to see whether our embeddings are capturing information that is relevant to our problem (i.e. 
whether the tweets are about disasters or not), it is a good idea to visualize them and see if the classes look well separated. Since vocabularies are usually very large and visualizing data in 20,000 dimensions is impossible, techniques like PCA will help project the data down to two dimensions. This is plotted below. Visualizing Bag of Words embeddings. The two classes do not look very well separated, which could be a feature of our embeddings or simply of our dimensionality reduction. In order to see whether the Bag of Words features are of any use, we can train a classifier based on them. STEP 4: CLASSIFICATION When first approaching a problem, a general best practice is to start with the simplest tool that could solve the job. When it comes to classifying data, a common favorite for its versatility and explainability is Logistic Regression . It is very simple to train and the results are interpretable as you can easily extract the most important coefficients from the model. We split our data into a training set used to fit our model and a test set to see how well it generalizes to unseen data. After training, we get an accuracy of 75.4%. Not too shabby! Guessing the most frequent class (“irrelevant”) would give us only 57%. However, even if 75% accuracy was good enough for our needs, we should never ship a model without trying to understand it. STEP 5: INSPECTION CONFUSION MATRIX A first step is to understand the types of errors our model makes, and which kind of errors are least desirable. In our example, false positives are classifying an irrelevant tweet as a disaster, and false negatives are classifying a disaster as an irrelevant tweet. If the priority is to react to every potential event, we would want to lower our false negatives. If we are constrained in resources however, we might prioritize a lower false positive rate to reduce false alarms. A good way to visualize this information is using a Confusion Matrix , which compares the predictions our model makes with the true label. Ideally, the matrix would be a diagonal line from top left to bottom right (our predictions match the truth perfectly). Confusion Matrix (Green is a high proportion, blue is low) Our classifier creates more false negatives than false positives (proportionally). In other words, our model’s most common error is inaccurately classifying disasters as irrelevant. If false positives represent a high cost for law enforcement, this could be a good bias for our classifier to have. EXPLAINING AND INTERPRETING OUR MODEL To validate our model and interpret its predictions, it is important to look at which words it is using to make decisions. If our data is biased, our classifier will make accurate predictions in the sample data, but the model would not generalize well in the real world. Here we plot the most important words for both the disaster and irrelevant class. Plotting word importance is simple with Bag of Words and Logistic Regression, since we can just extract and rank the coefficients that the model used for its predictions. Bag of Words: Word importance Our classifier correctly picks up on some patterns (hiroshima, massacre), but clearly seems to be overfitting on some meaningless terms (heyoo, x1392). Right now, our Bag of Words model is dealing with a huge vocabulary of different words and treating all words equally . However, some of these words are very frequent, and are only contributing noise to our predictions.
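Before moving on, here is a compact sketch of the pipeline so far: cleaning, Bag of Words, Logistic Regression, and a confusion matrix. It is an illustration with scikit-learn, reusing the tweets dataframe from the loading sketch above, not the exact code from the accompanying notebook.

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def clean_text(text):
    """Minimal version of the checklist above: strip URLs and mentions, drop odd characters, lowercase."""
    text = re.sub(r"http\S+|@\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)
    return text.lower()

corpus = tweets["text"].map(clean_text)    # `tweets` as loaded in the earlier sketch
labels = tweets["disaster"]

X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=40)

vectorizer = CountVectorizer()             # Bag of Words: one column per vocabulary word
bow_train = vectorizer.fit_transform(X_train)
bow_test = vectorizer.transform(X_test)

clf = LogisticRegression()
clf.fit(bow_train, y_train)
predictions = clf.predict(bow_test)

print("accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))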
Next, we will try a way to represent sentences that can account for the frequency of words, to see if we can pick up more signal from our data. STEP 6: ACCOUNTING FOR VOCABULARY STRUCTURE TF-IDF In order to help our model focus more on meaningful words, we can use a TF-IDF score (Term Frequency, Inverse Document Frequency) on top of our Bag of Words model. TF-IDF weighs words by how rare they are in our dataset, discounting words that are too frequent and just add to the noise. Here is the PCA projection of our new embeddings. Visualizing TF-IDF embeddings.We can see above that there is a clearer distinction between the two colors. This should make it easier for our classifier to separate both groups. Let’s see if this leads to better performance. Training another Logistic Regression on our new embeddings, we get an accuracy of 76.2%. A very slight improvement. Has our model has started picking up on more important words? If we are getting a better result while preventing our model from “cheating” then we can truly consider this model an upgrade. TF-IDF: Word importanceThe words it picked up look much more relevant! Although our metrics on our test set only increased slightly, we have much more confidence in the terms our model is using, and thus would feel more comfortable deploying it in a system that would interact with customers. STEP 7: LEVERAGING SEMANTICS WORD2VEC Our latest model managed to pick up on high signal words. However, it is very likely that if we deploy this model, we will encounter words that we have not seen in our training set before. The previous model will not be able to accurately classify these tweets, even if it has seen very similar words during training . To solve this problem, we need to capture the semantic meaning of words , meaning we need to understand that words like ‘good’ and ‘positive’ are closer than ‘apricot’ and ‘continent.’ The tool we will use to help us capture meaning is called Word2Vec. Using pre-trained words Word2Vec is a technique to find continuous embeddings for words. It learns from reading massive amounts of text and memorizing which words tend to appear in similar contexts. After being trained on enough data, it generates a 300-dimension vector for each word in a vocabulary, with words of similar meaning being closer to each other. The authors of the paper open sourced a model that was pre-trained on a very large corpus which we can leverage to include some knowledge of semantic meaning into our model. The pre-trained vectors can be found in the repository associated with this post. SENTENCE LEVEL REPRESENTATION A quick way to get a sentence embedding for our classifier is to average Word2Vec scores of all words in our sentence. This is a Bag of Words approach just like before, but this time we only lose the syntax of our sentence, while keeping some semantic information. Word2Vec sentence embeddingHere is a visualization of our new embeddings using previous techniques: Visualizing Word2Vec embeddings.The two groups of colors look even more separated here, our new embeddings should help our classifier find the separation between both classes. After training the same model a third time (a Logistic Regression), we get an accuracy score of 77.7% , our best result yet! Time to inspect our model. THE COMPLEXITY/EXPLAINABILITY TRADE-OFF Since our embeddings are not represented as a vector with one dimension per word as in our previous models, it’s harder to see which words are the most relevant to our classification. 
While we still have access to the coefficients of our Logistic Regression, they relate to the 300 dimensions of our embeddings rather than the indices of words. For such a low gain in accuracy, losing all explainability seems like a harsh trade-off. However, with more complex models we can leverage black box explainers such as LIME in order to get some insight into how our classifier works. LIME LIME is available on Github through an open-sourced package. A black-box explainer allows users to explain the decisions of any classifier on one particular example by perturbing the input (in our case removing words from the sentence) and seeing how the prediction changes. Let’s see a couple explanations for sentences from our dataset. Correct disaster words are picked up to classify as “relevant”. Here, the contribution of the words to the classification seems less obvious. However, we do not have time to explore the thousands of examples in our dataset. What we’ll do instead is run LIME on a representative sample of test cases and see which words keep coming up as strong contributors. Using this approach we can get word importance scores like we had for previous models and validate our model’s predictions. Word2Vec: Word importance Looks like the model picks up highly relevant words, implying that it appears to make understandable decisions. These seem like the most relevant words out of all previous models and therefore we’re more comfortable deploying it into production. STEP 8: LEVERAGING SYNTAX USING END-TO-END APPROACHES We’ve covered quick and efficient approaches to generate compact sentence embeddings. However, by omitting the order of words, we are discarding all of the syntactic information of our sentences. If these methods do not provide sufficient results, you can utilize more complex models that take in whole sentences as input and predict labels without the need to build an intermediate representation. A common way to do that is to treat a sentence as a sequence of individual word vectors using either Word2Vec or more recent approaches such as GloVe or CoVe . This is what we will do below. A highly effective end-to-end architecture ( source ) Convolutional Neural Networks for Sentence Classification train very quickly and work well as an entry-level deep learning architecture. While Convolutional Neural Networks (CNN) are mainly known for their performance on image data, they have been providing excellent results on text-related tasks, and are usually much quicker to train than most complex NLP approaches (e.g. LSTMs and Encoder/Decoder architectures). This model preserves the order of words and learns valuable information on which sequences of words are predictive of our target classes. Contrary to previous models, it can tell the difference between “Alex eats plants” and “Plants eat Alex.” Training this model does not require much more work than previous approaches (see code for details) and gives us a model that is much better than the previous ones, getting 79.5% accuracy ! As with the models above, the next step should be to explore and explain the predictions using the methods we described to validate that it is indeed the best model to deploy to users. By now, you should feel comfortable tackling this on your own.
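For the inspection step just described, here is a minimal sketch of pointing LIME at a text classifier. For brevity it reuses the simple vectorizer-plus-classifier pipeline from the earlier sketch rather than the Word2Vec or CNN models, but the same pattern applies to any model that can produce class probabilities from raw text.

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Wrap the vectorizer and classifier from the earlier sketch so LIME can go
# straight from raw text to class probabilities.
pipeline = make_pipeline(vectorizer, clf)

explainer = LimeTextExplainer(class_names=["irrelevant", "disaster"])
example = X_test.iloc[0]

# LIME perturbs the sentence (removing words) and fits a local explanation.
explanation = explainer.explain_instance(example, pipeline.predict_proba, num_features=6)
print(example)
print(explanation.as_list())   # (word, weight) pairs: which words pushed the prediction, and how much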
FINAL NOTES Here is a quick recap of the approach we’ve successfully used: * Start with a quick and simple model * Explain its predictions * Understand the kind of mistakes it is making * Use that knowledge to inform your next step, whether that is working on your data, or a more complex model. These approaches were applied to a particular example case using models tailored towards understanding and leveraging short text such as tweets, but the ideas are widely applicable to a variety of problems . I hope this helped you, we’d love to hear your comments and questions! Feel free to comment below or reach out to @EmmanuelAmeisen here or on Twitter . -------------------------------------------------------------------------------- Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program. Are you a company working in AI and would like to get involved in the Insight AI Fellows Program? Feel free to get in touch . * Machine Learning * Business * Artificial Intelligence * Tutorial * Insight Ai One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. 1K Blocked Unblock Follow FollowingEMMANUEL AMEISEN Program Director at Insight AI @EmmanuelAmeisen FollowINSIGHT DATA Insight Fellows Program —Your bridge to careers in Data Science and Data Engineering. * 1K * * * Never miss a story from Insight Data , when you sign up for Medium. Learn more Never miss a story from Insight Data Get updates Get updates","After leading hundreds of projects a year and gaining advice from top teams all over the United States, we wrote this post to explain how to build Machine Learning solutions to solve problems like the ones mentioned above.",How to solve 90% of NLP problems,Live,152 399,"Homepage Follow Sign in Get started Tim Bohn Blocked Unblock Follow Following Sr. Solution Architect, IBM Data Science Elite team. Travel (50+ countries), Pickleball and Technology. Tweets are personal opinions. Dec 13 -------------------------------------------------------------------------------- UNFRIENDLY SKIES: PREDICTING FLIGHT CANCELLATIONS USING WEATHER DATA, PART 3 Tim Bohn and Ricardo Balduino Piarco Airport, Trinidad in the 1950s, Copyright John Hill, Creative Commons Attribution-Share Alike 4.0In Part 1 of this series, we wrote about our goal to explore a use case and use various machine learning platforms to see how we might build classification models with those platforms to predict flight cancellations. Specifically, we hoped to predict the probability of the cancellation of flights between the ten U.S. airports most affected by weather. We used historical flight data and historical weather data to make predictions for upcoming flights. In Part 2 , we started our exploration with IBM SPSS Modeler and APIs from The Weather Company . With this post, we look at IBM’s Data Science Experience (DSX). TOOLS USED IN THIS USE CASE SOLUTION DSX is a collaborative platform for data scientists, built on open-source components and IBM added value, which is available in the cloud or on-premise. In the simplest terms, DSX is a managed Apache Spark cluster with a Notebook front-end. 
By default, it includes integration with data tools like a data catalog and data refinery, Watson Machine Learning services, collaboration capability, Model Management, and the ability to automatically review a model’s performance and refresh/retrain the model with new data — and IBM is quickly adding more capabilities. Read here to see what IBM is doing lately for data science. A PYTHON NOTEBOOK SOLUTION In this case, we followed roughly the same steps we used in the SPSS model from Part 2, only this time we wrote python code in a Jupyter notebook to get similar results. We encourage readers to come up with their own solutions. Let us know. We’d love to feature your approaches in future blog posts. The first step of the iterative process is gathering and understanding the data needed to train and test our model. Since we did this work for part 2, we made use of the analysis here. Flights data — We gathered data for 2016 flights from the US Bureau of Transportation Statistics website. The website allowed us to export one month at a time, so we ended up with twelve csv (comma separated value) files. Importing those as dataframes and merging into a single dataframe was straightforward. Figure 1 — Gathering and preparing flight data in IBM DSXWeather data — With the latitude and longitude of the 10 Most Weather-Delayed U.S. Major Airports , we used one of the Weather Company’s API’s to get the historical hourly weather data for all of 2016 for each of the 10 airport locations and created a csv file that became our data set in the notebook. Combined flights and weather data — To each flight in the first data set, we added two new columns: ORIGIN and DEST, containing the respective airport codes. Next, we merged flight data and the weather data so that the resulting dataframe contained the flight data along with the weather for the corresponding Origin and Destination airports. DATA PREPARATION, MODELING, AND EVALUATION To start preparing the data, we used the combined flights and weather data from the previous step and performed some cleanup. We deleted columns of features that we didn’t need, and replaced null values in rows where flight cancellations were not related to weather conditions. Next, we took the features we discovered when we created a model using SPSS (such as flight date, hour, day of the week, origin and destination airport codes, and weather conditions) and we used them as inputs to our python model. We also chose the target feature for the model to predict: the cancellation status. We deleted the remaining features. Next, we ran OneHotEncode r on the four categorical features. One-hot encoding is a process by which categorical features get converted into a format that works better with certain algorithms, like classification and regression. Figure 2 shows the number of feature columns, expanded significantly with one hot encoding. Figure 2 — One-hot encoding expands 4 feature columns into many moreInterestingly, the flight data is heavily imbalanced. Specifically, as seen in Figure 3, of all the flights in the data set only a small percentage are actually cancelled. Figure 3 — Historical data: distribution of cancelled (1) and non-cancelled (0) flightsTo address that skewedness in the original data, we tried oversampling the minority class, under sampling the majority class, and a combination of both — but none of these approaches worked well. 
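For orientation, parts of the preparation described above (merging the monthly flight files and encoding the categorical features) might be sketched as follows. File and column names are assumptions for illustration, and pandas' get_dummies stands in here for the OneHotEncoder step mentioned in the text; this is not the notebook's exact code.

import glob
import pandas as pd

# Twelve monthly exports from the BTS site, merged into one dataframe.
monthly_files = sorted(glob.glob("flights_2016_*.csv"))
flights = pd.concat((pd.read_csv(f) for f in monthly_files), ignore_index=True)

# One-hot encode the categorical features (column names are assumptions).
categorical = flights[["ORIGIN", "DEST", "DAY_OF_WEEK", "DEP_HOUR"]].astype(str)
features = pd.get_dummies(categorical)          # expands into many 0/1 columns
target = flights["CANCELLED"]

print(features.shape)                           # far more columns than we started with
print(target.value_counts(normalize=True))      # and a heavily imbalanced target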
We then tried something called SMOTE (Synthetic Minority Over-Sampling Technique), an algorithm that provides an advanced over-sampling algorithm to deal with imbalanced datasets. Since it generates synthetic examples rather than just using replication, it helped our selected model work more effectively by mitigating the problem of overfitting that random oversampling can cause. SMOTE isn’t considered effective for high dimensional data, but that isn’t the case here. In Figure 4, we notice a balanced distribution between cancelled and non-cancelled flights after running the data through SMOTE. Figure 4 — Distribution of cancelled and non-cancelled flights after using SMOTEIt’s important to mention is that we applied SMOTE only to the training data set, not the test data set. A detailed blog by Nick Becker guided our choices in the notebook. At this point, we used the Random Forest Classifier for our model. It did the best when we used SPSS so we used again in our notebook. We have several ideas for a second iteration of our model in order to tune it, one of which is to try multiple algorithms to see how they compare. Since this use case deals with classification analysis, we used some of the common ways to evaluate the performance of the model: the confusion matrix, F1 score and ROC curve, among some others. Figures 5 and 6 show the results. Figure 5 — Test/Validation Results Figure 6 — ROC curve for training data setFigure 6 is the ROC curve from the training data set. Figure 5 shows us that the results from the training and test data sets are pretty close, which is a good indication of consistency, though we realize that with some tuning it could get better. Nevertheless, we decided that the results were still good for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to refine the model further or even to use other models to solve this use case. CONCLUSION This was a project to compare creating a model in IBM’s SPSS with IBM’s Data Science Experience . SPSS offers a no-code experience while DSX offers the best of open-source coding capability with many IBM value adds. SPSS is an amazing product and gets better with every release, adding many new capabilities. IBM’s Data Science Experience is a great platform for both the beginning and experienced data scientist. Anyone can log in and have immediate access to a managed Spark cluster with a choice of a Jupyter notebook front-end using Scala, Python or R, SPSS and visual data modeler (no coding). It offers easy collaboration with other users, including adding other data scientists who could then look over our shoulders and make suggestions. The community is active and has already contributed dozens of tutorials, data sets and notebooks . If we had added Watson Machine Learning, we could very easily have deployed and managed our model with an instant REST endpoint to call from any application. If our data was changing, we could have WML review our model periodically and retrain it with any new data if our metric (ROC Curve) value fell below a given threshold. That, along with new data cataloging and data refinery tooling added recently, make this a platform worth checking out for any data science project. SPSS has a lot, but not everything. Writing the python code in a notebook was a bit more time-consuming than what we did in SPSS, but it also gave quite a bit more flexibility and freedom. 
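As an illustration of that flexibility, the resampling and modeling step discussed above could be sketched like this, assuming the imbalanced-learn and scikit-learn packages and the features and target variables from the earlier sketch; it is a sketch, not the authors' notebook.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Balance the *training* data only, as discussed above.
# (Older imbalanced-learn releases call this method fit_sample.)
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_resampled, y_resampled)

probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC on the untouched test data:", roc_auc_score(y_test, probs))
print(confusion_matrix(y_test, model.predict(X_test)))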
We had access to everything in the python libraries, and of course, one of the benefits of python as an open-source language is the trove of helpful examples. I would say both platforms have their place, and neither can claim to be better for everything. Those doing data science for the first time will probably find SPSS an easier place to start given its drag-and-drop user interface. Those who have come out of school as programming wizards will want to write code, and DSX will give them a great way to do that without worrying about installing, configuring, and correctly integrating various product versions. RESOURCES The IBM notebook and data that form the basis for this blog are available on Github .","In Part 1 of this series, we wrote about our goal to explore a use case and use various machine learning platforms to see how we might build classification models with those platforms to predict…","Predicting Flight Cancellations Using Weather Data, Part 3",Live,153 404,"USE DASHDB WITH TABLEAU Jess Mantaro / July 17, 2015 Watch how quick and easy it is to perform analytics with dashDB and Tableau. You can also read a transcript of this video. Read the tutorial (PDF)",Watch how quick and easy it is to perform analytics with dashDB and Tableau. ,Use dashDB with Tableau,Live,154 412,"METRICS MAVEN: CALCULATING AN EXPONENTIALLY WEIGHTED MOVING AVERAGE IN POSTGRESQL Published Mar 8, 2017 metrics maven postgresql In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll walk through how and why to calculate an exponentially weighted moving average. We've covered a few different kinds of averages in this series. We had a look at mean , dug into weighted averages , showed a couple methods for calculating a simple moving average , generated a cumulative moving average in the same article , and also produced a 7-day weighted moving average . In this article we're going to add an exponentially weighted moving average to the group. We'll start by getting a basic understanding of what an exponentially weighted moving average is and why we would want to use it. EXPONENTIALLY WEIGHTED MOVING AVERAGE The exponentially weighted moving average, sometimes also just called exponential moving average, (EWMA or EMA, for short) is used for smoothing trend data like the other moving averages we've reviewed.
Similar to the weighted moving average we covered in our last article, weights are applied to the data such that dates further in the past will receive less weight (and therefore be less impactful to the result) than more recent dates. Rather than decreasing linearly, however, like we saw with the weighted moving average, the weight for an EWMA decreases exponentially for each time period further in the past. Additionally, the result of an EWMA is cumulative because it contains the previously calculated EWMA in its calculation of the current EWMA. Because of this, all the data values have some contribution in the result, though that contribution diminishes as each next period is calculated. An exponentially weighted moving average is often applied when there is a large variance in the trend data, such as for volatile stock prices. It can reduce the noise and help make the trend clearer. Let's get into our example and see how this works. OUR DATA For EWMA, we're going to use the same daily summary data from our hypothetical pet supply company that we used in the previous article on weighted moving averages . Our data table is called ""daily_orders_summary"" and looks like this: date | total_orders | total_order_items | total_order_value | average_order_items | average_order_value -------------------------------------------------------------------------------------------------------------- 2017-01-01 | 14 | 18 | 106.84 | 1.29 | 7.63 2017-01-02 | 10 | 21 | 199.79 | 2.10 | 19.98 2017-01-03 | 12 | 17 | 212.98 | 1.42 | 17.75 2017-01-04 | 12 | 15 | 100.93 | 1.25 | 8.41 2017-01-05 | 10 | 13 | 108.54 | 1.30 | 10.85 2017-01-06 | 14 | 20 | 216.78 | 1.43 | 15.48 2017-01-07 | 13 | 16 | 198.32 | 1.23 | 15.26 2017-01-08 | 10 | 12 | 124.67 | 1.20 | 12.47 2017-01-09 | 10 | 16 | 140.88 | 1.60 | 14.09 2017-01-10 | 17 | 19 | 136.98 | 1.12 | 8.06 2017-01-11 | 12 | 14 | 99.67 | 1.17 | 8.31 2017-01-12 | 11 | 15 | 163.52 | 1.36 | 14.87 2017-01-13 | 10 | 18 | 207.43 | 1.80 | 20.74 2017-01-14 | 14 | 20 | 199.68 | 1.43 | 14.26 2017-01-15 | 16 | 22 | 207.56 | 1.38 | 12.97 2017-01-16 | 14 | 19 | 176.76 | 1.36 | 12.63 2017-01-17 | 13 | 18 | 184.48 | 1.38 | 14.19 2017-01-18 | 14 | 25 | 265.98 | 1.79 | 19.00 2017-01-19 | 10 | 17 | 178.42 | 1.70 | 17.84 2017-01-20 | 19 | 24 | 139.67 | 1.26 | 7.35 2017-01-21 | 15 | 21 | 187.66 | 1.40 | 12.51 2017-01-22 | 19 | 24 | 226.98 | 1.26 | 11.95 2017-01-23 | 17 | 24 | 212.64 | 1.41 | 12.51 2017-01-24 | 16 | 21 | 187.43 | 1.31 | 11.71 2017-01-25 | 19 | 27 | 244.67 | 1.42 | 12.88 2017-01-26 | 20 | 29 | 267.44 | 1.45 | 13.37 2017-01-27 | 17 | 25 | 196.43 | 1.47 | 11.55 2017-01-28 | 21 | 28 | 234.87 | 1.33 | 11.18 2017-01-29 | 18 | 29 | 214.66 | 1.61 | 11.93 2017-01-30 | 14 | 20 | 199.68 | 1.43 | 14.26 2017-02-01 | 19 | 27 | 189.98 | 1.42 | 10.00 2017-02-02 | 22 | 31 | 274.98 | 1.41 | 12.50 2017-02-03 | 20 | 28 | 213.76 | 1.40 | 10.69 2017-02-04 | 21 | 30 | 242.78 | 1.43 | 11.56 2017-02-05 | 22 | 34 | 267.88 | 1.55 | 12.18 2017-02-06 | 19 | 24 | 209.56 | 1.26 | 11.03 2017-02-07 | 21 | 33 | 263.76 | 1.57 | 12.56 IT'S ALL ABOUT THE LAMBDA As mentioned above, the weight for EWMA decreases exponentially for each time period in the past. The further in the past, the less weight is given. To apply the weights for our data, we'll need a smoothing parameter (also called lambda ) which will act as a multiplier on the data values. This smoothing parameter will be a value between 0 and 1 and is typically 2 divided by the sum of the length of days. 
Since we'll stick with a 7-day range, our lambda would be 2 / (1 + 7) which comes out to 0.25. The formula for calculating an EWMA boils down to this:
(Current period data value * lambda) + (Previous period EWMA * (1 - lambda)) = Current period EWMA
An alternative formula which produces the same result is:
((Current period data value - Previous period EWMA) * lambda) + Previous period EWMA = Current period EWMA
Now that we know what our lambda is and we have the formula we're going to apply, it's time to run our query:
WITH recursive exponentially_weighted_moving_average (date, average_order_value, ewma, rn) AS (
  -- Initiate the ewma using the 7-day simple moving average (sma)
  SELECT rows.date, rows.average_order_value, sma.sma AS ewma, rows.rn
  FROM (
    SELECT date, average_order_value, ROW_NUMBER() OVER(ORDER BY date) rn
    FROM daily_orders_summary
  ) rows
  JOIN (
    SELECT date, ROUND(AVG(average_order_value) OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2) AS sma
    FROM daily_orders_summary
  ) sma ON sma.date = rows.date
  WHERE rows.rn = 7 -- start on the 7th day since we're using the 7-day sma

  UNION ALL

  -- Perform the ewma calculation for all the following rows
  SELECT rows.date, rows.average_order_value
       , ROUND((rows.average_order_value * 0.25) + (ewma.ewma * (1 - 0.25)), 2) AS ewma
       --, ROUND((((rows.average_order_value - ewma.ewma) * 0.25) + ewma.ewma), 2) AS ewma -- alternative formula
       , rows.rn
  FROM exponentially_weighted_moving_average ewma
  JOIN (
    SELECT date, average_order_value, ROW_NUMBER() OVER(ORDER BY date) rn
    FROM daily_orders_summary
  ) rows ON ewma.rn + 1 = rows.rn
  WHERE rows.rn <= (SELECT COUNT(*) FROM daily_orders_summary) -- upper bound on the recursion
)
-- Pull the report fields out of the CTE
SELECT date, average_order_value, ewma
FROM exponentially_weighted_moving_average;
That's a lot to take in all at once so let's break it down and learn how it works. USING A RECURSIVE CTE First, we're creating a recursive CTE ( common table expression using WITH ) called ""exponentially_weighted_moving_average"" that returns 4 field values: date, average order value, the ewma, and a row number. We're using this approach because the EWMA calculation requires the previous period's EWMA. A recursive CTE can provide the previous EWMA calculation to us for each period. Note that using a recursive CTE on a large data set is not going to be your best option. Performance will take a big dive since the query will recurse through all the data. If you have a large data set that you need to calculate EWMA for, then you should consider using the procedural language options for PostgreSQL such as PL/Python or PL/Perl. You can learn more about recursive CTEs and procedural language options in the official PostgreSQL documentation. INITIALIZING THE EWMA The first query in our WITH block is the EWMA initialization. Because the calculation requires the previous period EWMA, we have to give it something to start with. A common approach is to use the simple moving average for the length of the time period as the initial EWMA. That's what we've done here. Because we're calculating a 7-day EWMA (our lambda is based on a 7-day range), we have a sub-query called ""sma"" where we're calculating the 7-day simple moving average for the 7th day (we need 7 days to get the SMA) and using that as our EWMA starting point. If you're not familiar with the simple moving average or you just need a refresher, check out our article on basic moving averages . You could also simply initialize the EWMA with the actual data value for that date. Of course the results will be different.
Play around with what seems to work best for your data on how to initialize the EWMA since that will impact all the rest of your calculations. Most of the data in that first query comes from another sub-query called ""rows"" that employs the ROW_NUMBER() window function. We are using this sub-query to generate row numbers for each of the rows, which allows us to identify the 7th row for initialization. If you're not familiar with window functions, take a gander at our article on window functions.

RECURSING THROUGH THE DATA TO CALCULATE EWMA

The UNION ALL and the next query in the WITH block are where the recursion and the calculation of EWMA occur. We've got the same ""rows"" sub-query, but in this case, we only care about the rows following the 7th row since that's where we'll apply our EWMA calculation. We're joining the ""rows"" sub-query to our recursive CTE ""exponentially_weighted_moving_average"" (aliased as ""ewma"") on row number, where the ""ewma"" row is 1 less than the ""rows"" row. In this way we can use the previously-calculated EWMA from the ""ewma"" CTE and the current data value (average_order_value in this case) from the ""rows"" sub-query. To get the calculated EWMA for the current row, we're applying the formula in SQL as:

ROUND((rows.average_order_value * 0.25) + (ewma.ewma * (1 - 0.25)), 2) AS ewma

What we're doing here is...

* multiplying the current period data value (rows.average_order_value) by the 7-day range lambda we previously determined (0.25)
* multiplying the previous period EWMA (ewma.ewma) by 1 minus our lambda (1 - 0.25)
* adding those two values together to get the current period EWMA
* rounding to 2 decimal places (which we learned about in the Making Data Pretty article)

Note that we've also included the alternative formula in the SQL, just commented out. You can use either one.

RETURNING RESULTS

Finally we're selecting the fields we're interested in for our report from the recursive CTE. Here's what those results look like:

date       | average_order_value | ewma
----------------------------------------
2017-01-07 | 15.26 | 13.62
2017-01-08 | 12.47 | 13.33
2017-01-09 | 14.09 | 13.52
2017-01-10 | 8.06 | 12.16
2017-01-11 | 8.31 | 11.20
2017-01-12 | 14.87 | 12.12
2017-01-13 | 20.74 | 14.28
2017-01-14 | 14.26 | 14.28
2017-01-15 | 12.97 | 13.95
2017-01-16 | 12.63 | 13.62
2017-01-17 | 14.19 | 13.76
2017-01-18 | 19.00 | 15.07
2017-01-19 | 17.84 | 15.76
2017-01-20 | 7.35 | 13.66
2017-01-21 | 12.51 | 13.37
2017-01-22 | 11.95 | 13.02
2017-01-23 | 12.51 | 12.89
2017-01-24 | 11.71 | 12.60
2017-01-25 | 12.88 | 12.67
2017-01-26 | 13.37 | 12.85
2017-01-27 | 11.55 | 12.53
2017-01-28 | 11.18 | 12.19
2017-01-29 | 11.93 | 12.13
2017-01-30 | 14.26 | 12.66
2017-02-01 | 10.00 | 12.00
2017-02-02 | 12.50 | 12.13
2017-02-03 | 10.69 | 11.77
2017-02-04 | 11.56 | 11.72
2017-02-05 | 12.18 | 11.84
2017-02-06 | 11.03 | 11.64
2017-02-07 | 12.56 | 11.87

Let's look at just one date to review how the EWMA was calculated. We'll use January 21. By applying the formula for EWMA, we get:

(Current period data value * lambda) + (Previous period EWMA * (1 - lambda)) = Current period EWMA

(12.51 * 0.25) + (13.66 * (1 - 0.25)) = 13.3725

which rounds to the 13.37 shown above for January 21.

SEEING TRENDS

Now that we have our EWMA, let's plot our average order value, the simple moving average, the weighted moving average, and the EWMA together to see how the trend lines compare. We can see that the average order value by itself is pretty volatile, making it difficult to see the overall trend.
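The original post shows a comparison chart at this point. The plotting code isn't part of the article, but a rough equivalent in Python (our sketch, assuming you've exported the query results to a hypothetical ewma_results.csv with the columns shown above) might look like this; the weighted moving average from the previous article could be added as a fourth line in the same way:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the query results; column names follow the result set above.
df = pd.read_csv('ewma_results.csv', parse_dates=['date'])

# Recompute a 7-day simple moving average locally for comparison.
df['sma_7'] = df['average_order_value'].rolling(7).mean()

plt.plot(df['date'], df['average_order_value'], label='average order value')
plt.plot(df['date'], df['sma_7'], label='7-day simple moving average')
plt.plot(df['date'], df['ewma'], label='7-day EWMA')
plt.legend()
plt.show()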
The simple moving average helps smooth things out, but over- or under-corrects in some places. The weighted moving average smooths the trend out further and makes it easier to see the rise that happened until about the 3rd week of January and then the slight decline from then on. The exponentially weighted moving average follows the true data values better than the other two metrics while still smoothing the trend line.

WRAPPING UP

In this article we learned how to calculate an exponentially weighted moving average using a recursive CTE. We discussed when it's useful to apply and compared the results to the other average types we looked at in previous articles. With each metric we are better able to zero in on just how our business is performing. In our next article, we'll be taking a look at CROSSTAB again and covering some aspects that we didn't have a chance to get to in our previous article on pivoting in Postgres.

Image by: msandersmusic
Lisa Smith - keepin' it simple.","In this article, we'll walk through how and why to calculate an exponentially weighted moving average.",Metrics Maven: Calculating an Exponentially Weighted Moving Average in PostgreSQL,Live,155 419,"DATALAYER EXPOSED: JONAS HELFER & JOINS ACROSS DATABASES WITH GRAPHQL

Published Jun 26, 2017 datalayer graphql join

Want something to make Monday mornings a bit more exciting? For the next few weeks, we're bringing you a new video from this year's DataLayer Conference. Up this week is Jonas Helfer from Meteor discussing joins across databases with GraphQL.

This year, we were joined by Jonas Helfer from Meteor. More specifically, he works on Meteor's Apollo Project with last year's speaker Sashko Stubailo (you can see his talk here). Jonas' presentation is centered on joins across databases with GraphQL. It's becoming more and more common for organizations to have a backend architecture powered by multiple databases and microservices, but as the number of these databases and services grows, scaling becomes more difficult. Jonas showed how GraphQL can be leveraged to pull data from multiple databases in a unified way.
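The talk itself is built on the JavaScript GraphQL and Apollo tooling, but the idea is easy to sketch in any language. Here is a minimal, purely illustrative Python example (our sketch, using the graphene library, with two in-memory dictionaries standing in for separate databases) of a single GraphQL query resolving fields from two different backends:

import graphene

# Two stand-ins for separate databases / microservices.
USERS = {1: {'id': 1, 'name': 'Ada'}}
ORDERS = {1: [{'id': 10, 'total': 42.0}]}

class Order(graphene.ObjectType):
    id = graphene.Int()
    total = graphene.Float()

class User(graphene.ObjectType):
    id = graphene.Int()
    name = graphene.String()
    orders = graphene.List(Order)

    def resolve_orders(self, info):
        # In a real deployment this would query a second database or service.
        return [Order(**o) for o in ORDERS.get(self.id, [])]

class Query(graphene.ObjectType):
    user = graphene.Field(User, id=graphene.Int(required=True))

    def resolve_user(self, info, id):
        return User(**USERS[id])

schema = graphene.Schema(query=Query)
result = schema.execute('{ user(id: 1) { name orders { total } } }')
print(result.data)   # one response combining data from both sources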
Previous DataLayer 2017 talks:

* Charity Majors' presentation on observability
* Ross Kukulinski's presentation on the state of containers
* Antonio Chavez's presentation on why he left MongoDB

Be sure to tell us what you think using hashtag #DataLayerConf and check back next Monday for the next talk at DataLayerConf.

We're in the planning stages for DataLayer 2018 right now, so if you have an idea for a talk, start fleshing it out. We'll have a CFP, followed by a blind submission review, and then select our speakers, who we'll fly to DataLayer to present. Sounds fun, right?

Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter, and tinkering.",Jonas Helfer from Meteor discussing joins across databases with GraphQL.,DataLayer Exposed: Jonas Helfer & Joins Across Databases with GraphQL,Live,156 420,"DATA SCIENCE OF VARIABLE SELECTION: A REVIEW

Tags: Algorithms, Big Data, Feature Selection, Statistics

There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. This is an overview of some of these approaches.

By Thomas Ball, Advanced Analytics Professional.

Data scientists are always stressing over the ""best"" approach to variable selection, particularly when faced with massive amounts of information -- a frequent occurrence these days.
""Massive"" by today's standards means terabytes of data and tens, if not hundreds, of millions of features or predictors. There are many reasons for this “stress” but the reality is that a single, canonical solution does not exist. There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. Why Implement Machine Learning Algorithms From Scratch? For years, there have been rumors that Google uses all available features in building its predictive algorithms. To date however, no disclaimers, explanations or working papers have emerged that clarify and/or dispute this rumor. Not even their published patents help in the understanding. As a result, no one external to Google knows what they are doing, to the best of my knowledge. One of the biggest problems in predictive modeling is the conflation between classic hypothesis testing with careful model specification vis-a-vis pure data mining. The classically trained can get quite dogmatic about the need for ""rigor"" in model design and development. The fact is that when confronted with massive numbers of candidate predictors and multiple possible targets or dependent variables, the classic framework neither works, holds nor provides useful guidance – how does anyone develop a finite set of hypotheses with millions of predictors? Numerous recent papers delineate this dilemma from Chattopadhyay and Lipson's brilliant paper Data Smashing: Uncovering Lurking Order in Data ( available here ) who state, ""The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning."" To last year's AER paper on Prediction Policy Problems by Kleinberg, et al., (available here ) which makes the case for data mining and prediction as useful tools in economic policy making, citing instances where ""causal inference is not central, or even necessary."" The fact is that the bigger, $64,000 question is the broad shift in thinking and challenges to the classic hypothesis-testing framework implicit in, e.g., this Edge.org symposium on ""obsolete"" scientific thinking (available here ) as well as this recent article by Eric Beinhocker on the ""new economics"" (available here ) which presents some radical proposals for integrating widely disparate disciplines such as behavioral economics, complexity theory, network and portfolio theory into a platform for policy implementation and adoption. Needless to say, these discussions go far beyond merely statistical concerns and suggest that we are undergoing a fundamental shift in scientific paradigms. The shifting views are as fundamental as the distinctions between reductionistic, Occam's Razor like model-building vs Epicurus' expansive Principle of Plenitude or multiple explanations which roughly states that if several findings explain something, retain them all (see, e.g., here ). Of course, guys like Beinhocker are totally unencumbered with practical, in the trenches issues regarding applied, statistical solutions to this evolving paradigm. Wrt the nitty-gritty questions of ultra-high dimensional variable selection, there are many viable approaches to model building that leverage, e.g., Lasso, LAR, stepwise algorithms or ""elephant models” that use all of the available information. 
The reality is that, even with AWS or a supercomputer, you can't use all of the available information at the same time – there simply isn’t enough RAM to load it all in. What does this mean? Workarounds have been proposed, e.g., the NSF's Discovery in Complex or Massive Datasets: Common Statistical Themes to ""divide and conquer"" or ""bags of little jacknife"" algorithms for massive data mining, e.g., Wang, et al's paper, A Survey of Statistical Methods and Computing for Big Data (available here ) as well as Leskovec, et al's book Mining of Massive Datasets (available here ). There are now literally hundreds, if not thousands of papers that deal with various aspects of these challenges, all proposing widely differing analytic engines as their core from so-called “D Bayesian tensor models to classic, supervised logistic regression, and more. Fifteen years or so years ago, the debate largely focused on questions concerning the relative merits of hierarchical Bayesian solutions vs frequentist finite mixture models. In a paper addressing these issues, Ainslie, et al. (available here ) came to the conclusion that, in practice, the differing theoretical approaches produced largely equivalent results with the exception of problems involving sparse and/or high dimensional data -- where HB models had the advantage. Today with the advent of D&C-type workarounds, any arbitrage HB models may have historically enjoyed are rapidly being eliminated. The basic logic of these D&C-type workarounds are, by and large, extensions of Breiman's famous random forest technique which relied on bootstrapped resampling of observations and features. Breiman did his work in the late 90s on a single CPU when massive data meant a few dozen gigs and a couple of thousand features processed over a couple of thousand iterations. On today's massively parallel, multi-core platforms, it is possible to run algorithms analyzing terabytes of data containing tens of millions of features that build millions of ""RF"" mini-models in a few hours. Theoretically, it’s possible to build models using petabyes of data with these workarounds but the present IT platforms and systems won’t execute that yet – to the best of my knowledge (if any knows where this is being done and how, please feel free to share that information). There are any number of important questions coming out of all of this. One has to do with a concern over a possible loss of precision due to the approximating nature of these workarounds. This issue has been addressed by Chen and Xie in their paper, A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data (available here ) where they conclude that these approximations are indistinguishably different from ""full information"" models. A second concern which, to the best of my knowledge hasn't been adequately addressed by the literature, has to do with what is done with the results (i.e., the ""parameters"") from potentially millions of predictive mini-models once the workarounds have been rolled up and summarized. In other words, how does one execute something as simple as ""scoring"" new data with these results? Are the mini-model coefficients to be saved and stored or does one simply rerun the D&C algorithm(s) on new data? In his book, Numbers Rule Your World (available here ), Kaiser Fung describes the dilemma Netflix faced when presented with an ensemble of only 104 models handed over by the winners of their competition. 
The winners had, indeed, minimized the MSE vs all other competitors but this translated into only a several decimal place improvement in accuracy on the 5-point, Likert-type rating scale used by their movie recommender system. In addition, the IT maintenance required for this small ensemble of models cost much more than any savings seen from the ""improvement"" in model accuracy. Then there's the whole question of whether ""optimization"" is even possible with information of this magnitude. For instance, Emmanuel Derman, the physicist and financial engineer, in his autobiography My Life as a Quant suggests that optimization is an unsustainable myth, at least in financial engineering. Finally, questions concerning relative feature importance with massive numbers of features have yet to be addressed. There are no easy answers wrt questions concerning the need for variable selection and the new challenges opened up by the current, Epicurean workarounds remain to be resolved. The bottom line is that we are all data scientists now. Bio: Thomas Ball is an advanced analytics leader with Fortune 500 and start-up experience. He has led teams in management consulting, digital media, financial and health care industries. Source : Originally posted anonymously by the author to a thread on Stack Exchange's statistical Q&A site, Cross Validated . Reposted with permission. Related: * Datasets Over Algorithms * Why Implement Machine Learning Algorithms From Scratch? * Beyond One-Hot: an exploration of categorical variables -------------------------------------------------------------------------------- Previous post Next post -------------------------------------------------------------------------------- MOST POPULAR LAST 30 DAYS Most viewed 1. 7 Steps to Mastering Machine Learning With Python R vs Python for Data Science: The Winner is ... What is the Difference Between Deep Learning and “Regular” Machine Learning? TensorFlow Disappoints - Google Deep Learning falls shallow 9 Must-Have Skills You Need to Become a Data Scientist Top 10 Data Analysis Tools for Business How to Explain Machine Learning to a Software Engineer Most shared 1. What is the Difference Between Deep Learning and “Regular” Machine Learning? Data Science of Variable Selection: A Review R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results A Visual Explanation of the Back Propagation Algorithm for Neural Networks Machine Learning Key Terms, Explained How to Build Your Own Deep Learning Box Big Data Business Model Maturity Index and the Internet of Things (IoT) MORE RECENT STORIES * Predicting purchases at retail stores using HPE Vertica and Da... Top tweets, Jun 15-21: Predicting UEFA Euro2016; Visual Exp... Strata + Hadoop World, New York City, Sep 26-29 – KDnugg... Cisco 2016 Data and Analytics Conference, Sep 19-21, Chicago Machine Learning Trends and the Future of Artificial Intelligence Microsoft: Senior Software Engineer. Mining Twitter Data with Python Part 3: Term Frequencies History of Data Mining Bank of Ireland: Senior Data Scientist within the Advanced Ana... DuPont Pioneer: Data Scientist – Encirca KDnuggets 16:n22, Jun 22: Data Science Blog Contest; Free M... KDnuggets Blog Contest: Automated Data Science and Machine Lea... Data Science Career Days at Metis, NYC – June 23, SF ... 
A Review of Popular Deep Learning Models HPE Haven OnDemand Text Extraction API Cheat Sheet for Developers Standards-based Deployment of Predictive Analytics Data Science for Internet of Things course, Online or London How to Compare Apples and Oranges, Part 2 – Categorical ... Top Stories, June 13-19: A Visual Explanation of the Back Prop... Chief Data Officer Forum Insurance 2016, Sep 15, Chicago KDnuggets Home » News » 2016 » Jun » Tutorials, Overviews » Data Science of Variable Selection: A Review ( 16:n20 ) © 2016 KDnuggets. About KDnuggets Subscribe to KDnuggets News | Follow @kdnuggets | | X",There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. This is an overview of some of these approaches.,Data Science of Variable Selection,Live,157 424,"RStudio Blog * Home * Subscribe to feed D3HEATMAP: INTERACTIVE HEAT MAPS June 24, 2015 in Packages | Tags: d3 , htmlwidgets We’re pleased to announce d3heatmap , our new package for generating interactive heat maps using d3.js and htmlwidgets . Tal Galili , author of dendextend , collaborated with us on this package. d3heatmap is designed to have a familiar feature set and API for anyone who has used heatmap or heatmap.2 to create static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way. d3heatmap includes the following features: * Shows the row/column/value under the mouse cursor * Click row/column labels to highlight * Drag a rectangle over the image to zoom in * Works from the R console, in RStudio, with R Markdown , and with Shiny INSTALLATION install.packages(""d3heatmap"") EXAMPLES Here’s a very simple example (source: flowingdata ): library(d3heatmap) url <- ""http://datasets.flowingdata.com/ppg2008.csv"" nba_players <- read.csv(url, row.names = 1) d3heatmap(nba_players, scale = ""column"") You can easily customize the colors using the colors parameter. This can take an RColorBrewer palette name, a vector of colors, or a function that takes (potentially scaled) data points as input and returns colors. Let’s modify the previous example by using the ""Blues"" colorbrewer palette, and dropping the clustering and dendrograms: d3heatmap(nba_players, scale = ""column"", dendrogram = ""none"", color = ""Blues"") If you want to use discrete colors instead of continuous, you can use the col_* functions from the scales package. d3heatmap(nba_players, scale = ""column"", dendrogram = ""none"", color = scales::col_quantile(""Blues"", NULL, 5)) Thanks to integration with the dendextend package, you can customize dendrograms with cluster colors: d3heatmap(nba_players, colors = ""Blues"", scale = ""col"", dendrogram = ""row"", k_row = 3) For issue reports or feature requests, please see our GitHub repo . 
SHARE THIS: * Reddit * More * * Email * Facebook * * Print * Twitter * * LIKE THIS: Like Loading...RELATED SEARCH LINKS * Contact Us * Development @ Github * RStudio Support * RStudio Website * R-bloggers CATEGORIES * Featured * News * Packages * R Markdown * RStudio IDE * Shiny * shinyapps.io * Training * Uncategorized ARCHIVES * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * April 2015 * March 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * July 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * December 2013 * November 2013 * October 2013 * September 2013 * June 2013 * April 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * September 2012 * August 2012 * June 2012 * May 2012 * January 2012 * October 2011 * June 2011 * April 2011 * February 2011 EMAIL SUBSCRIPTION Enter your email address to subscribe to this blog and receive notifications of new posts by email. Join 19,578 other followers RStudio is an affiliated project of the Foundation for Open Access Statistics 16 COMMENTS June 24, 2015 at 10:25 pm SF99 Trying out pkg: d3heatmap_0.6.0 in Rstudio. Like it – but documentation with simple, clear examples is sparse …without clear documentation = difficult to use!. In the included doc example: x <- mtcars # [c(2:4,7),1:4] d3heatmap(x, k_row = 4, k_col = 2) what are the function args: k_row, k_col, scale and some of the other function args? Even in this Rstudio post, the example at the beginning does not work at all: QUOTE: ———————————– Here’s a very simple example (source: flowingdata): url <-"" http://datasets.flowingdata.com/ppg2008.csv" ; nba_players END QUOTE —————————- yields: ""error: nba-players not defined"" (?) IN SUMMARY: Urgently needed – documentation with clear, stepXstep examples. Please? Without it, this potentially fine pkg is an exercise in frustration… Thank you! * June 25, 2015 at 5:11 am Joe Cheng Thank you for the feedback, the first code sample that defined nba_players was truncated due to my poor WordPress skills. I’ve fixed it, so you should be able to step through each of the examples now. The k_row, k_col, scale, and all other parameters are currently documented only in the R help, i.e. ?d3heatmap::d3heatmap. Other than k_row/k_col, all the stats-related parameters are identical to heatmap and heatmap.2, if you’re familiar with those functions. Hopefully we can find the time after the useR conference next week to add more documentation. In the meantime, if you have any specific questions feel free to leave additional comments or email me at joe@rstudio.com . Thanks again! * June 25, 2015 at 10:47 am SF99 Thank you for the quickly reply, Joe! No, I was not familiar with the with the heatmap and heatmap.2 pkgs. But since d3heatmap is an evolution above the latter 2 pkgs, I’d like to suggest that it should include a _”self-contained”_ help file with clear, stepXstep examples – (so the user does not need to refer to other pkgs, in order to use d3heatmap). Again Joe – looking forward to be a frequent user of your _excellent_ d3heatmap pkg . Thanks! SF99 * June 24, 2015 at 10:45 pm Alberto Jaimes Romero Hi there, there is some missed code. I had to read the dowloaded file, easy; and transform it into a matrix. 
Greetings * June 25, 2015 at 5:13 am Joe Cheng Sorry about that, the code sample in the blog post was indeed truncated. I’ve fixed it now. June 25, 2015 at 4:48 am GD It is very beautiful indeed! How can I center the plot in an rmarkdown document? * June 25, 2015 at 5:19 am Joe Cheng There’s not an official way to center htmlwidgets in rmd documents right now, I don’t think. But in a pinch either of these two approaches will work: 1) Add width=”100%” as a parameter to the d3heatmap. That counts as centered, right?😉 2) Wrap it with a div: tags$div(style=” margin-right: auto”, d3heatmap::d3heatmap(mtcars, width=500) ) The width:500px and width=500 can be any number, but they have to match. * June 25, 2015 at 5:21 am Joe Cheng I forgot to mention in my previous solution #2, you also need to call library(htmltools). Another way to go is to include this in your first code chunk: “`{r echo=FALSE} library(htmltools) tags$style(“ }”) “` This will cause any d3heatmap in the document to be centered. June 25, 2015 at 5:21 am Joe Cheng Ugh, the “` should be three backticks. * June 25, 2015 at 8:50 am GD The row/column/value under the mouse cursor does not appear when I see an rmarkdown html in firefox (updated to latest version). * June 25, 2015 at 12:29 pm Joe Cheng I can’t reproduce this–do you have an example Rmd you can email me? (joe@rstudio.com) * June 25, 2015 at 3:38 pm ΓΔ 047 Try this on firefox http://www.htmlwidgets.org/showcase_d3heatmap.html * June 25, 2015 at 11:39 am IP Great tutorial – question about the tooltip though. Since this takes a matrix, the ‘on hover’ box shows ‘row’, ‘column’ and ‘value’. Is there a way to specify names for these? * June 25, 2015 at 12:37 pm Joe Cheng It’s not possible at the moment. Do you mind filing an issue here? https://github.com/rstudio/d3heatmap/issues/new June 28, 2015 at 1:54 pm dendextend version 1.0.1 + useR!2015 presentation | R-statistics blog […] between Joe and I). You are invited to see lively examples of the package in the post at the RStudio blog. Here is just one quick […] July 1, 2015 at 12:44 pm Guillaume Devailly Nice. It is quite slow for big heatmap (i.e. 600 x 600), and sometimes even fails… Plot.ly appears faster (at least on Firefox and Chrome): https://gdevailly.shinyapps.io/Heatmap (with plot.ly) http://moderndata.plot.ly/dashboards-in-r-with-shiny-plotly/ (tutorial) https://gdevailly.shinyapps.io/d3heatmap (with d3heatmap, dirty test, quite slow to display and sometimes fails) It’s a shame as d3heatmap function are much more R user friendly and plot.ly do not do trees. « RStudio adds custom domains, bigger data and package support to shinyapps.io DT: An R interface to the DataTables library »Blog at WordPress.com. The Tarski Theme . Subscribe to feed. FollowFOLLOW “RSTUDIO BLOG” Get every new post delivered to your Inbox. Join 19,578 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:","We’re pleased to announce d3heatmap, our new package for generating interactive heat maps using d3.js and htmlwidgets. Tal Galili, author of dendextend, collaborated with us on this package. …",d3heatmap: Interactive heat maps,Live,158 426,"G. 
Adam Cox Blocked Unblock Follow Following May 16 -------------------------------------------------------------------------------- CITIZEN SCIENTIST FINDS “DEATH STAR” IN SETI DATA SET OUTER SPACE RADIOWAVE METADATA, #ALIENLIFE, #CLICKBAIT In preparation for the SETI Institute’s Hackathon and Code Challenge , a citizen scientist, Dr. Arun Ramamoorthy , who is also a researcher from Arizona State University, was looking at data from the SETI@IBMCloud project. (The SETI@IBMCloud project and Hackathon/Code Challenge are separate projects, but related.) In particular, he was looking at the metadata for the raw SETI data, found in a table called SignalDB . To get started, he wanted to simply visualize this data as a sphere. Amongst other things, the SignalDB database contains the Right Ascension (RA) and Declination (DEC) coordinates for all “Candidate” events observed by the SETI Institute from 2013 to 2015. The RA and DEC values specify the location of objects in the sky. By mapping the (RA,DEC) coordinates to a sphere and then rotating through a small angle multiple times, Dr. Ramamoorthy was able to create this “Death Star” GIF: Dr. Arun Ramamoorthy ’s “Death Star” of radio signal observations from the SETI@IBMCloud data set.The block of data points (where the Death Star’s “superlaser” shoots out of) maps to the location of a number of star systems in the “Kepler field.” This is a patch of sky observed by NASA’s Kepler spacecraft where thousands of exoplanets were discovered before the spacecraft malfunctioned in 2012. Since the SETI Institute tends to observes stars with known exoplanets, this field shows up predominantly because of the large number of observations made in this area. The Kepler Field shows us that, on average, 1.6 exoplanets orbit each star in our galaxy. This means there are roughly 160 billion planets in our galaxy, 40 billion of which may be rocky planets within the habitable zone. What’s the likelihood of any one of these planets hosting intelligent life? If you’re interested in joining the SETI Institute on its mission, or looking at the data yourself, register for the upcoming SETI Institute hackathon and code challenge . If you enjoyed this article, or are just plain enthusiastic about the Kepler field, please ♡ it to recommend it to other Medium readers. Thanks to Mike Broberg . * Astronomy * Data Science * Data Visualization * SETI Blocked Unblock Follow FollowingG. ADAM COX FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","In preparation for the SETI Institute’s Hackathon and Code Challenge, a citizen scientist, Dr. Arun Ramamoorthy, who is also a researcher from Arizona State University, was looking at data from the…",Citizen Scientist finds “Death Star” in SETI data set,Live,159 430,"LOCATION TRACKER – PART 2 markwatson / August 11, 2016Want to scale your apps to millions of users while maintaining the ability to safely and securely (and easily!) sync private information to Cloudant? In this tutorial we’ll show you how. In Location Tracker – Part 1 we showed you how to create an iOS app that tracks your location, syncs with Cloudant, and performs geo queries to find nearby points of interest. 
We showed you how to use the database-per-user design pattern to take advantage of Cloudant’s powerful sync capabilities while ensuring a user’s location information remains private. We also discussed how the database-per-user design pattern works well for small- to medium-sized apps, but not so much when you want to scale to millions of users. In this tutorial we’ll show you how we extended Location Tracker to do just that. A REFRESHER The Location Tracker app is an iOS app developed in Swift that tracks user locations and syncs those locations to Cloudant. As a user moves, and new locations are recorded, the app queries the server for points of interest near the user’s location. Below is a screenshot of the Location Tracker app. Blue pins mark each location recorded by the app. A blue line is drawn over the path the user has travelled. Each time the Location Tracker app records a new location, a radius-based geo query is performed in Cloudant to find nearby points of interest (referred to in the app as “places”). The radius is represented by a green circle. Places are displayed as green pins. In Part 1 we identified five key requirements for the Location Tracker app: 1. Track location in the foreground and background. 2. Use geospatial queries to find points of interest within a specified radius. 3. Run offline. 4. Keep user location information private. 5. Provide ability to consolidate and analyze all locations. To satisfy requirements #4 and #5, we implemented the database-per-user design pattern. It was a great first step for learning how to use Cloudant Sync for syncing personal or private data, but this design becomes problematic when scaling to millions or even tens of thousands of users. In this post Glynn Bird points out a few of the issues with scaling using the database-per-user pattern: * Backup – How do you design a backup-and-restore plan for millions or even thousands of databases? * Reporting – How do you generate reports across millions of databases? * Change control – How do you propagate data updates across millions of databases? To help provide a solution to these issues (and more), a team of IBMers built Cloudant Envoy . CLOUDANT ENVOY FTW! Cloudant Envoy is a microservice that acts as a replication target for your PouchDB web app or Cloudant Sync-based native app. Envoy allows your client-side code to adopt a “one database per user” design pattern, with a copy of a user’s data stored on the mobile device and synced to the cloud when online, while invisibly storing all the users’ data in one large database. This prevents the proliferation of databases that occurs as users are added and facilitates simpler backup and server-side reporting.This is how Cloudant Envoy is described on GitHub. Let’s break down this description and unpack the relevant points for Location Tracker: Cloudant Envoy is a microservice that acts as a replication target for your PouchDB web app or Cloudant Sync-based native app. In Part 1 we showed how the Location Tracker iOS app targeted user-specific databases in Cloudant for replication. In this tutorial we’ll show how (without it even knowing it) the iOS app will target Cloudant Envoy. Envoy allows your client-side code to adopt a “one database per user” design pattern, with a copy of a user’s data stored on the mobile device and synced to the cloud when online… From the beginning, the Location Tracker iOS app was built using the database-per-user design pattern. 
Each user’s locations are stored locally on the iOS device and synced to Cloudant when online. This doesn’t change when replicating to Envoy. In fact zero changes were required to the iOS app to support Envoy. …while invisibly storing all the users’ data in one large database. This prevents the proliferation of databases that occurs as users are added and facilitates simpler backup and server-side reporting. Using Cloudant Envoy we can store all private location data in a single database. This makes it easier for backend developers or data scientists to work with the data and addresses the three problems we mentioned with the database-per-user pattern: backup, reporting, and change control. ARCHITECTURE In Part 1 we implemented the database-per-user design pattern and created a database for each user to track that user’s location. This is what our architecture diagram looked like: Location tracker server v1: Users hit Node.js server, location syncs directly to Cloudant, many unique small DBs in Cloudant. User registration and geo queries were performed through a Node.js application running on IBM Bluemix, while locations were synced directly to user-specific databases in Cloudant. User-specific databases were configured to replicate to a centralized database to store all locations. With Cloudant Envoy our architecture is greatly simplified: Location tracker server v2: Users hit improved Node.js server, location syncs to Cloudant Envoy proxy, a single big DB in Cloudant. Here in Part 2, user registration and geo queries are still performed through a custom Node.js app, but now all location replication is routed through Envoy and stored in a single, centralized database. We are no longer connecting directly to Cloudant. We no longer have to create databases for every user, or configure replication from those databases to our centralized location database, and we continue to satisfy our requirements, including: * Keep user location information private – This is handled completely by Envoy. Users can only access their own locations. * Provide ability to consolidate and analyze all locations – By default, with Envoy, all locations are stored in the same database. No need for replication or data duplication. THE NEW SERVER In Part 1 we discussed the Location Tracker Server, a Node.js application that provides RESTful APIs for registering new users and querying places using Cloudant Geo . For this tutorial we have created a new server to perform these functions and configure support for Cloudant Envoy. That server is called the Location Tracker Envoy Server . When you install the Location Tracker Envoy Server three databases will be created in your Cloudant instance: 1. envoyusers – This database is used by the server and by Cloudant Envoy to manage and authenticate users. 2. lt_locations_all_envoy – This database is used to keep track of all locations synced from iOS devices to Cloudant through Envoy. 3. lt_places – This database contains a list of places that the Location Tracker app will query. Follow the instructions on the Location Tracker Envoy Server GitHub page to get the Location Tracker Envoy Server up and running locally or on Bluemix. THE SAME CLIENT As mentioned previously, zero changes were required to the iOS app to support sync with Envoy. The iOS app is given the location replication target on login. Envoy is a drop-in replacement for Cloudant replication. Instead of returning the path to a user-specific database for replication, the server returns the path to the Envoy instance. 
Once you’ve set up the Location Tracker Envoy Server, follow the instructions on the Location Tracker App GitHub page to get the Location Tracker App up and running in Xcode. HOW IT WORKS In the rest of this tutorial we’ll provide more detail on how we are using Envoy. For more information on how the app tracks locations or queries for points of interest, please check out Part 1 . This tutorial focuses on Cloudant Envoy and the changes made to the backend to support Envoy. USER REGISTRATION Cloudant Envoy has a few different options for managing users. You can configure which method to use with the ENVOY_AUTH environment variable. This variable must be set on both the Cloudant Envoy app and Location Tracker Envoy Server app in Bluemix. See the Cloudant Envoy documentation for more information regarding the different authentication options available. By default users are stored in a database called envoyusers . The user registration process has been greatly simplified from Part 1. The same PUT request is sent from the iOS app: { ""username"": ""markwatson"", ""password"": ""passw0rd"", ""type"": ""user"", ""_id"": ""markwatson"" } However, the backend processing of this request is much simpler. Previously the backend would create new databases, set up API keys and passwords, and configure continuous replication between the new databases and the centralized locations database. When the new Node.js server receives the PUT request the following steps are executed: 1. Check if the user exists with the specified id. If the user already exists, then return a status of 409 to the client. 2. Store the user in the users database with their id and password (hashed). That’s it! USER LOGIN Users are logged in immediately after registering. Again, no changes were made to the iOS app. The app sends the following request to the Node.js server: { ""username"": ""markwatson"", ""password"": ""passw0rd"" } And the server replies with a response in the same format as the previous version of the server: { ""ok"": true, ""api_key"": ""markwatson"", ""api_password"": ""passw0rd"", ""location_db_name"": ""lt_locations_all_envoy"", ""location_db_host"": ""cloudant-envoy-XXXX.mybluemix.net"" } The motivation here is backwards compatibility. The app expects to receive the API key, password, database, and host to sync to. In Part 1 this was the user-specific database, but as you can see now, the server is sending the information required to sync with Envoy. The api_key and api_password fields now take the user’s username and password as their values. This is what is expected by Envoy, and by using this format the code maintains backwards compatibility with our server from Part 1. Correspondingly, the unique values from our old database-per-user pattern — location_db_name and location_db_host — now take standardized values: ""lt_locations_all_envoy"" and the Envoy host, respectively. SYNCING LOCATIONS Syncing locations between the client and the server has not changed. Envoy implements the same replication protocol as Cloudant, making the migration completely transparent to the client. The Location Tracker App uses Cloudant Sync for iOS to sync with Envoy the same exact way it would sync directly to Cloudant. THE DATA How does Envoy know who owns the data when they are all stored in the same database? There are a few different ways that Envoy can identify who owns the data, but the same principle is applied in each case: 1. When saving new locations, alter the data to include the authenticated user’s information. 2. 
When retrieving the locations, use the authenticated user’s information to filter the data. Envoy modifies each document on the way in and filters each document on the way out. Envoy provides different options for adding ownership information to the data. These options can be configured by setting the ENVOY_ACCESS environment variable in Cloudant Envoy. See the Cloudant Envoy documentation for more information. By default, Envoy stores the ownership of a document in the _id field of the document. It prepends the sha1 hash of the username to the id. Here’s an example document: { ""_id"": ""c00268ec8506774f20229f1eb9142e0d1f1a938b-014144EE-25BB-4251-94AC-A7BBD3C04CB5"", ""_rev"": ""1-8801d4d9a3bf8a539692af9697b89eb5"", ""created_at"": 1468251203669.592, ""geometry"": { ""type"": ""Point"", ""coordinates"": [ -122.39203496, 37.5668706 ] }, ""properties"": { ""timestamp"": 1468251203669.592, ""username"": ""envoy_user1"", ""background"": false }, ""type"": ""Feature"" } This document was stored for envoy_user1 . Every one of envoy_user1 ‘s documents stored by Envoy will have an _id prepended with c00268ec8506774f20229f1eb9142e0d1f1a938b . DIFF PART1 PART2 To summarize, let’s do a diff with Part 1 of the Location Tracker to see the significant changes that we’ve made to help scale our solution: * Only in Part 1 – Create new database for each user. * Only in Part 1 – Configure replication between each user-specific database and consolidated location database. * Only in Part 1 – Tell iOS app to sync directly to user-specific database in Cloudant. * Only in Part 2 – Tell iOS app to sync with Cloudant Envoy. * iOS App – No changes. Cloudant was built to scale, but creating millions of databases for millions of users is not scalable. Cloudant Envoy stores your private data in a single database that can scale to support millions of users while allowing you to reap all the benefits of Cloudant Sync. Using Cloudant Envoy we have not only improved our ability to scale, but we have simplified almost every aspect of our solution. CONCLUSION In this tutorial, we showed how to use Cloudant Envoy to scale Cloudant’s data replication & synchronization capabilities to millions of mobile users. We showed you how Cloudant Envoy provides a drop-in replacement for Cloudant replication that allows you to safely and securely sync private location information into a single, consolidated database. Cloudant Envoy is still in beta, but we’re really excited about its potential and urge you to start experimenting with it today. For more information regarding the Location Tracker and Cloudant Envoy please see the following links: * Location Tracker – Part 1 * Cloudant Envoy GitHub page * Scaling Offline First with Envoy SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: cloudant / Cloudant Envoy / database per user / Mobile / Offline First / swift Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Our Location Tracker example app shows how simple it is to use Cloudant with Swift + GeoJSON. It's offline-first and scales up the database per user pattern.,Location Tracker – Part 2,Live,160 433,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: It’s That Easy! 
USE THE MACHINE LEARNING LIBRARY

Jess Mantaro / October 22, 2015

Learn how to use the Apache® Spark™ Machine Learning Library (MLlib) in IBM Analytics for Apache Spark on IBM Bluemix. Apache® Spark™ includes extension libraries that can be used for SQL and DataFrames, streaming, machine learning, and graph analysis. In this video, you'll see how to use machine learning algorithms to determine the top drop off location for New York City taxis using a popular algorithm known as KMeans. You can also read a transcript of this video.

RELATED LINKS
* Build SQL Queries
* Load and Filter Cloudant Data with Apache Spark
* Load and Analyze dashDB Data with Apache Spark

TRY THE TUTORIAL

Learn how to use Apache® Spark™ machine learning algorithms to determine the top drop off location for New York City taxis using the KMeans algorithm.

WHAT YOU'LL LEARN

At the end of this tutorial, you should be able to:
* download New York City taxi cab data in CSV format.
* create a Scala notebook in IBM Analytics for Apache Spark.
* load a CSV file into a Scala notebook.
* use the KMeans and Vectors algorithms to analyze the data.

BEFORE YOU BEGIN

Watch the Getting Started on Bluemix video to create a Bluemix account and add the IBM Analytics for Apache Spark service.

PROCEDURE 1: DOWNLOAD NEW YORK CITY TAXI CAB DATA

1. Navigate to the NYC OpenData site.
2. Click Transportation.
3. For the search criteria, type taxi.
4. Select the trip data of your choice, and download the data in CSV format. We recommend you select the 2013_Green_Taxi_Trip_data.csv file, or change the code found later in this tutorial to match the selected year.

PROCEDURE 2: CREATE A SCALA NOTEBOOK

1. Sign in to Bluemix.
2. Access the Dashboard, and open the Apache Spark instance.
3. Click New Notebook, select Scala, type a name for the notebook, and click Create.
4. Click Add Data Source in the right sidebar.
5. Drag and drop the CSV file you downloaded in procedure 1 into the box labelled Drop file to add data source.
6. Paste the following code into the first cell in the notebook, and then click the Run icon on the toolbar. This first cell contains two commands that set up use of the Apache® Spark™ machine learning algorithms KMeans and Vectors.
Commands:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
7. Paste the following code into the second cell, and then click Run. Replace nyctaxisub.csv with the file name you used. This command reads the contents of the file and assigns it to the taxifile variable. For example, the filename could be 2013_Green_Taxi_Trip_data.csv.
Command:
val taxifile = sc.textFile(""swift://notebooks.spark/filename"")
8. Paste the following code into the third cell, and then click Run. This command shows what the data in this file looks like. When it displays, you'll see that the first row is the header for the columns, and the second row shows actual data. Of particular interest are the dropoff_latitude and dropoff_longitude columns.
Command:
taxifile.take(2)
9. Paste the following code into the fourth cell. This command filters the data so we only see the records from 2013, and also makes sure that the dropoff_latitude and dropoff_longitude aren't null. If you downloaded a different data set, the column numbers may be different.
Commands:
val taxidata=taxifile.filter(_.contains(""2013"")).
  filter(_.split("","")(4) != """").
  filter(_.split("","")(18) != """")
10. Paste the following code into the fifth cell, and then click Run. This filters the data down to drop off areas with latitudes and longitudes that are roughly in the Manhattan area.
Commands:
val taxifence = taxidata.filter(_.split("","")(4).toDouble>40.70).
  filter(_.split("","")(4).toDouble<40.86).
  filter(_.split("","")(18).toDouble>(-74.02)).
  filter(_.split("","")(18).toDouble<(-73.93))
11. Paste the following code into the sixth cell, and then click Run. This command takes the data and puts it in a vector which will be used as input for the KMeans algorithm.
Command:
val taxi=taxifence.map(line=>Vectors.dense(line.split(',').slice(17,19).map(_.toDouble)))
12. Paste the following code into the seventh cell, and then click Run. This final cell contains the commands that invoke the KMeans algorithm. In this case, we're looking for the single top drop off location; however, the parameters could be changed in this cell to determine the top three or the top ten locations. It's also interesting to note that Apache® Spark™ machine learning provides other algorithms for collaborative filtering, clustering, and classification.
Commands:
val model=KMeans.train(taxi,1,1)
val clusterCenters=model.clusterCenters.map(_.toArray)
clusterCenters.foreach(lines => println(lines(0),lines(1)))
Select and copy the coordinates. Then, open a browser, and paste the coordinates into a map program such as Google Maps to see the location on the map. Find more videos in the Spark Learning Center at http://developer.ibm.com/clouddataservices/spark.",How to use the Spark machine learning programming model in IBM Analytics for Apache Spark on IBM Bluemix,Use the Machine Learning Library in Spark,Live,161 439,"
Posted on March 27, 2017 March 27, 2017 Economics and Finance , R , Statistics and Data ScienceAN INTRODUCTION TO STOCK MARKET DATA ANALYSIS WITH R (PART 1) Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2 ). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and has been translated into both Russian (which used to be on backtest.ru at a link that now appears to no longer work) and Chinese ( here and here ). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis. This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah . In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages. NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk. INTRODUCTION Advanced mathematics and statistics have been present in finance for some time. Prior to the 1980s, banking and finance were well-known for being “boring”; investment banking was distinct from commercial banking and the primary role of the industry was handling “simple” (at least in comparison to today) financial instruments, such as loans. Deregulation under the Regan administration, coupled with an influx of mathematical talent, transformed the industry from the “boring” business of banking to what it is today, and since then, finance has joined the other sciences as a motivation for mathematical research and advancement. For example one of the biggest recent achievements of mathematics was the derivation of the Black-Scholes formula , which facilitated the pricing of stock options (a contract giving the holder the right to purchase or sell a stock at a particular price to the issuer of the option). That said, bad statistical models, including the Black-Scholes formula, hold part of the blame for the 2008 financial crisis . In recent years, computer science has joined advanced mathematics in revolutionizing finance and trading , the practice of buying and selling of financial assets for the purpose of making a profit. In recent years, trading has become dominated by computers; algorithms are responsible for making rapid split-second trading decisions faster than humans could make (so rapidly, the speed at which light travels is a limitation when designing systems ). Additionally, machine learning and data mining techniques are growing in popularity in the financial sector, and likely will continue to do so. In fact, a large part of algorithmic trading is high-frequency trading (HFT) . While algorithms may outperform humans, the technology is still new and playing an increasing role in a famously turbulent, high-stakes arena. 
HFT was responsible for phenomena such as the 2010 flash crash and a 2013 flash crash prompted by a hacked Associated Press tweet about an attack on the White House. My articles, however, will not be about how to crash the stock market with bad mathematical models or trading algorithms. Instead, I intend to provide you with basic tools for handling and analyzing stock market data with R. We will be using stock data as a first exposure to time series data , which is data considered dependent on the time it was observed (other examples of time series include temperature data, demand for energy on a power grid, Internet server load, and many, many others). I will also discuss moving averages, how to construct trading strategies using moving averages, how to formulate exit strategies upon entering a position, and how to evaluate a strategy with backtesting. DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)! This is purely introductory knowledge, not enough to make a living trading stocks. People can and do lose money trading stocks, and you do so at your own risk! GETTING AND VISUALIZING STOCK DATA GETTING DATA FROM YAHOO! FINANCE WITH QUANTMOD Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance , Google Finance , or a number of other sources, and the quantmod package provides easy access to Yahoo! Finance and Google Finance data, along with other sources. In fact, quantmod provides a number of useful features for financial modelling, and we will be seeing those features throughout these articles. In this lecture, we will get our data from Yahoo! Finance. # Get quantmod if (!require(""quantmod"")) { install.packages(""quantmod"") library(quantmod) } start <- as.Date(""2016-01-01"") end <- as.Date(""2016-10-01"") # Let' Apple's ticker symbol is AAPL. We use the # quantmod function getSymbols, and pass a string as a first argument to # identify the desired ticker symbol, pass 'yahoo' to src for Yahoo! # Finance, and from and to specify date ranges # The default behavior for getSymbols is to load data directly into the # global environment, with the object being named after the loaded ticker # symbol. This feature may become deprecated in the future, but we exploit # it now. getSymbols(""AAPL"", src = ""yahoo"", from = start, to = end) ## As of 0.4-0, 'getSymbols' uses env=parent.frame() and ## auto.assign=TRUE by default. ## ## This behavior will be phased out in 0.5-0 when the call will ## default to use auto.assign=FALSE. getOption(""getSymbols.env"") and ## getOptions(""getSymbols.auto.assign"") are now checked for alternate defaults ## ## This message is shown once per session and may be disabled by setting ## options(""getSymbols.warning4.0""=FALSE). See ?getSymbols for more details. ## [1] ""AAPL"" # What is AAPL? 
class(AAPL) ## [1] ""xts"" ""zoo"" # Let's see the first few rows head(AAPL) ## AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume ## 2016-01-04 102.61 105.37 102.00 105.35 67649400 ## 2016-01-05 105.75 105.85 102.41 102.71 55791000 ## 2016-01-06 100.56 102.37 99.87 100.70 68457400 ## 2016-01-07 98.68 100.13 96.43 96.45 81094400 ## 2016-01-08 98.55 99.11 96.76 96.96 70798000 ## 2016-01-11 98.97 99.06 97.34 98.53 49739400 ## AAPL.Adjusted ## 2016-01-04 102.61218 ## 2016-01-05 100.04079 ## 2016-01-06 98.08303 ## 2016-01-07 93.94347 ## 2016-01-08 94.44022 ## 2016-01-11 95.96942 Let’s briefly discuss this. getSymbols() created in the global environment an object called AAPL (named automatically after the ticker symbol of the security retrieved) that is of the xts class (which is also a zoo -class object). xts objects (provided in the xts package) are seen as improved versions of the ts object for storing time series data. They allow for time-based indexing and provide custom attributes, along with allowing multiple (presumably related) time series with the same time index to be stored in the same object. (Here is a vignette describing xts objects.) The different series are the columns of the object, with the name of the associated security (here, AAPL) being prefixed to the corresponding series. Yahoo! Finance provides six series with each security. Open is the price of the stock at the beginning of the trading day (it need not be the closing price of the previous trading day), high is the highest price of the stock on that trading day, low the lowest price of the stock on that trading day, and close the price of the stock at closing time. Volume indicates how many stocks were traded. Adjusted close (abreviated as “adjusted” by getSymbols() ) is the closing price of the stock that adjusts the price of the stock for corporate actions. While stock prices are considered to be set mostly by traders, stock splits (when the company makes each extant stock worth two and halves the price) and dividends (payout of company profits per share) also affect the price of a stock and should be accounted for. VISUALIZING STOCK DATA Now that we have stock data we would like to visualize it. I first use base R plotting to visualize the series. plot(AAPL[, ""AAPL.Close""], main = ""AAPL"") A linechart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a Japanese candlestick plot , so named because it was first created by 18th century Japanese rice traders. Use the function candleChart() from quantmod to create such a chart. candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"") With a candlestick chart, a black candlestick indicates a day where the closing price was higher than the open (a gain), while a red candlestick indicates a day where the open was higher than the close (a loss). The wicks indicate the high and the low, and the body the open and close (hue is used to determine which end of the body is the open and which the close). Candlestick charts are popular in finance and some strategies in technical analysis use them to make trading decisions, depending on the shape, color, and position of the candles. I will not cover such strategies today. (Notice that the volume is tracked as a bar chart on the lower pane as well, with the same colors as the corresponding candlesticks. 
Some traders like to see how many shares are being traded; this can be important in trading.) We may wish to plot multiple financial instruments together; we may want to compare stocks, compare them to the market, or look at other securities such as exchange-traded funds (ETFs) . Later, we will also want to see how to plot a financial instrument against some indicator, like a moving average. For this you would rather use a line chart than a candlestick chart. (How would you plot multiple candlestick charts on top of one another without cluttering the chart?) Below, I get stock data for some other tech companies and plot their adjusted close together. # Let's get data for Microsoft (MSFT) and Google (GOOG) (actually, Google is # held by a holding company called Alphabet, Inc., which is the company # traded on the exchange and uses the ticker symbol GOOG). getSymbols(c(""MSFT"", ""GOOG""), src = ""yahoo"", from = start, to = end) ## [1] ""MSFT"" ""GOOG"" # Create an xts object (xts is loaded with quantmod) that contains closing # prices for AAPL, MSFT, and GOOG stocks <- as.xts(data.frame(AAPL = AAPL[, ""AAPL.Close""], MSFT = MSFT[, ""MSFT.Close""], GOOG = GOOG[, ""GOOG.Close""])) head(stocks) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 105.35 54.80 741.84 ## 2016-01-05 102.71 55.05 742.58 ## 2016-01-06 100.70 54.05 743.62 ## 2016-01-07 96.45 52.17 726.39 ## 2016-01-08 96.96 52.33 714.47 ## 2016-01-11 98.53 52.30 716.03 # Create a plot showing all series as lines; must use as.zoo to use the zoo # method for plot, which allows for multiple series to be plotted on same # plot plot(as.zoo(stocks), screens = 1, lty = 1:3, xlab = ""Date"", ylab = ""Price"") legend(""right"", c(""AAPL"", ""MSFT"", ""GOOG""), lty = 1:3, cex = 0.5) What’s wrong with this chart? While absolute price is important (pricey stocks are difficult to purchase, which affects not only their volatility but your ability to trade that stock), when trading, we are more concerned about the relative change of an asset rather than its absolute price. Google’s stocks are much more expensive than Apple’s or Microsoft’s, and this difference makes Apple’s and Microsoft’s stocks appear much less volatile than they truly are (that is, their price appears to not deviate much). One solution would be to use two different scales when plotting the data; one scale will be used by Apple and Microsoft stocks, and the other by Google. plot(as.zoo(stocks[, c(""AAPL.Close"", ""MSFT.Close"")]), screens = 1, lty = 1:2, xlab = ""Date"", ylab = ""Price"") par(new = TRUE) plot(as.zoo(stocks[, ""GOOG.Close""]), screens = 1, lty = 3, xaxt = ""n"", yaxt = ""n"", xlab = """", ylab = """") axis(4) mtext(""Price"", side = 4, line = 3) legend(""topleft"", c(""AAPL (left)"", ""MSFT (left)"", ""GOOG""), lty = 1:3, cex = 0.5) Not only is this solution difficult to implement well, it is seen as a bad visualization method; it can lead to confusion and misinterpretation, and cannot be read easily. A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply. One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot: This will require transforming the data in the stocks object, which I do next. # Get me my beloved pipe operator! 
if (!require(""magrittr"")) { install.packages(""magrittr"") library(magrittr) } ## Loading required package: magrittr stock_return % t %>% as.xts head(stock_return) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 1.0000000 1.0000000 1.0000000 ## 2016-01-05 0.9749407 1.0045620 1.0009975 ## 2016-01-06 0.9558614 0.9863139 1.0023994 ## 2016-01-07 0.9155197 0.9520073 0.9791734 ## 2016-01-08 0.9203607 0.9549271 0.9631052 ## 2016-01-11 0.9352634 0.9543796 0.9652081 plot(as.zoo(stock_return), screens = 1, lty = 1:3, xlab = ""Date"", ylab = ""Return"") legend(""topleft"", c(""AAPL"", ""MSFT"", ""GOOG""), lty = 1:3, cex = 0.5) This is a much more useful plot. We can now see how profitable each stock was since the beginning of the period. Furthermore, we see that these stocks are highly correlated; they generally move in the same direction, a fact that was difficult to see in the other charts. Alternatively, we could plot the change of each stock per day. One way to do so would be to plot the percentage increase of a stock when comparing day to day , with the formula: But change could be thought of differently as: These formulas are not the same and can lead to differing conclusions, but there is another way to model the growth of a stock: with log differences. (Here, is the natural log, and our definition does not depend as strongly on whether we use or .) The advantage of using log differences is that this difference can be interpreted as the percentage change in a stock but does not depend on the denominator of a fraction. We can obtain and plot the log differences of the data in stocks as follows: stock_change % log %>% diff head(stock_change) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 NA NA NA ## 2016-01-05 -0.025378648 0.0045516693 0.000997009 ## 2016-01-06 -0.019763704 -0.0183323194 0.001399513 ## 2016-01-07 -0.043121062 -0.0354019469 -0.023443064 ## 2016-01-08 0.005273804 0.0030622799 -0.016546113 ## 2016-01-11 0.016062548 -0.0005735067 0.002181138 plot(as.zoo(stock_change), screens = 1, lty = 1:3, xlab = ""Date"", ylab = ""Log Difference"") legend(""topleft"", c(""AAPL"", ""MSFT"", ""GOOG""), lty = 1:3, cex = 0.5) Which transformation do you prefer? Looking at returns since the beginning of the period make the overall trend of the securities in question much more apparent. Changes between days, though, are what more advanced methods actually consider when modelling the behavior of a stock. so they should not be ignored. MOVING AVERAGES Charts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as technical analysis ). Let’s now consider how we can find trends in stocks. A -day moving average is, for a series and a point in time , the average of the past days: that is, if denotes a moving average process, then: Moving averages smooth a series and helps identify trends. The larger is, the less responsive a moving average process is to short-term fluctuations in the series . The idea is that moving average processes help identify trends from “noise”. Fast moving averages have smaller and more closely follow the stock, while slow moving averages have larger , resulting in them responding less to the fluctuations of the stock and being more stable. quantmod allows for easily adding moving averages to charts, via the addSMA() function. 
candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"") addSMA(n = 20) Notice how late the rolling average begins. It cannot be computed until 20 days have passed. This limitation becomes more severe for longer moving averages. Because I would like to be able to compute 200-day moving averages, I’m going to extend out how much AAPL data we have. That said, we will still largely focus on 2016. start = as.Date(""2010-01-01"") getSymbols(c(""AAPL"", ""MSFT"", ""GOOG""), src = ""yahoo"", from = start, to = end) ## [1] ""AAPL"" ""MSFT"" ""GOOG"" # The subset argument allows specifying the date range to view in the chart. # This uses xts style subsetting. Here, I'm using the idiom # 'YYYY-MM-DD/YYYY-MM-DD', where the date on the left-hand side of the / is # the start date, and the date on the right-hand side is the end date. If # either is left blank, either the earliest date or latest date in the # series is used (as appropriate). This method can be used for any xts # object, say, AAPL candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"", subset = ""2016-01-04/"") addSMA(n = 20) You will notice that a moving average is much smoother than the actual stock data. Additionally, it’ a stock needs to be above or below the moving average line in order for the line to change direction. Thus, crossing a moving average signals a possible change in trend, and should draw attention. Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once. candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"", subset = ""2016-01-04/"") addSMA(n = c(20, 50, 200)) The 20-day moving average is the most sensitive to local changes, and the 200-day moving average the least. Here, the 200-day moving average indicates an overall bearish trend: the stock is trending downward over time. The 20-day moving average is at times bearish and at other times bullish , where a positive swing is expected. You can also see that the crossing of moving average lines indicate changes in trend. These crossings are what we can use as trading signals , or indications that a financial security is changing direction and a profitable trade might be made. Visit next week to read about how to design and test a trading strategy using moving averages. # Package/system information sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 15.10 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] methods stats graphics grDevices utils datasets base ## ## other attached packages: ## [1] magrittr_1.5 quantmod_0.4-7 TTR_0.23-1 xts_0.9-7 ## [5] zoo_1.7-14 RWordPress_0.2-3 optparse_1.3.2 knitr_1.15.1 ## ## loaded via a namespace (and not attached): ## [1] lattice_0.20-34 XML_3.98-1.5 bitops_1.0-6 grid_3.3.3 ## [5] formatR_1.4 evaluate_0.10 highr_0.6 stringi_1.1.3 ## [9] getopt_1.20.0 tools_3.3.3 stringr_1.2.0 RCurl_1.95-4.8 ## [13] XMLRPC_0.3-0 AdvertisementsSHARE THIS: * Twitter * Facebook * Email * Reddit * More * * Print * LinkedIn * * Google * Tumblr * * Pinterest * Pocket * * Telegram * WhatsApp * * Skype * LIKE THIS: Like Loading... 
","This post is the first in a two-part series on stock data analysis using R. In these posts, basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking will be covered.",An Introduction to Stock Market Data Analysis with R (Part 1),Live,162 445,"COMPOSE NOTES: PORTAL POWERUPS AND DELETED DEPLOYMENTS Published Nov 1, 2016 There's now the option to power up more Compose Portals and we've made a recently introduced feature, Deleted Deployments, easier to work with - in this Compose Notes, we'll tell you all about them: DELETED DEPLOYMENTS When we introduced the ability to see and recover your deleted deployments from backup, we added all your deleted deployments in a list underneath your existing deployments. We didn't realise how much of a distraction that could be so we've made a small change and hidden them by default. Now, if you look in your Compose console, at the bottom of your deployments list you'll find something like this: This line tells you simply how many backups of previously deleted deployments are available to be restored. Clicking on the Show button will open up the list so it will appear like this: Now you can select any deleted deployment backup and recover it. It makes bringing your deleted databases back from the dead easier and less distracting in day to day use. PORTAL POWERUPS We're working on something special when it comes to Compose access portals, and as part of making sure the foundations are solidly all in place, there's a little-big change happening in the Compose console - the ability to add more portals. A quick refresher, for those who don't know - each Compose database deployment runs on its own private virtual encrypted network.
The only way traffic gets in or out is through one of our access portals and there's generic portals for TCP connections and SSH tunnels and more specialized variants that know a bit more about the database they are working with, like the Mongo Router. There's a number of reasons for doing this, which we talk about in the recent article, Do you know why Compose proxies database connections . The thing is, up until now, we've pretty much set in stone how many portals you can have, apart from the SSH tunnel. Well, that's what we're changing; you can now add as many portals as you think you may need. We're still enforcing minimum numbers of portals so you won't go below the number you need for high availability failover, but if you want an extra TCP portal or Mongo Router for your deployment then you can have one. Or two. Just so you know, each extra portal is $4.50 a month. To get your extra portals, just visit the Security tab for your database in the Compose console. Under each class of portal you'll see an Add button for that class - Add TCP Portal , Add SSH Portal , Add Mongo Router and so on. Click on the button and you get an extra portal. These extra portals will be identified in the Overview of the deployment's connection strings too. They are also identified in the portal list. Each portal also now displays its short name, external DNS name which you can use in connection strings you create and the internal IP address for the portal on the private network so you can identify connections from that portal in the logs. If a portal isn't needed any more, then click the Remove button next to it in the list - only portals which aren't part of the required quorum of portals get the Remove button. As we said, this is the foundation for some exciting new features. Currently, you can only have three of any portal class on a deployment, but that's one of the things we are working on. We'll let you know more about that, and the other things, when we are ready to unveil them, but rest assured they'll give you more control of your databases while letting you get on with developing your apps and running your business. --------------------------------------------------------------------------------","There's now the option to power up more Compose Portals and we've made a recently introduced feature, Deleted Deployments, easier to work with - in this Compose Notes, we'll tell you all about them.",Portal Powerups and Deleted Deployments,Live,163 446,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * Learn TensorFlow and Deep Learning Together and Now! * This Week in Data Science (March 14, 2017) * This Week in Data Science (March 7, 2017) * This Week in Data Science (February 28, 2017) * This Week in Data Science (February 21, 2017) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsBLOGROLL * RBloggers LEARN TENSORFLOW AND DEEP LEARNING TOGETHER AND NOW! Posted on March 20, 2017 by Saeed Aghabozorgi I get a lot of questions about how to learn TensorFlow and Deep Learning. I’ll often hear, “How do I start learning TensorFlow?” or “How do I start learning Deep Learning?”. My answer is, “Learn Deep Learning and TensorFlow at the same time!”. See, it’s not easy to learn one without the other. 
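Even a five-minute first experiment teaches the two together: a couple of lines of TensorFlow already introduce the tensors, operations, and sessions that every deep learning model discussed in this post is built from. A minimal sketch, assuming the TensorFlow 1.x graph-and-session API that this article describes:

import tensorflow as tf

# A tiny graph: an elementwise product and a sum, the same weighted-sum
# computation a single neuron performs
x = tf.constant([1.0, 2.0, 3.0])
w = tf.constant([0.5, 0.5, 0.5])
y = tf.reduce_sum(x * w)

# Defining the graph computes nothing; launching a Session actually runs it
with tf.Session() as sess:
    print(sess.run(y))  # 3.0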
Of course, you can use other libraries like Keras or Theano, but TensorFlow is a clear favorite when it comes to libraries for deep learning. And now is the best time to start. If you haven’t noticed, there’s a huge wave of new startups or big companies adopting deep learning. Deep Learning is the hottest skill to have right now. So let’s start from the basics. What actually is “Deep Learning” and why is it so hot in data science right now? What’s the difference between Deep Learning and traditional machine learning? Why TensorFlow? And where can you start learning? WHAT IS DEEP LEARNING? Inspired by the brain, deep learning is a type of machine learning that uses neural networks to model high-level abstractions in data. The major difference between Deep Learning and Neural Networks is that Deep Learning has multiple hidden layers, which allows deep learning models (or deep neural networks) to extract complex patterns from data. HOW IS DEEP LEARNING DIFFERENT FROM TRADITIONAL MACHINE LEARNING ALGORITHMS, SUCH AS NEURAL NETWORKS? Under the umbrella of Artificial Intelligence (AI), machine learning is a sub-field of algorithms that can learn on their own , including Decision Trees, Linear Regression, K-means clustering, Neural Networks, and so on. Deep Neural Networks, in particular, are super-powered Neural Networks that contain several hidden layers. With the right configuration/hyper-parameters, deep learning can achieve impressively accurate results compared to shallow Neural Networks with the same computational power. WHY IS DEEP LEARNING SUCH A HOT TOPIC IN THE DATA SCIENCE COMMUNITY? Simply put, across many domains, deep learning can attain much faster and more accurate results than ever before , such as image classification, object recognition, sequence modeling, speech recognition, as so on. It all started recently, too; around 2015. There were three key catalysts that came together resulting in the popularity of deep learning: 1. Big Data : the presence of extremely large and complex datasets; 2. GPUs : the low cost and wide availability of GPUs made the parallel processing faster and cheaper than ever; 3. Advances in deep learning algorithms , especially for complex pattern recognition. These three factors resulted in the deep learning boom that we see today. Self-driving cars and drones, chat bots, translations, AI playing games. You can now see a tremendous surge in the demand for data scientists and cognitive developers. Big companies are recognizing this evolution in data-driven insights, which is why you now see IBM, Google, Apple, Tesla, and Microsoft investing a lot of money in deep learning. WHAT ARE THE APPLICATIONS OF DEEP LEARNING? Historically, the goal of machine learning was to move humanity towards the singularity of “General Artificial Intelligence ”. But not surprisingly, this goal has been tremendously difficult to attain. So instead of trying to develop generalized AI, scientists started to develop a series of models and algorithms that excelled in specific tasks. So, to realize the main applications of Deep Learning, it is better to briefly take a look at each of the different types of Deep Neural Networks, their main applications, and how they work. WHAT ARE THE DIFFERENT TYPES OF DEEP NEURAL NETWORKS? CONVOLUTIONAL NEURAL NETWORKS (CNNS) Assume that you have a dataset of images of cats and dogs, and you want to build the model that can recognize and differentiate them. Traditionally, your first step would be “feature selection”. 
That is, to choose the best features from your images, and then use those features in a classification algorithm (e.g., Logistic Regression or Decision Tree), resulting in a model that could predict “cat” or “dog” given an image. These chosen features could simply be the color, object edges, pixel location, or countless other features that could be extracted from the images. Of course, the better and effective the feature sets you found, the more accurate and efficient image classification you could obtain. In fact, in the last two decades, there has been a lot of scientific research in image processing just about how one can find the best feature sets from images for the purposes of classification. However, as you can imagine, the process of selecting and using the best features is a tremendously time-consuming task and is often ineffective. Further, extending the features to other types of images becomes an even greater problem – the features you used to discriminate cats and dogs cannot be generalized, for example, for recognizing hand-written digits. Therefore, the importance of feature selection can’t be overstated. Enter convolutional neural networks (CNNs). Suddenly, without having to find or select features, CNNs finds the best features for you automatically and effectively. So instead of you choosing what image features to classify dogs vs. cats, CNNs can automatically find those features and classify the images for you. Convolutional Neural Network (Wikipedia) WHAT ARE THE CNN APPLICATIONS? CNNs have gained a lot of attention in the machine learning community over the last few years. This is due to the wide range of applications where CNNs excel, especially machine vision projects: image recognition/classifications , object detection/recognition in images , digit recognition , coloring black and white images , translation of text on the images , and creating art images , Lets look closer to a simple problem to see how CNNs work. Consider the digit recognition problem. We would like to classify images of handwritten numbers, where the target will be the digit (0,1,2,3,4,5,6,7,8,9) and the observations are the intensity and relative position of pixels. After some training, it’s possible to generate a “function” that map inputs (the digit image) to desired outputs (the type of digit). The only problem is how well this map operation occurs. While trying to generate this “function”, the training process continues until the model achieves a desired level of accuracy on the training data. You can learn more about this problem and the solution for it through our convolution network with hands-on notebooks . HOW DOES IT WORK? Convolutional neural networks (CNNs) is a type of feed-forward neural network , consist of multiple layers of neurons that have learnable weights and biases. Each neuron in a layer that receives some input, process it, and optionally follows it with a non-linearity. The network has multiple layers such as convolution, max pool, drop out and fully connected layers. In each layer, small neurons process portions of the input image. The outputs of these collections are then tiled so that their input regions overlap, to obtain a higher-resolution representation of the original image; and it is repeated for every such layer. The important point here is: CNNs are able to break the complex patterns down into a series of simpler patterns, through multiple layers. RECURRENT NEURAL NETWORK (RNN) Recurrent Neural Network tries to solve the problem of modeling the temporal data. 
You feed the network with the sequential data, it maintains the context of data and learns the patterns in the temporal data. WHAT ARE THE APPLICATIONS OF RNN? Yes, you can use it to model time-series data such as weather data, stocks, or sequential data such as genes. But you can also do other projects, for example, for text processing tasks like sentiment analysis and parsing. More generally, for any language model that operates at word or character level. Here are some interesting projects done by RNNs: speech recognition , adding sounds to silent movies , Translation of Text , chat bot , hand writing generation , language modeling (automatic text generation) , and Image Captioning . HOW DOES IT WORK? The Recurrent Neural Network is a specialized type of Neural Network that solves the issue of maintaining context for sequential data . RNNs are models with a simple structure and a feedback mechanism built-in. The output of a layer is added to the next input and fed back to the same layer. At each iterative step, the processing unit takes in an input and the current state of the network and produces an output and a new state that is re-fed into the network . However, this model has some problems . It’s very computationally expensive to maintain the state for large amounts of units, even more so over a long amount of time. Additionally, Recurrent Networks are very sensitive to changes in their parameters. To solve these problems, a way to keep information over long periods of time and additionally solve the oversensitivity to parameter changes, i.e., make backpropagating through the Recurrent Networks more viable was found. What is it? Long-Short Term Memory (LSTM). LSTM is an abstraction of how computer memory works: you have a linear unit, which is the information cell itself, surrounded by three logistic gates responsible for maintaining the data. One gate is for inputting data into the information cell, one is for outputting data from the input cell, and the last one is to keep or forget data depending on the needs of the network. If you want to practice the basic of RNN/LSTM with TensorFlow or language modeling, you can practice it here . RESTRICTED BOLTZMANN MACHINE (RBM) RBMs are used to find the patterns in data in an unsupervised fashion. They are shallow neural nets that learn to reconstruct data by themselves. They are very important models, because they can automatically extract meaningful features from a given input, without the need to label them. RBMs might not be outstanding if you look at them as independent networks, but they are significant as building blocks of other networks, such as Deep Believe Networks. WHAT ARE THE APPLICATIONS OF RBM? RBM is useful for unsupervised tasks such as feature extraction/learning, dimensionality reduction, pattern recognition, recommender systems ( Collaborative Filtering ), classification, regression, and topic modeling. To understand the theory of RBM and application of RBM in Recommender Systems you can run these notebooks . HOW DOES IT WORK? It only possesses two layers: a visible input layer and a hidden layer where the features are learned. Simply put, RBM takes the inputs and translates them into a set of numbers that represents them. Then, these numbers can be translated back to reconstruct the inputs. Through several forward and backward passes, the RBM will be trained. 
Now we have a trained RBM model that can reveal two things: first, what is the interrelationship among the input features; second, which features are the most important ones when detecting patterns. DEEP BELIEF NETWORKS (DBN) Deep Belief Network is an advanced Multi-Layer Perceptron (MLP). It was invented to solve an old problem in traditional artificial neural networks. Which problem? The backpropagation in traditional Neural Networks can often lead to “local minima” or “vanishing gradients”. This is when your “error surface” contains multiple grooves and you fall into a groove that is not the lowest possible groove as you perform gradient descent. WHAT ARE THE APPLICATIONS OF DBN? DBN is generally used for classification (same as traditional MLPs). One the most important applications of DBN is image recognition. The important part here is that DBN is a very accurate discriminative classifier and we don’t need a big set of labeled data to train DBN; a small set works fine because feature extraction is unsupervised by a stack of RBMs. HOW DOES IT WORK? DBN is similar to MLP in term of architecture, but different in training approach. DBNs can be divided into two major parts. The first one is stacks of RBMs to pre-train our network. The second one is a feed-forward backpropagation network, that will further refine the results from the RBM stack. In the training process, each RBM learns the entire input. Then, the stacked RBMs, can detect inherent patterns in inputs.DBN solves the “vanishing problem” by using this extra step, so-called DBN solves the “vanishing problem” by using this extra step, so-called pre-training . Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there. AUTOENCODER An autoencoder is an artificial neural network employed to recreate a given input. It takes a set of unlabeled inputs, encodes them and then tries to extract the most valuable information from them. They are used for feature extraction, learning generative models of data, dimensionality reduction and can be used for compression. They are very similar to RBMs but can have more than 2 layers. WHAT ARE THE APPLICATIONS OF AUTOENCODERS? Autoencoders are employed in some of the largest deep learning applications, especially for unsupervised tasks. For example, for Feature Extraction , Pattern recognition, and Dimensionality Reduction . In another example, say that you want to extract what feeling the person in a photography is feeling , Nikhil Buduma explains the utility of this type of Neural Network with excellence. HOW DOES IT WORK? RBM is an example of Autoencoders, but with fewer layers. An autoencoder can be divided into two parts: the encoder and the decoder . Let’s say that we want to classify some facial images and each image is very high dimensionally (e.g 50×40). The encoder needs to compress the representation of the input. In this case we are going to compress the face of our person, that consists of 2000 dimensional data to only 30 dimensions, taking some steps between this compression. The decoder is a reflection of the encoder network. It works to recreate the input, as closely as possible. It has an important role during training, to force the autoencoder to select the most important features in the compressed representation. After training, you can use 30 dimensions to apply your algorithms. WHY TENSORFLOW? HOW DOES IT WORK? 
TensorFlow is also just a library but an excellent one. I believe that TensorFlow’s capability to execute the code on different devices, such as CPUs and GPUs, is its superpower. This is a consequence of its specific structure. TensorFlow defines computations as graphs and these are made with operations (also know as “ops”). So, when we work with TensorFlow, it is the same as defining a series of operations in a Graph. To execute these operations as computations, we must launch the Graph into a Session. The session translates and passes the operations represented in the graphs to the device you want to execute them on, be it a GPU or CPU. For example, the image below represents a graph in TensorFlow. W , x, and b are tensors over the edges of this graph. MatMul is an operation over the tensors W and x , after that Add is called and add the result of the previous operator with b . The resultant tensors of each operation cross the next one until the end, where it’s possible to get the wanted result. TensorFlow is really an extremely versatile library that was originally created for tasks that require heavy numerical computations. For this reason, TensorFlow is a great library for the problem of machine learning and deep neural networks. WHERE SHOULD I START LEARNING? Again, as I mentioned first, it does not matter where to start, but I strongly suggest that you learn TensorFlow and Deep Learning together. Deep Learning with TensorFlow is a course that we created to put them together. Check it out and please let us know what you think of it. Good luck on your journey into one of the most exciting technologies to surface in our field over the past few years. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: data science , Deep Learning , Deep Neural Networks , TensorFlow -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","In this article, we discuss various Deep Learning approaches and recommend you a way to learn TensorFlow and Deep Learning at the same time.",Learn TensorFlow and Deep Learning Together and Now!,Live,164 447,"Compose The Compose logo Articles Sign in Free 30-day trialCLASSCRAFT - MAKING THE MOST OF COMPOSE Published Apr 24, 2017 case study mongodb elasticsearch Classcraft - Making the Most of ComposeClasscraft gamifies the whole classroom experience, making education a fun adventure for both students and teachers. We chatted with Shawn Young, ex-teacher, programmer, and founder of Classcraft Studios about their teaching platform built on Meteor.js and their use of Compose for MongoDB and Elasticsearch. “It’s boring.” That’s the usual answer most parents get when they ask their children about school. Shawn Young, a former 11th grade teacher has seen this firsthand in his career. According to Shawn, “There’s a crisis in education right now. We’ve started solving a lot of the logistical problems with technology like parent communication, homework, and distributing resources. 
But now we’re realizing that, as a market, all these great tools aren’t actually enough to get students excited about coming to school.” Worse, a recent Gallup study found out that this disengagement increases as students progress causing dropouts. At home, the students have a richer digital interactive experience through the internet, social media and games. At school this experience is missing. Shawn wanted to solve this engagement problem. In 2013, he created a basic online role-playing game that would make the classroom participation more engaging for his students. A former student posted the game on Reddit. Within a week, it went to the front page of Reddit Gaming. And suddenly Shawn started to get inquiries from thousands of teachers about the game. Realizing he had got something here, Shawn teamed up with his father Lauren, a 35-year business veteran, and brother Devin, a creative director in New York, to start Classcraft Studios. The first beta of Classcraft was launched in January 2014 and then an open version of the product was made available in August of the same year. The original app was built on PHP. But soon they moved to Node.js and Meteor.js because of the scaling and speed they needed for a real-time game. The entire monolithic single server app was hosted on Amazon Web Services (AWS) and deployed using Capistrano, a Ruby tool, for compiling and deploying a Meteor instance. It all worked fine for a year, but as Classcraft exploded in popularity they started to see some hiccups with the architecture. They had only one server, but needed the ability to run multiple instances of the app. Node wasn’t designed for it. They used Passenger (also a Ruby tool) to overcome this Node limitation. NGINX was used to direct people to the right process. It temporarily made things better, but then they started running into memory leak issues that would impact everyone on the server. 'Hot patches' were deployed to fix things in the app, but these forced restarts, which required all users to connect to the database at the same time. As a result, the database started to crash. The obvious solution was to scale vertically by adding memory to the servers. So, they switched to the top tier of the AWS services. But soon it became clear that they also needed horizontal scaling – a challenge because documentation on how to do this with Meteor was sparse at the time. Fortunately, Meteor came out with Galaxy, a Docker-based solution for hosting Meteor apps that would enable horizontal scaling through containers. Upgrading from MongoDB 2.6 to 3.2 also helped mitigate some of the performance issues. But all these changes came at a cost. During peak times, Shawn found himself spending 20 hours a week on sysadmin tasks for just to keep the app running. During one of their upgrades, Shawn said, ”I stayed up all night, and I hadn’t completed the migration – I thought I would do it fast the next day, but basically it was becoming a huge time sink. My senior developer and I were running on fumes. The end of that first night I started looking into other solutions.” That’s when Classcraft decided to move to Compose. It coincided well, because it was right after Compose had implemented the WiredTiger storage engine as an option. Classcraft could use it to migrate their entire platform very easily. As the product evolved, the team developed features requiring advanced search capabilities. 
While MongoDB is great at many things, the complex types of location-based and fuzzy searches they needed weren't ones that MongoDB supports well. Thus, Classcraft turned to Compose for Elasticsearch. “Part of what’s cool is that basically you don’t have to provision an entire Elasticsearch setup yourself. I can just press a button [on the Compose console] and then I know, for our use, it’s probably the best way that it should be set up.” Another feature they liked about Compose was the ability to assign user permissions and roles. “It’s actually pretty cool to be able to give developers selective access. For awhile we would restrict access to the database on the old stack, because you could just go in there and write queries and erase all the users, right? You don’t want to let anybody do that.” With Compose, Classcraft was able to select the right tools for the app they were building without all the administrative overhead. “Looking back, I am very happy that we moved to Compose,” Shawn said. “Basically, Compose took the hassle of database management off of our hands so we could focus on what’s most important to us - our product. And I didn’t have to do anything. It’s pretty great!” So how is Classcraft doing these days? “Fantastic!” according to Shawn. They just passed 2.1 million users. The app is available in 75 countries in 10 languages. People are flooding social media with great testimonials and feedback. Classcraft's success is getting noticed in academia too who are publishing papers on their achievements. And then there are schools where kids are dressed up in armor and doing giant dance battles against one another - as part of the game exercise! Occasionally they would get a testimonial from a teacher that says, 'This has completely changed my classroom. It’s the best thing that’s happened to me in my entire career.' “It’s all very humbling”, said Shawn. “Thanks to Compose for having our back as we set to make an impact in the educational sector.” To learn more about Classcraft Studios and their platform, visit: https://www.classcraft.com/ . -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by: Classcraft Arick Disilva works in Product Marketing at Compose. Love this article? Head over to Arick Disilva ’s author page and keep reading.RELATED ARTICLES Jun 16, 2015MYSTRO MODERNIZES MASSAGE THERAPY We're always excited to hear from our customers on how they're using Compose. Sheree Evans, a co-founder at Mystro, emailed o… Jon Silvers Mar 15, 2017USE ALL THE DATABASES – PART 2 Loren Sands-Ramshaw, author of GraphQL: The New REST shows how to combine data from multiple data sources using GraphQL in p… Guest Author Mar 2, 2017USE ALL THE DATABASES - PART 1 Loren Sands-Ramshaw, author of GraphQL: The New REST, shows how to combine data from multiple sources using GraphQL in this W… Guest Author Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","Classcraft gamifies the whole classroom experience, making education a fun adventure for both students and teachers. 
We chatted with Shawn Young, ex-teacher, programmer, and founder of Classcraft Studios about their teaching platform built on Meteor.js and their use of Compose for MongoDB and Elasticsearch.",Making the Most of Compose – Customer: Classcraft,Live,165 448,"Compose The Compose logo Articles Sign in Free 30-day trialPUSH NOTIFICATIONS WITH MONGODB Published Jul 18, 2017 mongodb push notifications firebase Push Notifications With MongoDBPush notifications are a staple of mobile and Internet of Things applications, and in this Write Stuff contribution Don Omondi, Founder and CTO of Campus Discounts , demonstrates how to leverage Compose MongoDB to send more effective push notifications. Today’s technology has seen a sharp rise in connected devices, popularly known as the Internet of Things (IoT). Applications now live in watches, shoes and, perhaps rather oddly, in salt shakers too! The IoT surge has also posed a few challenges for developers, one of them being how to send notifications to the plethora of connected devices. The main problem arises from the fact that different devices have different ways of subscribing to, receiving, and unsubscribing from notifications. We’ll see how to tackle this problem using MongoDB but first a little background information. WHAT ARE PUSH NOTIFICATIONS? A push notification is a message that is ""pushed"" from a backend server or application to a user interface such as mobile applications and desktop applications. A lot of developers make use of a notification service to send push notifications. A notification service provides a means to push notifications to many devices at once and may include other features such as delivery reports and analytics. PUSH NOTIFICATIONS WITH FIREBASE CLOUD MESSAGING So with the preliminaries out of the way, let’s see how we can integrate Firebase Cloud Messaging (FCM), Google’s free notification service into a MongoDB powered backend. Through FCM we can send notifications to any service worker enabled browser (Chrome, Firefox, and Opera with Edge coming soon) as well as native Android & IOS applications. To push with FCM, all we need to do is create an FCM app which will give us a server key. Thereafter, using either the Web, Android or iOS SDK generate an FCM client token once the user grants permissions (A practical example coming a bit later). If you google around, you might be surprised to find that there are a number of people who’ve had a bit of trouble finding the GCM settings. You’ll have to click the settings icon/cog wheel next to your project name at the top of the Firebase console, then click on Project settings, and finally select the Cloud Messaging tab. Armed with a server key and client token pair, sending a push notification is performed by a simple POST request to the FCM endpoint with an authorization header containing the key and a JSON encoded body of the notification with the client token in the ""to"" field like: https://fcm.googleapis.com/fcm/send Content-Type: application/json Authorization: key=AIzaSyC...akjgSX0e4 { ""notification"": { ""title"": ""Message Title"", ""body"": ""Message body"", ""click_action"" : ""https://dummypage.com"" }, ""to"" : ""eEz-Q2sG8nQ:APA91bHJQRT0JJ..."" } The POST response will respond indicating whether the push notification was sent successfully or failed. 
{ ""multicast_id"": 7986976529786388478, ""success"": 1, ""failure"": 0, ""canonical_ids"": 0, ""results"": [{ ""message_id"": ""0:1496965028924567%e609af1cf9fd7ecd"" }] } That is really all it takes to send push notifications, but for many real life applications, it mustn’t stop there. It’s important to note that users don’t subscribe to push notifications but devices do, so we’ll have to find a way to link a client token to a user. This means saving the data in a store somewhere. We may also be interested in granting a user a subscription management interface as well as logging notifications. Let’s see why MongoDB is a good fit for this data store. SOME REASONS TO USE MONGODB FOR PUSH NOTIFICATIONS Storing Device Metadata: Many times you’d want to store some metadata about the device that has subscribed to push notifications, such as the browser vendor and version, or the toaster serial number or perhaps the salt shaker color. With nearly an infinite number of connectable devices, you’d really want a schemaless database for this. Handling Shared Devices: A lot of people share devices, whether publicly like when using a cyber-café or privately when browsing on a friend's laptop, tablet or phone. They might not unsubscribe from notifications which the notification service provider will continue to happily deliver. We can mitigate this by setting a time to live (TTL) that automatically removes subscriptions that are not renewed within a given time. MongoDB has us covered here, too. One User Many Devices: With the increase in connectable devices, it’s now common for one user to own many devices that use your application. For performance reasons, embedding a list of devices in one document per user would ensure maximum efficiency in many use cases. Logging: You may also be interested in getting an overview of the recently pushed notifications. This can be useful for example to delete subscriptions that repeatedly fail to be delivered. MongoDB’s capped collection would be a perfect fit for this use case. A PRACTICAL EXAMPLE: BLOG Let’s say we have a blog and want to subscribe users to receive push notifications for example on new posts, comments or likes. Our blog will store each subscription in a MongoDB document using a sample schema below. { ""_id"": ObjectId, ""token"": String, ""subscribed_on"": Date, ""user_id"": Integer, ""fingerprint"": String, ""details"": [ ""browser"" : String, ""os"" : String, ""osVersion"" : String, ""device"" : String, ""deviceType"" : String, ""deviceVendor"" : String, ""cpu"" : String ] } We already know we need to store two fields in the FCM, client token and the user_id . We'll also want to know the time a user subscribed to receive a push, which we'll store in the subscribed_on field. Furthermore, to help a user manage their subscriptions, we’ll need to store a device’s information like the operating system and version, browser vendor, and others. This way you can help a user associate a FCM notification endpoint to a device. We created a details array field to store such arbitrary data. 28th June, 2017 via Chrome on Android 5.1 Finally, let’s assume we also want to reduce the number of duplicate subscriptions, which can happen when people share devices or when the notification service generates a new subscription Universally Unique Identifier (UUID) for the same device. Duplicate data is bad because it can lead to different notifications being sent to the same device but for different users. 
So we’ll need to create a field to store a value that can fairly accurately identify a device, to achieve this, we’ll use a technique called device fingerprinting. A device fingerprint , also sometimes called a machine fingerprint or browser fingerprint is information collected about a remote computing device for the purpose of identification. Fingerprints can be used to fully or partially identify individual users or devices even when cookies are turned off. With the document schema ready, we’ll need to create a MongoDB collection to hold them, let’s create one called ‘push_notifications’ from mongo shell > db.createCollection(‘push_notifications’) From MongoDB 3.2 and beyond we can enforce some level of strict schema by using document validation. You can read more about it as well as find some examples in Document Validation in MongoDB By Example . In our example, we want to ensure that every subscription document has a non-null subscribed_on field with a data type Date . We also need a non-null device fingerprint value of type string and a non-null user_id value of type Int . Let’s enforce it with this validation. > db.createCollection( ""push_notifications"", { validator: { $and: [ { token: { $type: ""string"" } }, { token: { $exists: true } }, { subscribed_on: { $type: ""date"" } }, { subscribed_on: { $exists: true } }, { user_id: { $type: ""int"" } }, { user_id: { $exists: true } }, { fingerprint: { $type: ""string"" } }, { fingerprint: { $exists: true } } ] } } ) For speedy lookups on documents matching certain client tokens, let’s create an index on the token field. We can also declare this index as unique so as to prevent duplicate subscriptions from different users using the same token. db.push_notifications.createIndex( { ""token"": 1 }, { unique: true} ) Since we wish to have subscriptions automatically removed after a certain amount of time, let’s create a TTL index to tell MongoDB to delete documents after some time. db.push_notifications.createIndex( { ""subscribed_on"": 1 }, { expireAfterSeconds: 604800 } ) This will purge documents whose subscribed_on field’s value is greater than or equal to 1 week from the time MongoDB runs its background checks. To keep active subscriptions, update the subscribed_on field periodically, for example, once a day, so that they are always less than 1 week old. To enable us to quickly look up all subscriptions for a specific user, let’s create an index on the user_id field. db.push_notifications.createIndex({ user_id: 1 }) If your app allows non-logged in users to subscribe to push, then you can make this a partial index from the MongoDB shell as follows. db.push_notifications.createIndex({ user_id: 1 } , { partialFilterExpression: { user_id: { $exists: true } } }) Also, don’t forget to remove the user_id requirements from the document validation. For MongoDB versions prior to 3.2, use sparse indexes instead. That’s it, with the database schema all set up, our backend is ready to push. It’s now up to our frontend to send information so our backend can know to whom. For that, we’ll use some JavaScript. The procedure is to first ask the user for permissions, then register the device for push notifications and pass its UUID, fingerprint as well as some details to our backend. SETTING UP THE JS CLIENT For getting the browser fingerprint on the client side, we can use the conveniently named library clientjs . 
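Before wiring up that client code, it is worth sketching what the backend might do when a client posts its token, fingerprint and device details. This is a minimal mongo-shell-style sketch, not from the original article, and the token, user id and fingerprint values are placeholders. Upserting on the token, which carries the unique index, means a repeat subscription from the same device simply refreshes subscribed_on, so the TTL index keeps active devices alive and quietly expires abandoned ones.

db.push_notifications.update(
  { token: 'eEz-Q2sG8nQ:APA91bHJQRT0JJ...' },   // the FCM client token (placeholder)
  {
    $set: {
      subscribed_on: new Date(),                 // refresh the TTL clock
      user_id: NumberInt(42),                    // the validator expects an int
      fingerprint: 'c2c8a2b5e0f14c3d',           // placeholder fingerprint string
      details: [ { browser: 'Chrome', os: 'Android', osVersion: '5.1' } ]
    }
  },
  { upsert: true }
)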
clientjs also allows us to get device specific information such as the OS, OS version, CPU type, Device Type and more which we’ll use to fill our details array. We’ll also need to create a small service worker script called firebase-messagingsw.js which is also responsible for enabling background pushes. // Give the service worker access to Firebase Messaging. // Note that you can only use Firebase Messaging here, other Firebase libraries // are not available in the service worker. importScripts('https://www.gstatic.com/firebasejs/4.1.2/firebase-app.js'); importScripts('https://www.gstatic.com/firebasejs/4.1.2/firebase-messaging.js'); // Initialize the Firebase app in the service worker by passing in the // messagingSenderId. firebase.initializeApp({ 'messagingSenderId': prev.add(cur(""revenue""))) In the command above, we're first referencing our table ""revenue_by_month"" and then selecting the documents from it in order by month - exactly as we showed in the section above to retrieve all the documents in the table. In our case, we could just as easily order by id since our documents are in order that way as well, but the example demonstrates that you could put the documents in any order you choose based on one or more fields. The fold command will process the input according to the order you specify. Next, we're calling fold and passing in a base value of 0. If our fiscal year for some reason included December 2014 and we knew that value, we could pass that in as the base instead. For our purposes here, we want to just get the cumulative revenue for 2015 so we want to start at 0. In the next part of the command we're naming two variables, ""prev"" for the first input to the process and ""cur"" for the current row's input. Finally, we're telling fold what we want it to output. In this case, we're telling it to add the ""prev"" value to the ""cur"" value using the ""revenue"" field. What happens is this: as the first step, our base value of 0 comes in as ""prev"". That gets added to the January revenue, the ""cur"" value of 89750. We now have 89750. It becomes the new ""prev"" and we add to it the February revenue value of 100327. We now have 190007. It becomes the new ""prev""... and so on. From this, we'll get the sum of the 12 months of revenue values: 1469221 So, why not just use the sum aggregation? Yes, of course we could, but if we were doing something other than add where the order of the processing of the values was important to the outcome or where being able to pass in a base value that was not part of the dataset was required, then you can see how fold sets itself apart from reduce and concatMap . For our example, the beauty of what we want to achieve - where those benefits become apparent - actually comes with the emit function of fold . We'll look at that next. EMIT emit outputs an array where each element represents one step in the process. Let's look at that function: r.table(""revenue_by_month"").orderBy(""month"") .fold(0, (prev, cur) => prev.add(cur(""revenue"")), {emit: (prev, cur, ytd) => [ytd]}) Now we've called the emit function and we've specified three variables (it requires three so even if you don't want all three, you still need to specify them). For us, that's the ""prev"" and ""cur"" we used for processing the sequence and now we've added in a variable for the outcome of each of those variables being added together called ""ytd"". For our emit we're not doing anything much fancy. 
We're simply outputting the ""ytd"" array, which will show us the value for each step of the process. In the API reference, you'll see the examples use the branch function which applies an ""if... then... else..."" logic to the output, but for our example, our outcome does not require any advanced logic to be applied. Here's what emit returns to us for the ""ytd"" array: element | value ----------------- 1 | 89750 2 | 190077 3 | 286786 4 | 399621 5 | 525407 6 | 649701 7 | 776435 8 | 902376 9 | 1048599 10 | 1207724 11 | 1330831 12 | 1469221 As you can see, we get the year-to-date total for each month. So, at the end of April, the year-to-date revenue was 399621. Ah, yes... that's what fold can do for us in this example! NEXT STEPS With the ""ytd"" array, then, as an output of fold , we could choose to perform any other functions on it by applying a then and a do . The array is now open for additional processing. That's outside the scope of this article, however, but we hope we've made the fold command a little less esoteric for you in this article. If you deploy RethinkDB this month (July 2016), you can get a limited edition t-shirt ! Image by: Michael Gaida Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Earlier this year, RethinkDB released their Fantasia 2.3 version. In this article, we're going to take a closer look at one of the lesser-known features that came out with that release - the aggregation fold command.",Deeper into RethinkDB 2.3,Live,171 469,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectSIMPLE DATA PIPE CONNECTORSptitzler / April 7, 2016Earlier this year, we released a new version of the Simple Data Pipe application . This app lets you load data from the source of your choice directly intoCloudant. You just create a data pipe configuration and run it.Here, the Simple Data Pipe app loaded 26 case records from Salesforce by runningthe pipe configuration salesforce_case .The Simple Data Pipe app is a framework to create, modify, delete, and run datapipe configurations. When an app user chooses a pipe configuration (like source : Saleforce , and dataset : case ) and runs it, the Simple Data Pipe framework invokes a data-source-specificconnector (in this case, the Salesforce connector) to perform the actual datamovement. The connector interprets the configuration and moves the appropriatedata into Cloudant.A data pipe configuration contains information about the data source,authentication information, and source data set information. A pipeconfiguration depends upon the connector and the choices a user makes.Simple Data Pipe loads a configuration-specific connector from the GitHub repothat contains the connector implementation.Connectors handle data movement from the cloud data source to Cloudant, by: 1. connecting to the source using OAuth (if secure access is required) 2. retrieving the requested data sets 3. optionally enriching them with data from other sources, and 4. 
storing the results as JSON documents in Cloudant for later processingConnectors copy data from the source into Cloudant databases.The Simple Data Pipe app ships with built-in connectors for Salesforce and Stripe . Additional connectors are available and you can deploy them as add-ons, providing access to a variety of datasources, like Reddit, Slack, and Trello.We developed the connectors that exist so far to facilitate our own dataanalysis projects. As part of this work, we updated the Simple Data Pipe to makeit easier to build new custom connectors for other popular data sources.CLOUD DATA SOURCE AUTHENTICATIONConnectors can now take advantage of the popular Passport authentication middleware for Node.js to establish secure connectivity withdata sources. This eliminates the need to manually implement the entire OAuthauthentication flow. Take a look, for example, at the Slack connector . To implement authentication, we * added the passport-slack strategy as a module dependency, * configured the strategy, and * specified the OAuth scopes required by the Slack API calls (fetch list of channels and fetch messages in channel) we intended to use.With hundreds of strategies to choose from, chances are good that there's onefor the data source you need. If there isn't one yet, why not implement ityourself and publish on GitHub ?JUMPSTARTING CONNECTOR DEVELOPMENT USING BOILERPLATESTo make it even easier for you to get started, we created a couple connectorboilerplates for popular cloud data sources. These boilerplates haveauthentication support baked-in, which lets you focus on what's important: dataretrieval. Check out our connectors page to see the list.DATA RETRIEVAL AND ENRICHMENTThe Simple Data Pipe framework does not impose any restrictions on how to fetch,manipulate, and optionally enrich data. Browsing through our catalog, you'll seethat some connectors use vendor-provided API libraries (like stripe.com ), some use third-party API wrapper libraries (like this lightweight one for slack ), and some call the REST API endpoints directly via HTTP(S) requests.DATA STORAGE AND OUTPUT FORMATWhen you start a data pipe run, the Simple Data Pipe app automatically creates adedicated Cloudant database for each data set the connector processes. Atruntime, the Simple Data Pipe framework provides the connector with a callbackto be invoked whenever individual records or sets of records need to be writtento the Cloudant database. There are no constraints as to what structure recordshave to use—you can pick whatever makes the most sense in the context of how thedata will be consumed. For example, our connector for Reddit flattens data structures to support processing by the Spark-Cloudant connector , whereas others preserve the data structures returned by the API.DATA ENRICHMENTSimple Data Pipe connectors can simply load data from the cloud data source(like the salesforce.com connector). Or they can be smarter and combine fetcheddata with information obtained from other cloud data sources to provide avalue-added service, as in the following two examples: * The social media connector for Reddit uses Watson Tone Analyzer to gauge the tone of user comments. A complete use-case scenario based on this data is nicely illustrated in Chetna's blog post . * The connector for flightstats.com combines flight status information with weather data for the departure airport.TRY ITWhat do you think? 
Does the Simple Data Pipe sound like something that would streamline some of your projects? Try it out: Deploy the Simple Data Pipe app , and load data with a built-in connector. Next, deploy an add-on connector . If we've won you over by then, go whole-hog and create a custom connector of your own! Let us know how it goes. We'd love to hear from you and collaborate on GitHub.","Import from the cloud data source of your choice. How connectors let you load data from a variety of sources, through the Simple Data Pipe, and into Cloudant.",Simple Data Pipe Connectors,Live,172 472,"Formulated.by, Formulating Digital and Face-To-Face Experiences, Oct 31 -------------------------------------------------------------------------------- 10 MUST ATTEND DATA SCIENCE, ML AND AI CONFERENCES IN 2018 The keynote stage at Strata Data Conference in London (source: O’Reilly Conferences via Flickr 2017) Technology is advancing and new methods of designing effective business-progressing tools are emerging. Data Science conferences are not only about discovering the latest trends in the field, but also about building connections and networks of people who will fulfill your career and personal goals. Advancing is about continuous learning and we believe that the best way of doing this is by joining passionate people willing to share their work insights and innovations. Read about and check out our top ten 2018 Data Science, Machine Learning and Artificial Intelligence conferences. Enjoy! 1. KDD KDD (Knowledge Discovery and Data Mining) is an annual international conference focusing on science and engineering topics. It brings together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data. In 2018, the conference will take place in London from 19 to 23 August. 2. DATA SCIENCE SALON Focusing on media and entertainment in Los Angeles (December 14th), where the conference was born, and on finance and technology in its Miami spin-off, the Data Science Salon is a destination conference which brings together specialists face-to-face to educate each other, illuminate best practices, and innovate new solutions in a casual atmosphere.
Data Science Salon is a one-two day conference including workshops targeting executives, senior data scientists, developers, and business development professionals alike. After 2 successful encounters in 2017, Data Science Salon will come to Miami on 8–9 of February discussing Artificial Intelligence and Machine Learning in the fields of finance and health technology. 3. STRATA Strata covers a wide range of topics varying from Machine Learning, Data Engineering and Architecture, Big Data to Visualization, Cybersecurity, Law, Ethics and Data-Driven Business Management. Data case studies are shared by experts around the world together with their best practices, effective new analytic approaches, and exceptional skills. Strata targets an audience of data scientists, analysts, and executives. In 2018, three sessions are scheduled in San Jose on 5–8 March, in London on 21–24 May and New York on 11–14 September . 4. NIPS In 2018, the annual conference of Neural Information Processing Systems (NIPS) will meet for its thirty-second time. It is a single-track machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of refereed papers. It will take place in Montréal, Canada on 3–8 December. 5. AI CONFERENCE Exploring the most essential issues and innovations in Applied AI, the Artificial Intelligence Conference targets a wide range of topics on the subject. From AI impact on business and society to implementing AI projects and its models and methods, this conference brings the growing AI community together to share executive briefings, case studies and industry-specific applications. In 2018, there will be four sessions, the first one being held in Mandarin Chinese in Beijing on 10–13 April , followed by New York on 29 April — 2 May , San Francisco on 4–7 September and London on 8–11 October. 6. DATA SCIENCE POP-UP The Data Science Pop-up is a day long conference which brings together data science managers who are passionate about asking the right questions and identifying problems worth solving. Share ideas, develop best practices, and network with others in the field. The main focus is on presenting real stories about the cutting edge work being done today. Catch the last data science pop-up of 2017 in Chicago on November 14th . In 2018 the conference will be held in New York in February, San Francisco in May and in London in October. 7. ODSC EAST & WEST The Open Data Science conference is about accelerating your data science knowledge, training, and network. The event speakers include some of the core contributors to many open source tools, libraries, and languages. Topics discussed include the latest AI & data science topics, tools, and languages from some of the best and brightest minds in the field. In 2018, the ODSC East will take place in Boston on 1–4 of May and ODSC West will meet in San Frinciso, California. 8. MLCONF MLconf gathers communities to discuss the recent research and application of Algorithms, Tools, and Platforms to solve the hard problems that exist within organizing and analyzing massive and noisy data sets. MLconf events host speakers from various industries, research and universities.Each event is a single-track, single-day event, composed of 14–16 presentations around 25 min each. The goal of this format is for attendees to take home practical tips and methods to apply in their own work; as well as cited papers, code samples and work to reference for their own research. Date and location TBA. 
9. ENTERPRISE DATA WORLD The Enterprise Data World (EDW) Conference will meet for its 22nd time in San Diego , California on 22–27 April. EDW is unique in being considered the most comprehensive educational conference on data management in the world. The six-day conference consists of in-depth tutorials, hundreds of hours of presentations on educational material and two-day workshops. Topics discussed by distinguished speakers include Data Governance and Stewardship, Data Architecture, Modeling, Metadata Management, NoSQL and Database Technologies, Data and Information Quality, Business Intelligence, Analytics, Data Science, Big Data and Enterprise Information Management, and much more. 10. RSTUDIOCONF RStudio conference is about all things R and RStudio. In 2018 more optional Training Days workshops for people newer to R and for advanced users and administrators will be added. Three conference tracks will be available; one focusing on the fundamentals of data science with R, another for more experienced RStudio users on advanced capabilities, R in “production” and interoperability, and a third one on solutions to interesting problems. Next year, the conference will be held in San Diego, California on 1–3rd of February. * Data Science * Machine Learning * AI * Deep Learning * Big Data 3 Blocked Unblock Follow FollowingFORMULATED.BY Formulating Digital and Face-To-Face Experiences * 3 * * * Never miss a story from Formulated.by , when you sign up for Medium. Learn more Never miss a story from Formulated.by Blocked Unblock Follow Get updates","Technology is advancing and new methods of designing effective business-progressing tools are emerging. Data Science conferences are not only about discovering the latest trends in the field, but…","10 Must Attend Data Science, ML and AI Conferences in 2018",Live,173 474,"INTRODUCING THE SIMPLE SEARCH SERVICE Glynn Bird / January 21, 2016 Turning your spreadsheet or mysql.dump into a faceted search engine just got alot easier. Try out our new Simple Search Service, built to help you create andmanage a useful polished search engine for your own site or app.I’ve blogged before about turning spreadsheet data into a faceted search engine . That tutorial has a few basic steps: 1. sign up for an IBM Cloudant NoSQL database account 2. use couchimport to import your spreadsheet data into Cloudant 3. instruct Cloudant to index the data using a Design Document 4. perform a Cloudant Search queryIf you’re familiar with NoSQL databases and Cloudant or Apache CouchDB inparticular, you should find those steps relatively easy to follow. But forsomeone new to NoSQL, there’s a lot to learn in there before hitting the searchAPI: JSON, command-line tools, design documents, and Lucene query syntax to namejust a few.The Cloud Data Services Developer Advocacy team is always looking to make thingsas easy as possible. To that end, we are today unveiling the Simple SearchService, which greatly simplifies the steps to turning your tabular data into afaceted search engine.To try it out, visit the Simple Search Service repository on Github and click the Deploy To Bluemix button. This will install the code in your IBM Bluemix account, connect theservices it needs and give you a simple web front-end that lets you import andindex your spreadsheet data. 
(Bluemix has a free trial, so it won’t cost youanything to try out Simple Search Service in the first month.)WHAT IS THE SIMPLE SEARCH SERVICE?Simple Search Service is a Node.js app that you can get and use immediately bydeploying to the IBM Bluemix platform-as-a-service with a couple of mouse clicks. Deployment gets you yourown working instance of the app, automatically provisions a Cloudant account,attaches it to the service, and presents a web app that lets you upload a datafile. When you upload data, it’s automatically imported into Cloudant, withevery field indexed for search.Simple Search Service then exposes a RESTful search API that your applicationcan use. The API is CORS-enabled, so your client-side web app can use it withoutissue. The API is also cached, meaning that it stores popular searches in anin-memory data store for faster retrieval, giving your application betterperformance.UPLOADING DATAThe Simple Search Service home page invites you to upload your CSV(comma-separated file) or TSV file (tab-separated file):Uploading a CSV or TSV is easySimple Search Service expects the first line of the file to contain the columnheadings like this:transaction_id description price customer_name date 42 Pet food 24.22 Jones 2015-04-02 43 Cake 9.99 Smith 2015-04-02File format must be comma or tab-separated and filenames must end in either .csv or .tsv .Simple Search Service will accept the following data types: * strings * numbers * booleans * arrays of strings (separated by commas)Records like this:person_id first_name last_name score passed tags 1 Glynn Bird 45.3 true uk,tall,glasses 2 Mike Broberg 24.1 false us,short,funnywould be turned into the following JSON documents:{ ""person_id"": ""Glynn"", ""last_name"": ""Bird"", ""score"": 45.3, ""passed"": true, ""tags"": [""uk"", ""tall"", ""glasses""]}{ ""person_id"": ""Mike"", ""last_name"": ""Broberg"", ""score"": 24.1, ""passed"": false, ""tags"": [""us"", ""short"", ""funny""]}The values within the score and passed fields are not wrapped in quotation marks. That’s because they’re not strings,they’re numbers and boolean values. Simple Search Service will, in most cases,detect the data types by examining the first few lines of the file but alsogives you the opportunity to override.At this point you may also choose which fields you would like to be “faceted”,by ticking the facet box next to each field:Specify facets on fields upon data importChoose fields you’d use to group your data. Faceting counts the occurrences ofeach field value in a result set. This gives someone searching your data aninsight into the composition of the dataset at a glance. The fields you want tofacet are usually ones where the values tend to repeat frequently, like these: * category names * tags * enumerationsYou can see an example of faceted search results in the guitarsexample app for the tutorial I mentioned at the start of this article. Thefaceted fields (type, range, brand, country, year) appear to the right of theresult set and have been programmed to act as secondary filters within thesearch results.What makes a good facet?SIMPLE SEARCH SERVICE APIThe Simple Search Service API is a simplified version of the Cloudant Search API. With Simple Search Service, there are only two parameters: * q – the query you wish to perform (default = : ) * cache – whether to cache search results (default = true)The API is expecting GET requests to /search e.g. /search?q=brand:fender . 
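Because the API is CORS-enabled, a client-side call can be as simple as the sketch below. This is not from the original tutorial: the host name is a placeholder for wherever your own instance is deployed, and the facet field names will match whatever you ticked during import. The response contains a rows array with the matching documents, a total_rows count and a counts object with the facet totals.

// Placeholder URL - use the route Bluemix assigned to your deployment
var SEARCH_URL = 'https://my-simple-search-service.mybluemix.net/search';

fetch(SEARCH_URL + '?q=' + encodeURIComponent('brand:fender'))
  .then(function (response) { return response.json(); })
  .then(function (data) {
    console.log(data.total_rows + ' matches');
    data.rows.forEach(function (row) {
      console.log(row);            // each matching document
    });
    console.log(data.counts);      // facet counts for the faceted fields
  })
  .catch(function (err) { console.error(err); });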
Here are some example queries: * q=*:* – return everything * q=brand:fender – a field search looking for a specific value of the field ‘brand’ * q=brand:fender OR brand:gibson – a more complicated fielded search with an ‘OR’ clause * q=blonde+fender+telecaster – a simple, free-text searchUnder the hood, Simple Search Service adds additional parameters to ensure thatthe document body is returned, that counts of faceted fields are returned, andthat the returned JSON is simplified.Simple Search Service automatically caches all search results for an hour. Youcan override this behaviour by adding a cache=false parameter to each Simple Search Service API search request.USING REDIS AS A CACHEBy default, Simple Search Service uses an in-memory hash table to cache commonsearch results. This is fine for testing, but if you are going to multipleSimple Search Service nodes then it makes sense to have a centralised cache. Redis is an in-memory database and can be easily integrated into a Simple SearchService installation. To do so: 1. Sign up for an account at compose.io 2. Create a Redis cluster and make a note of URL and password of your cluster 3. In Bluemix, add a Redis by Compose service, ensuring that you name it Redis by Compose — with no appended characters 4. Configure your Bluemix Redis service with the URL and password from your Compose.io accountAdd a centralised cache with Redis by Compose to scale up your deploymentWhen Simple Search Service reboots, it will detect the attached Redis serviceand use that for its caching layer.TRY ITSee for yourself. Visit the Simple Search Service repository on Github to preview the code, or click the Deploy To Bluemix button below. After it deploys, click the View your app button and upload your data. Happy searching!SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.","Quickly create a faceted search API for use in your own apps with open source code for Cloudant & Redis, from IBM's CDS developer advocacy team.",Introducing the Simple Search Service: Faceted search API made easy,Live,174 476,"Nikole Mcleish Blocked Unblock Follow Following Jul 28 -------------------------------------------------------------------------------- GENERATING POEMS: A WAY WITH WORDS AND CODE HOW I BUILT A POEM GENERATOR APP USING WATSON APIS Editor’s note: This article marks the first in an occasional series by the 2017 summer interns on the Watson Data Platform developer advocacy team, depicting projects they developed using Bluemix data services, Watson APIs, the IBM Data Science Experience, and more.Some people have a way with words. Others have a way with code. If you’re using my Poem Generator project , you don’t necessarily need either. The application creates poems based on user input. Using the Watson Tone Analyzer service , the user’s feelings are scored on a scale of 0 to 1. The Poem Generator then uses these feelings to craft a poem. An example poem from my Poem Generator application.HOW WAS IT BUILT? The Poem Generator is a Flask web application. Flask is a Python microframework for web development. The application uses Watson Natural Language Understanding , in addition to the Tone Analyzer service, both available on the IBM Bluemix platform. Tone Analyzer analyzes the emotional sentiment of a text. Natural Language Understanding can extract topics from text with keywords and entities. ""document_tone"": { ""tone_categories"": [ { ""category_id"": ""emotion_tone"", ""tones"": [ { ""tone_name"": ""Anger"", ""score"": 0.000105, ""tone_id"": ""anger"" }, { ""tone_name"": ""Disgust"", ""score"": 0.001659, ""tone_id"": ""disgust"" }, { ""tone_name"": ""Fear"", ""score"": 0.026971, ""tone_id"": ""fear"" }, { ""tone_name"": ""Joy"", ""score"": 0.066884, ""tone_id"": ""joy"" }, { ""tone_name"": ""Sadness"", ""score"": 0.946133, ""tone_id"": ""sadness"" } ], ""category_name"": ""Emotion Tone"" }, ... ] } In addition, the application uses a PostgreSQL database and offers in-app database management. In Bluemix, you can create a PostgreSQL database using either Compose for PostgreSQL or with ElephantSQL . HOW DOES IT WORK? ADDING LINES The Poem Generator keeps a database of lines and their emotional content, if any. When a user enters a line, the application calls the Tone Analyzer service and receives scores on the emotional content of that line. If a score for a particular emotion is high enough, that emotion will be marked. The Poem Generator is bootstrapped with a database of scored lines from various verses.Lines that do not score high enough for any emotions are marked as fillers. These lines have no distinct emotional content, allowing them to be placed in any poem without affecting the tone. Alternately, the application allows users to import and export multiple lines as well. GENERATING POEMS Users receive the input prompt, “ How are you feeling?” If there is emotional content in their input, the generator will gather all lines that register any emotion. The application then randomly selects lines to craft a 5-line poem. The generator will also randomly determine and select a filler line for lines 1, 3 and 5. WHAT OTHER FEATURES DOES IT HAVE? When generating poems, users can toggle certain features. The first is Word Replacement . 
This feature uses Watson Natural Language Understanding to extract keywords from the user’s input. Then this service will analyze the generated poem for keywords. If both contain keywords, then a random keyword from each will be swapped. I added this feature to allow more personalization to the poems. Note the various options along the bottom of the UI that you can toggle on and off.Other features include options to change how the application selects emotions. The Dominant Emotion feature will select the highest-scoring emotion and create a poem based on that. The Shared Emotions feature will select lines that match the range of emotions flagged in the user’s input. This feature requires lines in the database that can simultaneously satisfy multiple emotions. Lines that have more than one prominent emotion, however, are more difficult to find, causing this feature to rarely generate successful poems. The last feature for generating poems is No Fillers . Fillers are lines that do not exhibit strong emotions. When users select this feature, the generator will not add filler lines to the poem. DATABASE MANAGEMENT The Poem Generator uses a PostgreSQL database. When using the application for the first time, it creates a table if one does not already exist in the database. Eight columns comprise this table: id , line , anger , disgust , fear , joy , sadness , and filler . The database prevents duplicates by requiring unique lines. When creating a table for the first time, a collection of modern lines will be automatically added. You can view all the lines in the database via the app’s UI. These lines can be added, modified and re-scored, or deleted in the application. There are also options to export all lines in your database to a CSV file. You can do a bulk import of lines using a CSV too. Additionally, you can delete all lines from the table in your PostgreSQL database. CURRENT LIMITATIONS AND FUTURE IMPROVEMENTS People could potentially use the Poem Generator to create stories or songs. One limitation, however, was the context of a line’s location in a particular poem. When analyzing the tones of a line of poetry, the emotional content does not always match the emotions of the overall poem. Thus, a poem containing lines marked as sad may exhibit other emotions, like fear or anger. Looking toward future improvements, this generator could use machine learning to determine which lines are better together. If users could rank or like certain poems, the overall emotional content, line placement, and lines included could be used to create better poems. It was my first time working with the Watson APIs, and I hope it serves as a useful example to others just getting started as well. Give it a try. How are you feeling today? Filler line: If you enjoyed this article, please ♡ it to recommend it to other Medium readers. Thanks to Mike Broberg and G. Adam Cox . * Cognitive Computing * Ibm Bluemix * Ibm Watson * Python Flask * Postgres Blocked Unblock Follow FollowingNIKOLE MCLEISH FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","Some people have a way with words. Others have a way with code. If you’re using my Poem Generator project, you don’t necessarily need either. The application creates poems based on user input. 
Using…",A Way with Words and Code – IBM Watson Data Lab – Medium,Live,175 478,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe ×BLOGS TOP ANALYTICS TOOLS IN 2016 Post Comment June 10, 2016 by Gaurav Vohra CEO & Co-Founder, Jigsaw Academy, The Online School of Analytics Follow me on LinkedIn , TwitterData analysis is not cut and dried, providing results in absolute terms. Rather, many tools, techniques and processes can help dissect data, structuring it into actionable insights. As we look toward the future of data analytics, we can expect certain trends in tools and technologies to dominate the analytics space: * Data analysis frameworks * Visualization frameworks * Model deployment frameworks DATA ANALYSIS FRAMEWORKS Open-source frameworks such as R, with its increasingly mature ecosystem, and Python, with its pandas and scikit-learn libraries, seem poised to continue their dominance of the analytics space. In particular, certain projects in the Python ecosystem seem ripe for quick adoption: * blaze Modern data scientists work with myriad data sources, ranging from CSV files and SQL databases to Apache Hadoop clusters. The blaze expression engine helps data scientists use a consistent API to work with a full range of data sources, lightening the cognitive load required by use of varied frameworks. * bcolz By providing the ability to do processing on disk rather than in memory, this interesting projects aims to find a middle ground between using Hadoop for cluster processing and using local machines for in-memory computations, thereby providing a ready solution when data size is too small to require a Hadoop cluster but not so small as to be handled within memory. R and Python ecosystems, of course, are only the beginning, for the Apache Spark framework is also seeing rapid adoption—not least because it offers APIs in R as well as in Python. Building on a general trend of using open-source ecosystems, we can also expect to see a move toward distribution-based approaches. Anaconda, for example, offers distributions for both Python and R, and Canopy offers a Python distribution geared toward data science. And no one will be surprised if we see the integration of analytics software such as R or Python in a standard database. Beyond open-source frameworks, a growing body of tools is helping business users interact directly with data while helping them produce guided data analysis. Tools such as IBM Watson, for example, attempt to abstract the data science process away from the user. Although such an approach is still in its infancy, it offers what appears to be a very promising framework for data analysis. VISUALIZATION FRAMEWORKS Visualizations are on the verge of being dominated by the use of web technologies such as JavaScript frameworks. After all, everyone wants to create dynamic visualizations, but not everyone is a web developer—or has the time to spend writing JavaScript code. Understandably, then, certain frameworks have been rapidly gaining in popularity: * plotly Offering APIs in Python, R and Matlab, this data visualization tool has been making a name for itself and seems on track for increasingly broad adoption. 
* bokeh This library may be exclusive to Python, but it also offers a strong potential for rapid future adoption. What’s more, these two examples are only the beginning. We should expect to see JavaScript-based frameworks that offer APIs in R and Python continue to evolve as they see increasing adoption. MODEL DEPLOYMENT FRAMEWORKS Many service providers are willing to replicate the SaaS model on premises, notably the following: * Domino Data Labs * Yhat * Opencpu What’s more, in addition to needing to deploy models, we’re also seeing a growing need to document code. Accordingly, we might expect to see a version control system similar to Github but that is geared toward data science, offering the ability to track different versions of data sets. Going forward, we anticipate that data and analytics tools will see increased implementation in mainstream business processes, and we expect such use to guide organizations toward a data-driven approach to decision making. For now, keep your eye on the foregoing tools—you won’t want to miss seeing how they reshape the world of data. Experience the power of Apache Spark in an integrated development environment for data science . Also, join the data science experience and explore how you can use Spark and R to build your own data science applications .","Join us for a look at what’s on the horizon in data analytics, discovering how a broad array of tools aims to change the way we do—and think about—data science.",Top analytics tools in 2016,Live,176 480,"Glynn Bird / April 27, 2016In earlier blog posts I have described a microservice architecture that uses a queue, pubsub channel, or message hub to broker a list of “work”. Each item of work is a block of data—typically aJSON document—that is to be processed, saved, or acted upon in some way.
Icreated the Metrics Collector Microservice , which collects web metrics data from mobile or web apps and writes it to aqueue or pubsub channel using Redis, RabbitMQ, or Apache Kafka. A separateMetrics Collector Storage Microservice consumes the work and writes it to achoice of Cloudant, MongoDB, or ElasticSearch. I then described how othermicroservices could be added to aggregate the streaming data as it arrived. Fortunately Compose.io allows the deployment of Redis, RabbiMQ,MongoDB, or ElasticSearch and IBM offers Cloudant and Apache Kafka as services,so it’s very easy to get started but there are a lot of moving parts.Today I’ll be using a new service, OpenWhisk , which makes it simple to deploy microservices and eliminates the need tomanage your own message broker or deploy your own worker servers.OpenWhisk is an open-source, event-driven compute platform. You send your action code to OpenWhisk and then deliver a stream of data that your OpenWhisk code worksupon. OpenWhisk handles the scaling out of the computing resources needed todeal with the workload; all you deal with is the action code and the data that triggers the actions. You pay only for the amount ofwork that is undertaken, not for servers standing idle waiting for something tohappen.You can write action code in JavaScript or Swift. This means that web developersand iOS developers can create server-side code in the same language as theirfront-end code.GETTING STARTEDThe code snippets and command-line calls made in this blog post assume that youhave signed up for the OpenWhisk beta programme in Bluemix and have alreadyinstalled the “wsk” command-line tool. Visit https://developer.ibm.com/openwhisk/ and click the Try Now button to get started.HELLO WORLDLet’s create a JavaScript file called ‘hello.js’ containing a function thatreturns a simple object:function main() { return {payload: 'Hello world'};}This is the simplest OpenWhisk action; it simply returns a static string as itspayload. Deploy the action to OpenWhisk with:> wsk action create hello hello.jsok: created action helloThis creates an action called “hello” that runs the code found in hello.js . We can run it in the cloud with:> wsk action invoke --blocking hello{ ""payload"": ""Hello world""}We can also make our code expect parameters:function main(params) { return {payload: 'Hello, ' + params.name + ' from ' + params.place};}Then update our action:> wsk action update hello hello.jsok: created updated helloAnd run our code with parameters:> wsk action invoke --blocking --result hello --param name 'Jennie' --param place 'The Block'{ ""payload"": ""Hello, Jennie from The Block""}We’ve created a simple JavaScript function that processes some data and withoutworrying about queues, workers, or any network infrastructure we were able toexecute the code on the OpenWhisk platform.DOING SOMETHING USEFUL WITH OUR ACTIONSWe can do more complex things in our action, such as making API calls. 
I createdthe following action, which calls out to a Simple Search Service instance containing Game of Thrones data, passing in the q parameter:var request = require('request');function main(msg) { var q = msg.q || 'Jon Snow'; var opts = { method: 'get', url: 'https://sss-got.mybluemix.net/search', qs: { q: q, limit:5 }, json: true } request(opts, function(error, response, body) { whisk.done({msg: body}); }); return whisk.async();}We can create this action and give it a different name:> wsk action create gameofthrones gameofthrones.jsok: created action gameofthronesThen call it with a parameter q ;> wsk action invoke --blocking --result gameofthrones --param q 'melisandre'{ ""msg"": { ""_ts"": 1460028600363, ""bookmark"": ""g2wAAAABaANkAChkYmNvcmVAZGI0LmJtLWRhbC1zdGFuZGFyZDEuY2xvdWRhbnQubmV0bAAAAAJuBAAAAACAbgQA____n2poAkY_7PVPoAAAAGHlag"", ""counts"": { ""culture"": { ""Asshai"": 1 }, ""gender"": { ""Female"": 1 } }, ""from_cache"": true, ""rows"": [ { ""_id"": ""characters:743"", ""_order"": [ 0.9049451947212219, 229 ], ""_rev"": ""1-c68720782e2500311125768153d7170b"", ""aliases"": [ ""The Red Priestess"", ""The Red Woman"", ""The King's Red Shadow"", ""Lady Red"", ""Lot Seven"" ], ""allegiances"": [ """" ], ""books"": [ ""A Clash of Kings"", ""A Storm of Swords"", ""A Feast for Crows"" ], ""born"": ""At\ufffd\ufffdUnknown"", ""culture"": ""Asshai"", ""died"": """", ""father"": """", ""gender"": ""Female"", ""mother"": """", ""name"": ""Melisandre"", ""playedBy"": ""Carice van Houten"", ""povBooks"": ""A Dance with Dragons"", ""spouse"": """", ""titles"": [ """" ], ""tvSeries"": ""Season 2,Season 3,Season 4,Season 5"" } ], ""total_rows"": 1 }}WRITING DATA TO SLACK FROM OPENWHISKAnother task we could perform in an OpenWhisk action is to post a message inSlack. Slack has a great API for creating custom integrations: a Slackadministrator can set up an “incoming webhook”, so posting to a channel is assimple as POSTing a string to an HTTP endpoint. We can create a Slack-postingaction with a few lines of code:var request = require('request');function main(msg) { var text = msg.text || 'This is the body text'; var opts = { method: 'post', url: 'MY_CUSTOM_SLACK_WEBHOOK_URL', form: { payload: JSON.stringify({text:text}) }, json: true } request(opts, function(error, response, body) { whisk.done({msg: body}); }); return whisk.async();}replacing MY_CUSTOM_SLACK_WEBHOOK_URL with the Webhook URL that Slack provided when the “Incoming Webhook”integration was created. Notice how this action is executed asynchronously andonly calls back when the request has completed.Then we can deploy and run it in the usual way:> wsk action create slack slack.jsok: created action slack> wsk action invoke --blocking --result slack --param text 'you know nothing, Jon Snow'{ ""msg"": ""ok""}As it happens, Whisk has a built-in Slack integration , but it’s nice to build things yourself isn’t it? 
Then you can perform your own logic and decide whether a Slack message is posted or not, based on the incoming data.
WRITING DATA TO CLOUDANT FROM OPENWHISK
It is relatively simple to write your own custom action to write to Cloudant because you can:
 * 'require' the Cloudant Node.js library in your JavaScript action
 * write data to Cloudant using its HTTP API
The disadvantage of this approach is that you'd have to hard-code your Cloudant credentials in the action code, just as we hard-coded the Slack Webhook URL in our previous example, which isn't best practice. Fortunately, OpenWhisk has a pre-built Cloudant integration which you can invoke without any custom code. If you have an existing Cloudant account, then you can grant access to that Cloudant service on the command line:

> wsk package bind /whisk.system/cloudant myCloudant -p username 'myusername' -p password 'mypassword' -p host 'mydomainname.cloudant.com'

Then you can see a list of connections that OpenWhisk can interact with:

> wsk package list
packages
/me@uk.ibm.com_dev/myCloudant    private binding

where me@uk.ibm.com is my Bluemix username (or the name of your Bluemix organisation) and dev is your Bluemix space. You can write data to Cloudant by invoking the write command of the package:

> wsk action invoke /me@uk.ibm.com_dev/myCloudant/write --blocking --result --param dbname testdb --param doc '{""name"":""George Friend""}'
{
  ""id"": ""656eaeaed0fd47aa733dd41c3c79a7a0"",
  ""ok"": true,
  ""rev"": ""1-a7720095a32c4d1b994ce5e31fe8c73e""
}

LET'S TAKE A BREATH
So far we've created and updated OpenWhisk actions and triggered individual actions as blocking, command-line tasks. The 'wsk' tool lets you trigger actions to run in the background and also chain actions together into sequences, but we are not going to cover those options in this post. Our code has been simple JavaScript blocks in which the “main” function does the work; there is no need to worry about servers, operating systems, or network hardware. OpenWhisk is an event-driven system. You've seen how to create an event by deploying code manually. But how can we set up OpenWhisk to act upon a stream of events?
OPENWHISK TRIGGERS
A Trigger in OpenWhisk is another way of firing events and executing code. We can create a number of named Triggers and then create rules that define which of our actions (our code) are executed against which of our triggers. Instead of invoking actions directly, we would invoke Triggers instead; the rules defined against the triggers decide which action(s) are executed. This lets us chain actions together, so that one trigger causes several actions to occur, and re-use code by assigning the same action code to multiple triggers. Triggers can fire individually, or tie to external feeds such as:
 * the changes feed from a Cloudant database – every time a document is added, updated, or deleted, a trigger fires
 * the commit feed of a GitHub repository – every time a commit occurs, a trigger fires
So we can use a Cloudant database to fire a trigger which writes some data to Slack:

> wsk trigger create myCloudantTrigger --feed /me@uk.ibm.com_dev/myCloudant/changes --param dbname mydb --param includeDocs true

and configure that trigger to fire our Slack-posting action:

> wsk rule create --enable myRule myCloudantTrigger slack

Now every time a document is added, updated or deleted in the Cloudant database, my custom action fires, which in this case posts a message to Slack!
WHAT WOULD I USE OPENWHISK FOR?
OpenWhisk lends itself to projects where you don't want to manage any infrastructure.
You pay only for the work done, and don't waste money on idle servers. OpenWhisk easily manages peaks of activity, as it scales out to meet the demand. Combining OpenWhisk with other “as-a-Service” databases, such as Cloudant, means that you don't have to manage any data storage infrastructure either. Cloudant is built to store large data sets, cope with high rates of concurrency, and provide high availability. As the cost of spinning up an OpenWhisk action is non-zero, it makes sense to use OpenWhisk for non-trivial computing tasks like:
 * processing an uploaded image to create thumbnails, saving them to object storage
 * taking geo-located data from a mobile application and enriching it with calls out to a Weather API
It is also useful for dealing with systems that feature large amounts of concurrency, such as:
 * mobile apps sending data to the cloud
 * Internet of Things deployments where incoming sensor data needs to be stored and acted upon
There are features of OpenWhisk that I haven't touched on, such as Swift support, the ability to use Docker containers as action code instead of uploading source code, and the mobile software development kit.
REFERENCES
 * OpenWhisk
 * OpenWhisk Source Code
 * OpenWhisk iOS SDK
",OpenWhisk makes it easy to deploy microservices and eliminates the need to manage your own message broker or deploy your own worker servers.,Introducing OpenWhisk: Microservices Made Easy,Live,177 484,"TL;DR: Betteridge's law applies unless your JSON is fairly unchanging and needs to be queried a lot. With the most recent version of PostgreSQL gaining ever more JSON capabilities, we've been asked if PostgreSQL could replace MongoDB as a JSON database. There's a short answer to that, but we'd prefer to show you. Ah, a question from the audience: didn't PostgreSQL already have a JSON data type? Yes, it did. Before PostgreSQL 9.4 there was the JSON data type, and that's still available. It lets you do this:

>CREATE TABLE justjson ( id INTEGER, doc JSON);
>INSERT INTO justjson VALUES ( 1, '{""name"":""fred"",""address"":{""line1"":""52 The Elms"",""line2"":""Elmstreet"",""postcode"":""ES1 1ES""}}');

That stored the raw text of the JSON data in the database, complete with white space, retaining the order of keys and any duplicate keys.
Let's show that by looking at the data:

>SELECT * FROM justjson;
 id |             doc
----+--------------------------------
  1 | {                             +
    |   ""name"":""fred"",              +
    |   ""address"":{                 +
    |     ""line1"":""52 The Elms"",    +
    |     ""line2"":""Elmstreet"",      +
    |     ""postcode"":""ES1 1ES""      +
    |   }                           +
    | }
(1 row)

It has stored an exact copy of the source data. But we can still extract data from it. To do that, there's a set of JSON operators to let us refer to elements within the JSON document. So, say we just want the address section, we can do:

select doc->>'address' FROM justjson;
          ?column?
----------------------------
 {                         +
   ""line1"":""52 The Elms"",  +
   ""line2"":""Elmstreet"",    +
   ""postcode"":""ES1 1ES""    +
 }
(1 row)

The ->> operator says: within doc, look up the JSON object with the following field name and return it as text. With a number, it would have treated it as an array index, but still returned the value as text. There's also -> to go with ->>, which doesn't do that conversion to text. We need that so we can navigate into the JSON objects, like so:

select doc->'address'->>'postcode' FROM justjson;
 ?column?
----------
 ES1 1ES
(1 row)

Though there is a shorter form where we can specify a path to the data we are after using #>> and an array, like this:

select doc#>>'{address,postcode}' FROM justjson;
 ?column?
----------
 ES1 1ES
(1 row)

By preserving the entire document, the JSON data type made it easy to work with exact copies of JSON documents and pass them on without loss. But with that exactness comes a cost, a loss of efficiency, and with that comes an inability to index. So although it's convenient to preserve and parse JSON documents, there was still plenty of room for improvement, and that's where JSONB comes in. With JSONB, the JSON document is turned into a hierarchy of key/value data pairs. All the white space is discarded, only the last value in a set of duplicate keys is used, and the order of keys is lost to the structure dictated by the hashes in which they are stored. If we make a JSONB version of the table we just created, insert some data and look at it:

>CREATE TABLE justjsonb ( id INTEGER, doc JSONB);
>INSERT INTO justjsonb VALUES ( 1, '{""name"":""fred"",""address"":{""line1"":""52 The Elms"",""line2"":""Elmstreet"",""postcode"":""ES1 1ES""}}');
>SELECT * FROM justjsonb;
 id |                                              doc
----+----------------------------------------------------------------------------------------------------
  1 | {""name"": ""fred"", ""address"": {""line1"": ""52 The Elms"", ""line2"": ""Elmstreet"", ""postcode"": ""ES1 1ES""}}
(1 row)

We can see that all the textyness of the data has gone away, replaced with the bare minimum required to represent the data held within the JSON document. This stripping down of data means the JSONB representation moves the parsing work to when the data is inserted, but relieves any later access to the data of the task of parsing it. Looked at as key/value pairs, the JSONB data type does look a bit like the PostgreSQL HSTORE extension. That's a data type for storing key/value pairs, but it is an extension, whereas JSONB (and JSON) are in the core; and HSTORE is only one level deep in terms of data structure, whereas JSON documents can have nested elements. Also, HSTORE stores only strings, while JSONB understands strings and the full range of JSON numbers. Indexing, indexing everywhere. You can't actually index a JSON data type in PostgreSQL. You can make an index for it using expression indexes, but that'll only cover you for whatever you can put in an expression. So, if we wanted to, we could do:

create index justjson_postcode on justjson ((doc->'address'->>'postcode'));

And the postcode, and nothing else, would be indexed. With JSONB, there's support for GIN indexes; a Generalized Inverted Index. That gives you another set of query operators to work with.
These are: @> (contains JSON), <@ (is contained by), ? (does this string exist as a key), ?| (do any of these strings exist as keys) and ?& (do all of these strings exist as keys). There are two kinds of GIN index you can create: the default one, called jsonb_ops, which supports all these operators, and an index using jsonb_path_ops, which only supports @>. The default index creates an index item for every key and value in the JSON, while jsonb_path_ops only creates a hash of the keys leading up to a value and the value itself; that's a lot more compact and faster to process than the more complex default. But the default does offer more operations, at the cost of consuming more space. After adding some data to our table, we can do a select looking for a particular postcode. If we have no GIN index in place and do a query:

explain select * from justjsonb where doc @> '{ ""address"": { ""postcode"":""HA36CC"" } }';
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on justjsonb  (cost=0.00..3171.14 rows=100 width=123)
   Filter: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
(2 rows)

We can see that it will sequentially scan the table. Now, if we create a default JSON GIN index, we can see the difference it makes:

> create index justjsonb_gin on justjsonb using gin (doc);
> explain select * from justjsonb where doc @> '{ ""address"": { ""postcode"":""HA36CC"" } }';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on justjsonb  (cost=40.78..367.62 rows=100 width=123)
   Recheck Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
   ->  Bitmap Index Scan on justjsonb_gin  (cost=0.00..40.75 rows=100 width=0)
         Index Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
(4 rows)

It's a lot more efficient searching, as you can tell by the lower cost. But the hidden cost is in the size of the index. In this case it's 41% of the size of the data. Let's drop that index and repeat the process with a jsonb_path_ops GIN index:

> create index justjsonb_gin on justjsonb using gin (doc jsonb_path_ops);
> explain select * from justjsonb where doc @> '{ ""address"": { ""postcode"":""HA36CC"" } }';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on justjsonb  (cost=16.78..343.62 rows=100 width=123)
   Recheck Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
   ->  Bitmap Index Scan on justjsonb_gin  (cost=0.00..16.75 rows=100 width=0)
         Index Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
(4 rows)

The total cost is slightly lower, and typically the index size should be a lot smaller. It's going to be the classic task of balancing speed and size for indexes. But either way it's far more efficient than sequentially scanning. So, is PostgreSQL your next JSON database? If you update your JSON documents in place, the answer is no. What PostgreSQL is very good at is storing and retrieving JSON documents and their fields. But even though you can individually address the various fields within the JSON document, you can't update a single field. Well, actually you can, but only by extracting the entire JSON document, applying the new values and writing it back, letting the JSON parser sort out the duplicates. It's likely that you aren't going to want to rely on that. If your active data sits in the relational schema comfortably and the JSON content is a cohort to that data, then you should be fine with PostgreSQL and its much more efficient JSONB representation and indexing capabilities.
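To make that read-modify-write workaround concrete, here is a minimal sketch from application code. It assumes the node-postgres ('pg') module and the justjsonb table above, and is only meant to illustrate the round trip, not to be a recommended pattern:

const { Client } = require('pg');

async function changePostcode(id, newPostcode) {
  const client = new Client();   // connection details come from the usual PG* environment variables
  await client.connect();

  // 1. Pull the whole document out of the table...
  const { rows } = await client.query('SELECT doc FROM justjsonb WHERE id = $1', [id]);
  const doc = rows[0].doc;       // jsonb columns come back as parsed JavaScript objects

  // 2. ...change the one field we care about in application code...
  doc.address.postcode = newPostcode;

  // 3. ...and write the whole document back again.
  await client.query('UPDATE justjsonb SET doc = $1 WHERE id = $2', [doc, id]);
  await client.end();
}

The whole document travels over the wire twice, which is exactly why you may not want to rely on this for documents that change often.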
If, though, your data model is that of a collection of mutable documents, then you probably want to look at a database engineered primarily around JSON documents, like MongoDB or RethinkDB.","With the most recent version of PostgreSQL gaining ever more JSON capabilities, we've been asked if PostgreSQL could replace MongoDB as a JSON database.",Is PostgreSQL Your Next JSON Database?,Live,178 485,"POWER PROTOTYPING WITH MONGODB AND NODE-RED Published Nov 23, 2016. Do you want to be able to quickly get your database backend fronted by a web service? Node-RED and MongoDB can be a powerful ally in your strategy, and we'll show you how. Whether you're just getting started with a small toy project or about to embark on ""The Next Big Thing""®, being able to quickly set up a backend to your application can be the key to getting your project off the ground. With the widespread use of JSON as the de facto serialization format for data structures on the web, JSON document databases such as MongoDB are excellent platforms for prototyping new applications. However, exposing MongoDB directly to the client side of your application is difficult to manage, hard to keep efficient, and pushes logic into your client applications. In this article, we'll use Node-RED and MongoDB to build a minimal RESTful API for a photographer's portfolio website.
GETTING STARTED
ACCESSING A MONGODB INSTANCE
You should first spin up a Compose MongoDB database with SSL enabled or start your own local instance of MongoDB. Starting a new deployment on Compose is the easiest way to get started.
INSTALLING NODE-RED
You should also have access to a running installation of Node-RED. If you already have NodeJS on your local machine, you can use the Node Package Manager from a terminal to install Node-RED: npm install node-red. If you don't have NodeJS yet, you can download the installer for your platform directly from the NodeJS website.
INSTALL THE MONGODB2 NODE
Node-RED is a flow-based programming (FBP) environment, so connecting to services requires access to a ""node"" that provides the services you need. In this article, we'll connect to MongoDB using the node-red-node-mongodb2 package. You can install it by selecting the Manage Palette option from the main menu, searching for the mongodb2 node and clicking install.
CONNECTING TO MONGODB USING NODE-RED
To connect to MongoDB, you must first configure the mongodb2 node. Start by dragging the mongodb2 node onto the Node-RED canvas. The node initially has a red error indicator letting you know that the node needs to be configured. Double-click on the node to open the configuration editor. In the top section labelled ""server"", ensure that ""Add new mongodb2..."" is showing in the drop-down menu and click on the ""pencil"" icon to add a new server configuration. The configuration section has all of the information Node-RED needs to connect to MongoDB. You can find the connection string in your Compose console by clicking on your database name and clicking the Admin tab. Once you've configured a MongoDB node, it will be available for every subsequent MongoDB node you create by selecting it from the drop-down menu.
RETRIEVING AND QUERYING RECORDS (HTTP GET)
We're now ready to start adding the HTTP endpoints to our RESTful portfolio API. All of our endpoints will use the same route location but different HTTP methods to GET, POST, PUT, and DELETE items in the portfolio. We'll start with the GET endpoint.
To add an HTTP endpoint, we'll use the HTTP input node which comes installed by default in all new instances of Node-RED. In the future we can also add other interfaces such as WebSockets and RabbitMQ, but for now we'll stick with HTTP. To add a new HTTP endpoint, drag the HTTP input node onto the canvas. Double-click the node to open the configuration panel and add the URL and method you prefer (in this case, we'll start with the GET HTTP method). Since we're working with a data entity called ""project"", we'll make each of our endpoints available at the /projects URL. We'll also store each of these projects in a database collection called ""projects"". Double-click on the mongodb2 node and type projects in the collection field. Then select the find.toArray operation from the operation drop-down and click done . Next we'll wire together our mongodb2 node and HTTP input. This can be done by clicking on the out port on the right of the HTTP input node and dragging a wire to the in port on the left of the mongodb2 node as shown below. We'll also need to drag an HTTP output node onto the canvas to ensure that our HTTP client receives a response. Finally, click deploy to publish the flow and make the endpoints active. If you're running Node-RED locally, You can now access your endpoints at http://localhost/projects . Any query string parameters you pass to the URL will be available in the msg.payload object. Since the find method in the mongodb2 node also reads in the msg.payload object, we can use this wiring to automatically send all parameters passed into the HTTP input node to the mongodb2 node. For example, to search for a project with a title field matching “test” in the MongoDB shell, the following MongoDB query would look like: db.projects.find({ ""title"": ""test"" }); The query also could be executed using the following CURL command: curl -X GET localhost/projects?title=test CREATING THE REMAINING RESTFUL ENDPOINTS The other endpoints follow a similar structure: they start with an HTTP input node, send parameters into the mongodb2 node configured with the desired operation, and send the results as a response back to the user. In this next section, we'll cover the Create , Update , and Delete operations. CREATE A NEW PROJECT (HTTP POST) To create a new project in MongoDB, we'll send an HTTP POST request to the /projects endpoint. This is similar to what we did above with the GET request, except we have to extract project data from the POST request's body rather than the query string parameters. The HTTP input node does not send form body data in the msg.payload object, however we can access form body data directly using the msg.req object. The msg.req object contains the underlying HTTP request from ExpressJS , so POST form data can be found in the msg.req.body object. We can copy the msg.req.body over to the msg.payload by adding a function node in our flows. We'll start by copying the GET flow and pasting the copy onto the canvas. Then we'll modify the HTTP input node to use the POST method instead of the GET method. Now we’ll drag a function node onto the canvas and place it between mongodb2 and the HTTP input node. Then, we'll add the following code to the function node: msg.payload = msg.req.body; return msg; Finally, configure mongodb2 to use the projects collection and the insert operation and click Deploy . 
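If you want a little guarding against malformed submissions, the function node is a reasonable place for it. The following is only a sketch; it assumes the function node is configured with two outputs, with the second output wired directly to the HTTP response node so invalid requests skip MongoDB entirely:

// Output 1 goes on to the mongodb2 insert, output 2 goes straight back to the client.
msg.payload = msg.req.body;
if (!msg.payload || !msg.payload.title) {
    msg.statusCode = 400;                      // the HTTP response node returns this status code
    msg.payload = { error: 'title is required' };
    return [null, msg];                        // send only to the HTTP response output
}
return [msg, null];                            // valid: send only to the mongodb2 output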
You can use the following CURL command to create a new project with a title field and a value of ""test"": $ curl -X POST -H ""Content-Type: application/json"" -d '{""title"":""test""}' localhost/projects { ""_id"": ObjectID(""2fae32498ac2b113ca241543bfcaef""), ""title"": ""test""} UPDATE AN EXISTING DOCUMENT (HTTP PUT) Continuing on with RESTful convention, we'll use the HTTP PUT method to update an existing project. Let's copy the nodes we created for the POST method and modify the HTTP node to use PUT and change the mongodb2 operation to update . Since the document to update will be passed in through the request body, we'll keep the copied function node as it is. The PUT command includes the ID in an object in the body of the request, along with the fields that you want to update. The following CURL command will update an existing project: $ curl -X PUT -H ""Content-Type: application/json"" -d '{""_id"": ""2fae32498ac2b113ca241543bfcaef"", ""title"":""not test""}' localhost/projects { ""_id"": ObjectID(""2fae32498ac2b113ca241543bfcaef""), ""title"": ""not test""} DELETING AN EXISTING DOCUMENT (HTTP DELETE) The last of the CRUD operations we need to implement is DELETE . We'll use the HTTP DELETE method to do this. HTTP DELETE is similar to HTTP GET in that it does not send a form body along with the request. To send the ID of the record to be deleted we'll encode it in the URL like this: /projects/id_to_delete . Copy the flow we created for PUT in the previous section and paste the copy onto the canvas. Then, modify the Method field in the HTTP input node by clicking on the node and selecting the DELETE method from the drop-down menu. In the URL field we can insert /project/:id which adds the URL parameter id . We'll exploit the msg.req object again, this time to get the URL parameters from the msg.req.params object. To do this, we’ll add a function node to move the msg.req.params object over to the msg.payload by double-clicking on the function node and adding the following code to the editor: msg.payload = msg.req.params; return msg; Finally, we’ll update the mongodb2 node's operation field to deleteOne . You can delete an item with an ID of ""2fae32498ac2b113ca241543bfcaef"" by using the following CURL command: $ curl -X DELETE localhost/projects/2fae32498ac2b113ca241543bfcaef { ""_id"": ObjectID(""2fae32498ac2b113ca241543bfcaef"")} WRAP-UP MongoDB, with its schema-less architecture, and Node-RED, with its flow-based programming model, make a powerful rapid-prototyping duo. Node-RED also makes it possible to expand out the functionality of our minimal API as much as we want, thanks to its flexible programming model and robust community of third-party nodes. In the next installment, we'll move your API out of the prototype phase by adding authentication to your exposed endpoints using JSON Web Token . --------------------------------------------------------------------------------","In this article, we'll use Node-RED and MongoDB to build a minimal RESTful API for a photographer's portfolio website. ",Power Prototyping with MongoDB and Node-RED,Live,179 486,"OFFLINE CAMP CALIFORNIA -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Maureen McElaney 11/17/16Maureen McElaney Maureen McElaney is a Developer Advocate at IBM Cloud Data Services. 
Prior to joining the team, she worked as a QA Engineer at Dealer.com and is passionate about building tools that increase developer productivity and joy. She is an experienced community builder. In 2013 she founded the Burlington, Vermont chapter… Learn More Recent Posts * Offline Camp California A recap of our second-ever Offline Camp, and how to get involved in the Offline… * Girl Develop It Summit Recap IBM Cloud Data services proudly sponsored the 2016 Girl Develop It Leadership Summit. One of… * IBM Went Camping With #OfflineFirst In June 2016, members of the Offline First community gathered for a retreat at a… CAMARADERIE, OFFLINE FIRST & TARANTULAS Offline Camp is a gathering of folks from the Offline First community, who come together to share projects, best practices, and hack on offline-first problems over a long weekend away from it all. The Offline First movement involves people who build Progressive Web Apps , native apps, desktop apps, IoT, and even data scientists! After co-organizing the first Offline Camp in New York this past June, I was excited to attend the California event as a simple camper. Hayride brought us some very good views @OfflineCamp pic.twitter.com/hzdMEFhlqR — Steve Trevathan (@strevat) November 6, 2016 Two of Offline Camp’s co-organizers, Bradley Holt and Gregor Martynus , held a session on the future of the Offline First community where they took a first crack at a community logo, which was met with more ¯\_(ツ)_/¯ s. You can see a full list of topics that arose at camp on the Offline Camp medium account . WiFi ? LiFi ? ¯_☁️_/¯ @OfflineCamp #OfflineFirst pic.twitter.com/A5bhlGMdcT — Luis Montes (@monteslu) November 7, 2016 THE CAMPERS The campers are what truly sets Offline Camp apart from your run-of-the-mill tech conference. All the sessions are proposed, voted upon, and decided by the people in attendance. The organizing team continuously works hard to connect with a diverse audience of people who are doing amazing things in the Offline First space. The organizers promote an inclusive environment at camp via the camp Code of Conduct . These rules are important because the campers are the ones who decide what happens at camp. The people who attend truly set the tone for what kind of projects the Offline First community will tackle afterward! This installment of Offline Camp occurred in Santa Margarita, California There were 21 amazing people at camp, but I thought I’d highlight a few so that you could get a feel for the type of people who attend an event like this. MAX I met cat cafe connoisseur Max Ogden , former Fellow at Code for America and author of JS for Cats, who also maintains Dat Data Project , which shares datasets over peer to peer networks. Had such an awesome time at @OfflineCamp , made new lifelong friends, got excited about the future of the web. And saw this bunny pic.twitter.com/FUzOtGPCPB — maxwell ogden (@denormalize) November 7, 2016 MACHIKO At camp, you could have gone on an Offline First mapping hike with Machiko Yasuda , who runs multiple tech meetups in Los Angeles including Fullstack and MaptimeLA . She also told me about an open source tool she built for mentoring new developers and leveling up existing ones, called exercism.io . on the drive home from @offlinecamp & reflecting on what i learned this weekend, i got stuck on the 10 behind a car with this plate: READ ME — machiko / 安田万智子 (@machikoyasuda) November 8, 2016 NOLAN Around the campfire, we played the role-playing game Werewolf . 
There I sat across from Nolan Lawson , who maintains PouchDB and works at Microsoft Edge . In the game he poisoned me because he thought I was the werewolf — alas, I was but an innocent villager! — but in real life I learned a bunch about the development environment for Microsoft users and made plans for our Offline First panel at SXSW . By the way, we both hope that if you’re planning to attend SXSW next year, that you’ll come see our panel! Learn more about it here . Day 2 Design Patterns sesh at @OfflineCamp . ""What percentage of PWAs were written by @jaffathecake ?"" @nolanlawson #offlinefirst pic.twitter.com/ysVDMpL22H — Mo_Mack (@Mo_Mack) November 6, 2016 TRAIL’S END It was hard to leave all the friends we made at camp. Campers have a lot of amazing things to say about their experience: I just published “My biggest takeaway from the second Offline Camp in Santa Margarita, CA” https://t.co/mjfHPExh6C — Disruption disruptor (@jessebeach) November 8, 2016 My jacket smells like a campfire and I'm instantly reminded how awesome @OfflineCamp was! — John Kleinschmidt (@jkleinsc) November 10, 2016 Follow the Offline Camp Medium account now, as the majority of the campers have signed up to contribute recap posts from the sessions they participated in. Those articles will be published continuously in the coming weeks. Sign up for the Offline First Reader to stay on top of news and events happening within the community. If you’re interested in contributing now, join the Offline First Slack team and add to the discussion. Stay tuned for more Offline Camp events in 2017 — perhaps we’ll even go to Europe? ¯\_(ツ)_/¯ Better question: Are there tarantulas in Europe? Met some (actually quite friendly) neighbors this morning. pic.twitter.com/Ia3iYG0W8M — Offline Camp (@OfflineCamp) November 7, 2016 ","Offline Camp is a gathering of the Offline First community, coming together to hack on offline first problems over a long weekend away from it all.",Offline First at Offline Camp California,Live,180 493,"Homepage PUBLISHED IN AUTONOMOUS AGENTS — #AI Follow Sign in / Sign up 33 Preetham V V Blocked Unblock Follow Following #AI & #MachineLearning enthusiast. Author: Java Web Services / Internet Security & Firewalls. VP, Brand Sciences & Products @inMobi #UltraRunner 3 days ago 11 min read -------------------------------------------------------------------------------- BAYESIAN REGULARIZATION FOR #NEURALNETWORKS image creditIf you are a Science or Math nerd, there is no way in hell you would have not heard of Bayes’s Theorem . It’s pervasive and quite a powerful inference model to understand and model anything from growth of Cancer cells, to obstacle detection in Autonomous Robots, to fixing the probability of a collision course of a Asteroid towards Earth. The simplicity of the Model is where it draws its power from. Specifically in the Artificial Intelligence community, you cannot do away with Bayesian Inference and Reasoning for optimizing your models. In the past post titled ‘ Emergence of the Artificial Neural Network ” I had mentioned that ANNs are emerging prominently among all other models due to its ability to accommodate techniques and theories from all other AI approaches quite well. I did mention that a full Bayesian Model can be used for interpreting weight decay. In this post, I intend to showcase the Bayesian techniques for Regularizing Neural Networks. This concept is also called Bayesian Regularized Artificial Neural Networks or BRANN for short. 
--------------------------------------------------------------------------------
WHAT IS BAYES'S THEOREM?
(Feel free to skip this section if you already understand Bayes's Theorem.) Bayes's Theorem is fundamentally based on the concept of the “validity of Beliefs”. Reverend Thomas Bayes was a Presbyterian minister and a Mathematician who pondered much about developing a proof of the existence of God. He came up with the Theorem in the 18th century (it was later refined by Pierre-Simon Laplace) to fix or establish the validity of 'existing' or 'previous' Beliefs in the face of the best available 'new' evidence. Think of it as an equation to correct prior beliefs based on new evidence. One of the popular examples used to explain Bayes's Theorem is detecting whether a patient has a certain disease or not. The key concepts in the Theorem are as follows:
Event: An event is a fact. The patient truly having the disease is an event. Truly NOT having the disease is also an event.
Test: A test is a mechanism to detect whether a patient has the disease (or a test devised to prove that a patient does not have the disease; note that they are not the same tests).
Subject: A patient is a subject who may or may not have the disease. A test needs to be devised for the subject to detect the presence of the disease, or a test devised to prove that the disease does not exist.
Test Reliability: A test devised to detect the disease may not be 100% reliable; it may not detect the disease every time. When the test fails to recognize the disease in a subject who truly has the disease, we call that a false negative. When the test on a subject who truly does not have the disease shows that the subject does have it, we call that a false positive.
Test Probability: This is the probability of a test detecting the event (disease) given a subject (patient). This does not account for the Test Reliability.
Event Probability (Posterior Probability): This is the “corrected” test probability of detecting the event given a subject, obtained by considering the reliability of the devised test.
Belief (Prior Probability): A belief, also called a prior probability (or prior, in short), is the subjective assumption that the disease exists in a patient (based on symptoms or other subjective observations) prior to conducting the test. This is the most important concept in Bayes's Theorem. You need to start with the priors (or Beliefs) before you make corrections to that belief.
The following is the equation which accommodates the stated concepts:
P(Ai|B) = P(B|Ai) × P(Ai) / [ P(B|A1) × P(A1) + P(B|A2) × P(A2) ]
In the equation,
 * A1, A2, ... are the events. A1 and A2 are mutually exclusive and collectively exhaustive. Let A1 mean that the disease is present in the subject and A2 mean that the disease is absent.
 * Let Ai refer to either one of the events A1 or A2.
 * B is a test devised to detect the disease (alternatively, it can also be a test that is devised to prove that the disease does not exist in the subject; again, note that these are two completely different tests).
 * Let us say there is a population of people (in a random city) where there is a prior belief (based on some observation, which may or may not be subjective) that 5% of the population “has the disease”. So, for any given subject in the population, the prior probability P(A1), “has the disease”, is 5% and the prior probability P(A2), “does not have the disease”, is 95%.
* Let's say the test 'B', which is devised to “detect” the presence of the disease, has a reliability of 90% (in other words, it detects the presence of the disease in a patient who truly has the disease in only 9 out of 10 tests). Written mathematically, the probability of the test detecting the disease when the disease is truly present is P(B|A1) = 0.9.
 * Unfortunately, the test 'B' also has a flaw which sometimes shows that the patient has the disease even when the disease is truly not present. Let us say that 2 out of 10 patients who really do not have the disease get falsely detected as having it. Mathematically, P(B|A2) = 0.2.
 * Now, if you randomly select a subject from the population and conduct the test on that subject, AND the test result shows positive (the patient does have the disease), can we calculate the “event probability” (or the Posterior Probability) of the person truly having the disease?
 * Mathematically: calculate P(A1|B), which can be read as the probability of A1 (presence of disease) given B (the test result being positive).
So let's assign the values for each probability:
 * Prior Probability of the person having the disease = P(A1) = 0.05
 * Prior Probability of the person NOT having the disease = P(A2) = 0.95
 * Conditional Probability that the test shows positive, given that the person truly does have the disease = P(B|A1) = 0.9
 * Conditional Probability that the test shows positive, even if the person truly does NOT have the disease = P(B|A2) = 0.2
 * What is the “event probability” that a randomly selected person from the population, on whom the test was performed and whose test result shows positive, truly has the disease? That is, what is P(truly has the disease given the test is positive) = P(A1|B)?
The posterior probability can be calculated based on Bayes's Theorem as follows:
P(A1|B) = P(B|A1) × P(A1) / [ P(B|A1) × P(A1) + P(B|A2) × P(A2) ] = (0.9 × 0.05) / (0.9 × 0.05 + 0.2 × 0.95) = 0.045 / 0.235 ≈ 0.19
So the posterior probability of the person truly having the disease, given that the test result is positive, is only 19%!! Note the stark difference in the corrected probability even though the test is 90% accurate. Why do you think this is the case? The answer lies in the 'priors'. Note that the “belief” that only 5% of the population may have the disease is the strong reason for a 19% posterior probability. It's easy to prove. Change your prior belief (all else being equal) from 5% to, let's say, 30%. Then you get the following result:
P(A1|B) = (0.9 × 0.30) / (0.9 × 0.30 + 0.2 × 0.70) = 0.27 / 0.41 ≈ 0.66
Note that the posterior probability for the same test with a higher prior jumped significantly, to about 66%. Hence, all evidence and tests being equal, Bayes's Theorem is strongly influenced by priors. If you start with a very low prior, even in the face of strong evidence the posterior probability will be closer to the prior (lower). A prior is not something you randomly make up. It should be based on observations, even if subjective. There should be some emphasis on why someone holds on to a belief before assigning a percentage. If you believe that God does not exist (your prior), then a strong test/evidence/hypothesis which positively detects the possible existence of God moves your prior belief only a little bit, no matter how accurate the tests are.
--------------------------------------------------------------------------------
WHAT DOES BAYESIAN INFERENCE MEAN FOR NEURAL NETS?
Now that we understand Bayes's Theorem, let's see how it is applicable to regularizing Neural Networks. In the past few posts, we learnt about how Neural Nets overfit data and also techniques to regularize the Network towards reducing bias and variance.
(A high-variance state is a state in which the network is overfitted.) One of the techniques to reduce variance and improve generalization is to apply weight decay and weight constraints. If we manage to trim the growing weights on a Neural Network to some meaningful degree, then we can control the variance of the network and avoid overfitting. So let's focus on the probability distribution of the weight vector given a set of training data. First, let's look again at what happens in a Neural Network.
 * We initialize the weight vector of a Neural Network to some optimal initial state.
 * We have a set of training data that will be run through the network continuously, which shall change the weight vector to meet a stated output during training.
 * Every time we start with a new input (from the training data set) to train, we have a prior distribution of the weight vector and a probability of an output for the given input based on the weight vector.
 * Based on the new output, a cost function calculates the error deviations.
 * Back-propagation is used to fix the prior weights to reduce error.
 * We see a posterior distribution of the weight vector for the given training data.
The question we ask here is twofold:
 1. Can we use Bayesian Inference in such a way that the weight distribution is made optimal to learn the correct function that relevantly maps the input to the output?
 2. Can we ensure that the network is NOT overfitting?
To recap, mathematically, if 't' is an expected target output and 'y' is the output of the Neural Net, then the local error is nothing but E = (t − y). The global error, meanwhile, can be a mean squared error (MSE):
MSE = (1/N) × Σ_c (t_c − y_c)²
or an error sum of squares (ESS):
ESS = Σ_c (t_c − y_c)²
 * Note that the dominant part of each equation is the squared error.
 * We are trying to find the weight vector that minimizes the squared errors.
 * In likelihood terms, we can also state that we want to find the weight vector that maximizes the log probability density of a correct answer.
 * Minimizing the squared error is the same as maximizing the log probability density of the correct answer. This is called Maximum Likelihood Estimation.
MAXIMUM LIKELIHOOD LEARNING
First, let us look at Maximum Likelihood learning before we apply Bayesian Inference. To do so, let's assume that we are applying Gaussian Noise to the output of the Neural Network to regularize the network. In the previous post titled “Mathematical foundation for Noise, Bias and Variance”, we used Noise as a regularizer on the input. Note that we can apply Noise even to the output. Again, mathematically:
y_c = f(x_c, w)
In other words, let the output for a given training case, y_c, be some function of an input x_c and the weight vector w. Now, assuming that we are applying Gaussian Noise to the output, we get:
p(t_c | y_c) = 1/√(2πσ²) × exp( −(t_c − y_c)² / 2σ² )
We are simply stating that the probability density of the target value, given the output after applying Gaussian Noise, is the Gaussian distribution centered around the output. Let's use the negative log probability as the cost function, as we want to minimize the cost. So we get:
−log p(t_c | y_c) = (t_c − y_c)² / 2σ² + ½ log(2πσ²)
When we are working on multiple training cases 'c' in the dataset 'D', we intend to maximize the product of the probabilities of the output of every training case 'c' in the dataset 'D' being close to the target. Since the output error for every training case is NOT dependent on the previous training case, we can mathematically state this as:
P(D | w) = Π_c p(t_c | f(x_c, w))
In other words, the probability of observed data given a weight vector 'w' is the product of all probabilities of training case given the output.
(Note that the output y_c is a function of the inputs x_c and the weight vector 'w'.) But instead of the product of the probabilities of the target values given the outputs, we stated that we can work in the log domain by taking negative log probabilities. So we can instead work on maximizing the sum of log probabilities, as shown:
log P(D | w) = Σ_c log p(t_c | x_c, w) = −Σ_c (t_c − y_c)² / 2σ² + constant
The above is the log probability of the observed data given a weight vector, and maximizing it maximizes the log probability density of the output being close to the target value (assuming we are adding Gaussian noise to the output).
BAYESIAN INFERENCE AND MAXIMUM A POSTERIORI (MAP)
We worked out an equation for Maximum Likelihood learning, but can we use Bayesian Inference to regularize the Maximum Likelihood? Indeed, the solution seems to lie in applying Maximum A Posteriori, or MAP for short. MAP tries to find the mode of the posterior distribution by employing Bayes's Theorem. For Neural Networks, this can be written as:
P(w | D) = P(D | w) × P(w) / P(D)
Where,
 * P(w|D) is the posterior probability of the weight vector 'w' given the training data set D.
 * P(w) is the prior probability of the weight vector.
 * P(D|w) is the probability of the observed data given weight vector 'w'.
 * And the denominator, P(D), is an integral over all possible weight vectors.
We can convert the above equation to a cost function by again applying the negative log likelihood, as follows:
Cost = −log P(w | D) = −log P(D | w) − log P(w) + log P(D)
Here,
 * P(D) is an integral over all possible weights, and hence log P(D) converts to some constant.
 * From Maximum Likelihood, we already learnt the equation for log P(D | w).
Let's look at log P(w), which is the log probability of the prior weights. This is based on how we initialize the weights. In the post titled “Is Optimizing your Neural Network a Dark Art?” we learnt that the best way to initialize the weights is to apply a zero-mean Gaussian. So, mathematically:
−log P(w) = Σ_i w_i² / 2σ_w² + constant
So, the Bayesian Inference for MAP is as follows:
Cost = 1/2σ² × Σ_c (t_c − y_c)² + 1/2σ_w² × Σ_i w_i² + constant
Again, notice the similarity of the loss function to L2 regularization. Also note that we started with a randomly initialized zero-mean-Gaussian weight vector for MAP and then started working towards fixing it to improve P(w|D). This has the same side effect as L2 regularizers, which can get stuck in local minima. We take the MAP approach because a full Bayesian approach over all possible weights is computationally intensive and is not tractable. There are tricks with MCMC which can help approximate an unbiased sample from the true posterior over the entire weights. I may cover this later in another post. Maybe now, you are equipped to validate the belief in God…","If you are a Science or Math nerd, there is no way in hell you would have not heard of Bayes’s Theorem. It’s pervasive…",Bayesian Regularization for #NeuralNetworks – Autonomous Agents — #AI,Live,181 494,"
DATA SCIENCE EXPERIENCE: OVERVIEW OF RSTUDIO IDE
developerWorks TV
Published on Oct 3, 2017
Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
CATEGORY: Science & Technology
LICENSE: Standard YouTube License
",This video is a quick tour of the RStudio Integrated Development Environment inside IBM Data Science Experience (DSX). ,Overview of RStudio IDE in DSX,Live,182 496,"DATA SCIENCE EXPERT INTERVIEW: HOLDEN KARAU
July 28, 2016 | 6:20
OVERVIEW
James Kobielus, data science evangelist at IBM, interviews Holden Karau, principal software engineer of big data at IBM and coauthor of Learning Spark. To ensure data science success, you need to provide data scientists with an environment that is open, engaging and collaborative. To explore how your data scientists can access all the open functionality and expertise they’ll need for critical projects, join the new Data Science Experience. To learn how the next generation of open analytics will boost data scientist productivity, click here to register for IBM DataFirst launch event taking place on Tuesday September 27 in New York, or, if you can’t make it in person, click here to register for the livestream to the event.
Topics: Analytics, Big Data Technology, Big Data Use Cases, Data Scientists, Hadoop
Tags: data science, Spark, R, Hadoop, predictive analytics
","James Kobielus, data science evangelist at IBM, interviews Holden Karau, principal software engineer of big data at IBM and coauthor of Learning Spark.",Data science expert interview: Holden Karau,Live,183 503,"
IMPROVING THE ROI OF BIG DATA AND ANALYTICS THROUGH LEVERAGING NEW SOURCES OF DATA Posted on April 21, 2017. Contributed by: Bart Baesens. This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let's get in touch!
Big Data and Analytics are all around these days. Most companies already have their first analytical models in production and are thinking about further boosting their performance. Far too often, they focus on the analytical techniques rather than on the key ingredient: data! We believe the best way to boost the performance and ROI of an analytical model is by investing in new sources of data which can help to further unravel complex customer behavior and improve key analytical insights. In what follows, we briefly explore various types of data sources that could be worthwhile pursuing in order to squeeze more economic value out of your analytical models.
A first option concerns the exploration of network data by carefully studying relationships between customers. These relationships can be explicit or implicit. Examples of explicit networks are calls between customers, shared board members between firms, and social connections (e.g., family, friends). Explicit networks can be readily distilled from underlying data sources (e.g., call logs) and their key characteristics can then be summarized using featurization procedures, resulting in new characteristics which can be added to the modeling data set. In our previous research (Verbeke et al., 2014; Van Vlasselaer et al., 2017), we found network data to be highly predictive for both customer churn prediction and fraud detection.
Implicit networks, or pseudo networks, are a lot more challenging to define and featurize. Martens and Provost (2016) built a network of customers where links were defined based upon which customers transferred money to the same entities (e.g., retailers), using data from a major bank. When combined with non-network data, this innovative way of defining a network based upon similarity instead of explicit social connections gave a better lift and generated more profit for almost any targeting budget. In another, award-winning study they built a geosimilarity network among users based upon location-visitation data in a mobile environment (Provost et al., 2015). More specifically, two devices are considered similar, and thus connected, when they share at least one visited location. They are more similar if they have more shared locations and as these are visited by fewer people. This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, or to improve online interactions by selecting users with similar tastes. Both of these examples clearly illustrate the potential of implicit networks as an important data source.
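To make the geosimilarity idea a little more concrete, here is a minimal sketch (not from the cited study) of how such an implicit network could be built and featurized in Python. It assumes the networkx library; the toy visits table and variable names are invented, and unlike the original research it simply counts shared locations rather than down-weighting popular ones.
import networkx as nx

# toy location-visitation data: device -> set of visited locations (invented)
visits = {
    'device_a': {'loc1', 'loc2'},
    'device_b': {'loc2', 'loc3'},
    'device_c': {'loc4'},
}

G = nx.Graph()
G.add_nodes_from(visits)

# connect two devices when they share at least one visited location;
# store the number of shared locations as a simple edge weight
for d1 in visits:
    for d2 in visits:
        if d1 < d2:
            shared = visits[d1] & visits[d2]
            if shared:
                G.add_edge(d1, d2, weight=len(shared))

# featurization: the degree of each node becomes a new modeling characteristic
degree_feature = dict(G.degree())
print(degree_feature)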
A key challenge here is to creatively think about how to define these networks based upon the goal of the analysis. Data are often branded as the new oil. Hence, data pooling firms capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results thereof. Popular examples are Equifax, Experian, Moody’s, S&P, Nielsen, and Dun & Bradstreet, among many others. These firms consolidate publically available data, data scraped from websites or social media, survey data, and data contributed by other firms. By doing so, they can perform all kinds of aggregated analyses (e.g., geographical distribution of credit default rates in a country, average churn rates across industry sectors), build generic scores (e.g., the FICO in the US) and sell these to interested parties. Because of the low-entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms (e.g., SMEs) to take their first steps in analytics. Besides commercially available external data, open data can also be a valuable source of external information. Examples are industry and government data, weather data, news data, and search data (e.g., Google Trends). Both commercial and open external data can significantly boost the performance and thus economic return of an analytical model. Macro-economic data are another valuable source of information. Many analytical models are developed using a snapshot of data at a particular moment in time. This is obviously conditional on the external environment at that moment. Macro-economic up- or down-turns can have a significant impact on the performance and thus ROI of the analytical model. The state of the macro-economy can be summarized using measures such as gross domestic product (GDP), inflation and unemployment. Incorporating these effects will allow us to further improve the performance of analytical models and make them more robust against external influences. Textual data are also an interesting type of data to consider. Examples are product reviews, Facebook posts, Twitter tweets, book recommendations, complaints, and legislation. Textual data are difficult to process analytically since they are unstructured and cannot be directly represented into a matrix format. Moreover, these data depend upon the linguistic structure (e.g., type of language, relationship between words, negations, etc.) and are typically quite noisy data due to grammatical or spelling errors, synonyms and homographs. However, they can contain very relevant information for your analytical modeling exercise. Just as with network data (see above), it will be important to find ways to featurize text documents and combine it with your other structured data. A popular way of doing this is by using a document term matrix indicating what terms (similar to variables) appear and how frequently in which documents (similar to observations). It is clear that this matrix will be large and sparse. 
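As a hedged illustration of that featurization step (not taken from the article itself), the sketch below builds a small document term matrix with scikit-learn and reduces it to a handful of latent concepts, anticipating the clean-up and SVD steps discussed next. The example documents are invented and the variable names are only illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# invented example documents (e.g., short product reviews)
docs = [
    'great product, fast delivery',
    'the product broke after a week',
    'excellent item, would buy this product again',
]

# lowercasing and stop word removal mirror two of the clean-up activities below
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
dtm = vectorizer.fit_transform(docs)    # sparse matrix: documents x terms

print(dtm.shape, 'non-zero entries:', dtm.nnz, 'terms:', len(vectorizer.vocabulary_))

# reduce the sparse matrix to 2 'latent concepts' that can be added as features
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(dtm)
print(concepts)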
Dimension reduction will thus be very important, as the following activities illustrate:
* represent every term in lower case (e.g., PRODUCT, Product, product become product)
* remove terms which are uninformative, such as stop words and articles (e.g., the product, a product, this product become product)
* use synonym lists to map synonym terms to one single term (product, item, article become product)
* stem all terms to their root (products, product become product)
* remove terms that only occur in a single document
Even after the above activities have been performed, the number of dimensions may still be too big for practical analysis. Singular Value Decomposition (SVD) offers a more advanced way to do dimension reduction (Meyer, 2000). SVD works similarly to principal component analysis (PCA) and summarizes the document term matrix into a set of singular vectors (also called latent concepts) which are linear combinations of the original terms. These reduced dimensions can then be added as new features to your existing, structured data set.
Besides textual data, other types of unstructured data such as audio, images, videos, fingerprint, GPS, and RFID data can be considered as well. To successfully leverage these types of data in your analytical models, it is of key importance to carefully think about creative ways of featurizing them. When doing so, it is recommended that any accompanying metadata are taken into account; for example, not only the image itself may be relevant, but also who took it, where, and at what time. This information could be very useful for fraud detection.
To summarize, we strongly believe that the best way to boost the performance and ROI of your analytical models is by investing in data first! In this contribution, we gave some examples of alternative data sources which can contain valuable information about the behavior of your customers.
REFERENCES
* Martens D., Provost F., Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics, MIS Quarterly, Volume 40, Number 4, pp. 869-888, 2016.
* Meyer C.D., Matrix Analysis and Applied Linear Algebra, SIAM, Philadelphia, 2000.
* Provost F., Martens D., Murray A., Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial Design, Information Systems Research, Volume 26, Issue 2, pp. 243-265, 2015.
* Van Vlasselaer V., Eliassi-Rad T., Akoglu L., Snoeck M., Baesens B., GOTCHA! Network-based Fraud Detection for Security Fraud, Management Science, forthcoming, 2017.
* Verbeke W., Martens D., Baesens B., Social network analysis for customer churn prediction, Applied Soft Computing, Volume 14, pp. 341-446, 2014.
", We believe the best way to boost the performance and ROI of an analytical model is by investing in new sources of data which can help to further unravel complex customer behavior and improve key analytical insights.,Improving the ROI of Big Data and Analytics through Leveraging New Sources of Data,Live,184 508,"With Cloudant, building location-aware systems is within the reach of any web developer. This demo application uses HTML5 and JavaScript to record a device's GPS locations, and then save them—both on the device and to IBM Cloudant. To get started, sign up or sign in to Cloudant and then grab the code on Github. Coming Soon: Add a middle tier to manage users, with NodeJS.","With Cloudant, building location-aware systems is within the reach of any web developer. This demo application uses HTML5 and JavaScript to record a device's GPS locations, and then save them—both on the device and to IBM Cloudant.",Location Tracker,Live,185 514,"GEOSPATIAL QUERY WITH CLOUDANT SEARCH Raj R Singh / January 7, 2016
Geospatial querying is such a basic requirement for modern applications. Many apps are map-centric, like Yelp! or Hotels.com or retail store finders, which help users find places nearby. But other geospatial query use cases live deep under the covers of an app, like a ToDo list app that notifies you when you're near the place you can accomplish a task. This is a quick tutorial on how to use Cloudant Search to add geospatial query to your apps.
GEOSPATIAL QUERY OPTIONS IN CLOUDANT
First off, as a developer, you need to know that there are 2 different options for performing geospatial queries in Cloudant:
* Cloudant Geo offers the most flexible geospatial query options. You can query by radius, rectangle, and polygon, but you can't query by any other attributes of the database at the same time. (At least not today, but engineering elves are hard at work building this feature!)
* Cloudant Search only supports rectangle bounding box queries, but unlike Cloudant Geo, you can combine it with attribute and free text search. If you're searching for a doctor, seeing mechanics in search results gets in the way, so refining your geospatial search with additional attributes is a must in many cases. If your result set is small, it's easy to do that client-side, but if it gets big (for instance, if you're in a densely populated city) a simple geo index won't cut it, as you really want to include additional search requirements with your location data.
Cloudant Search is powered by Apache Lucene, the most popular open-source search library. By drawing on the speed and simplicity of Lucene, the Cloudant service provides a familiar way to add search to apps.
Cloudant Search lets you further enhance indexing and querying with:
* Ranked searching. Search results can be ordered by relevance or by custom sort fields.
* Powerful query types, including phrase queries, wildcard queries, proximity queries, fuzzy searches, range queries and more.
* Language-specific analyzers.
* Faceted search and filtering.
* Bookmarking. Paginate results in the style of popular Web search engines.
INDEXING BOSTON CRIME DATA FOR SEARCH
There's already a host of excellent resources on indexing and querying with Cloudant Search, so if you're not familiar with the basics, start here:
* Cloudant Learning Center: video on Search
* Formal API documentation on Cloudant Search
* Cloudant For Developers: Search Indexes
Once you're up-to-speed, we can have some fun with crime data! We'll use a sample of crimes in Boston, MA provided by the city government as open data here. We already have this data in Cloudant, and you can view a sample here, or replicate the database to your own Cloudant account. If you want to follow along while coding and don't already have a Cloudant account, sign up for a free trial here.
The first thing we need to do to the database is define our Search index. Here is the Javascript function for that:
function (doc) {
  if (doc.properties.main_crimecode && doc.geometry.coordinates[0] && doc.geometry.coordinates[1]) {
    index(""type"", doc.properties.main_crimecode, {""store"": true, ""facet"": true});
    index(""long"", doc.geometry.coordinates[0]);
    index(""lat"", doc.geometry.coordinates[1]);
  }
}
I save this to the crimes database in a design document called lucenegeoblog and name the index findcrimes (those 2 facts will be important next, when we write our queries). Note that I'm indexing 3 properties of the database, and indexing a document only if those properties exist.
* doc.properties.main_crimecode tells us what the crime was (or at least the main crime, since people could be doing more than one bad thing at the same time)
* doc.geometry.coordinates[0] is where the longitude value for the crime's location lives
* doc.geometry.coordinates[1] is where the latitude value for the crime's location lives
Now we're ready to play with the data…
QUERYING CRIMES
TERM SEARCH
Lucene offers a whole range of interesting ways to query text, including fuzzy matching, proximity search, numerical ranges, and more. Here, since the focus is on the geospatial aspects, we'll just do the most basic of text searches, barely flexing Lucene's muscles, but it's enough to illustrate the point.
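If you would rather run the queries that follow from code than from a browser, here is a minimal sketch (not part of the original post) that calls the same search endpoint with the Python requests library; the helper name find_crimes is mine, and the response fields used below (total_rows, rows, id, fields) are the ones shown in the results that follow.
import requests

SEARCH_URL = ('https://examples.cloudant.com/crimes/_design/'
              'lucenegeoblog/_search/findcrimes')

def find_crimes(query, **params):
    # Cloudant Search takes the Lucene query string in the q parameter
    params['q'] = query
    response = requests.get(SEARCH_URL, params=params)
    response.raise_for_status()
    return response.json()

result = find_crimes('type:Argue')
print(result['total_rows'])
for row in result['rows']:
    print(row['id'], row['fields'])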
Let’s justask for crimes involving an argument:https://examples.cloudant.com/crimes/_design/lucenegeoblog/_search/findcrimes?q=type:ArgueThis query returns 13 rows:{""total_rows"":13, ""bookmark"":""g1AAAAEWeJzLYWBgYMlgTmFQTElKzi9KdUhJMjTUy00tyixJTE_VS87JL01JzCvRy0styQEqZUpkSLL___9_VgaTmwNPqnMDUCzRFKRfAa7fEo_2JAcgmVQPM4H3rS3YBB00F5jgMSKPBUgyNAApoCn7wcYIioY-ABmjQYJHIMYcgBiD6h-jLADMN1fM"", ""rows"":[ {""id"":""79f14b64c57461584b152123e38a58ca"",""order"":[4.2708353996276855,0],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38ec546"",""order"":[4.2708353996276855,40],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38c4ce8"",""order"":[3.740839958190918,13],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3908811"",""order"":[3.740839958190918,38],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e39108e1"",""order"":[3.740839958190918,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b11d4"",""order"":[3.549445152282715,8],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b5c12"",""order"":[3.549445152282715,10],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e2803"",""order"":[3.549445152282715,31],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e7cbf"",""order"":[3.549445152282715,39],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3905861"",""order"":[3.549445152282715,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390f947"",""order"":[3.549445152282715,50],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390bc77"",""order"":[3.549445152282715,51],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3912dab"",""order"":[3.549445152282715,53],""fields"":{""type"":""Argue""}} ]}Which would look like this if plotted on a map:Now, say we want to organize the results by proximity to a local bar we thinkmay be a problem. We know the coordinates of this bar, so we can use a clever sort parameter to accomplish our goal in this new query:https://examples.cloudant.com/crimes/_design/lucenegeoblog/_search/findcrimes?q=type:Argue&sort=""""This returns the same 13 rows, but take a look at the id s. 
The order is now different.{""total_rows"":13, ""bookmark"":""g1AAAAEmeJzLYWBgYMlgTmFQTElKzi9KdUhJMjTSy00tyixJTE_VS87JL01JzCvRy0styQEqZUpkSLL___9_Fpjj5iA578XsvIjgROMskBkKcDMs8BiR5AAkk-qRTOF5cLvueLNbIm8WmktM8BiTxwIkGRqAFNCk_TCjOM-udxWQPpzIgG4UPk9BjDoAMQruKsHexVECphqJOllZAFqPX6Q"", ""rows"":[ {""id"":""79f14b64c57461584b152123e3908811"",""order"":[0.0,38],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38ec546"",""order"":[0.46176565188522095,40],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b5c12"",""order"":[0.9774288003583641,10],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e39108e1"",""order"":[1.399243473889131,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e2803"",""order"":[1.4297353780528468,31],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3912dab"",""order"":[1.674393221777318,53],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b11d4"",""order"":[1.7185707796811796,8],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390f947"",""order"":[2.1562546799337228,50],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38a58ca"",""order"":[3.225431956819621,0],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38c4ce8"",""order"":[3.6097936539303275,13],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e7cbf"",""order"":[3.7522872699576357,39],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3905861"",""order"":[4.388318450202213,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390bc77"",""order"":[6.405184200868535,51],""fields"":{""type"":""Argue""}} ]}Now we can pay more attention to the crimes at the top of the list, and notwaste time looking at crimes far from the bar. This doesn’t seem like a big dealwith 13 results, but if we were using the full crime database, which has almosthalf a million crimes, optimizations like this are crucial.Another way to restrict our search to a small area around the bar would be toadd a geospatial bounding box (or rectangular ‘fence’) to the query, limitingresponses to documents whose longitude falls between -71.08 and -71.04 and whoselatitude falls between 42.28 and 42.32. Let’s also throw an include_docs=true parameter in the query so we can see all the information in the document.https://examples.cloudant.com/crimes/_design/lucenegeoblog/_search/findcrimes?q=type:Argue AND long:[-71.08 TO -71.04] AND lat:[42.28 TO 42.32]&sort=""""&include_docs=trueI won’t reproduce the entire response here, but it contains only 7 rows. Itworked!You’ve glimpsed the power of combining basic geospatial queries with Lucene’sextraordinary text search capabilities. The possibilities are truly endless.Comment here to let us know how you use it, and you could be a future guest starhere on our blog.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: Apache Lucene / cloudant / geospatial Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
",A powerful way to fine-tune location search results. Combine basic geospatial queries with Lucene's extraordinary text search capabilities.,Geospatial query with Cloudant Search,Live,186 517,"COMPOSE AND RETHINKDB 2.3'S DRIVERS Apr 25, 2016
TL;DR: The latest RethinkDB drivers don't work with previous versions of RethinkDB. Take steps to ""pin"" your drivers to a compatible version.
RethinkDB recently released their latest excellent update to their database in the form of version 2.3, ""Fantasia"". There are quite a few improvements in the new version, such as user account support, a fold command, integrated SSL encryption and an official Windows version. You can read about them on RethinkDB's blog. We're working on incorporating support for those changes and releasing an updated RethinkDB.
What also happened was that RethinkDB took the opportunity to update the way clients and servers communicate, and that can lead to some problems for Node, Ruby, Python, Java and Go users. The issue here is about drivers. Using Node.js as an example of the issue, it'll present itself something like this in practice:
$ node index.js
ERROR: Received an unsupported protocol version. This port is for RethinkDB queries. Does your client driver version not match the server?
^
SyntaxError: Unexpected token E
    at Object.parse (native)
    at TLSSocket.handshake_callback (/Users/dj/sandbox/noderethink/node_modules/rethinkdb/net.js:624:35)
    at emitOne (events.js:90:13)
    at TLSSocket.emit (events.js:182:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at TLSSocket.Readable.push (_stream_readable.js:111:10)
    at TLSWrap.onread (net.js:529:20)
WHAT'S BREAKING?
RethinkDB drivers use a wire protocol which lets clients talk to the server. You can read about it in Writing RethinkDB Drivers. The protocol got up to version 0.4 for RethinkDB 2.0 to 2.2, but for 2.3 there was a major change; specifically an updated protocol 1.0 which would be able to detect previous protocols and fall back to them. RethinkDB 2.3.0 servers support this new protocol, and previous protocol versions. One of the advantages of the new 1.0 protocol is that it can be updated much more easily, so that in future, when a protocol change is introduced, the clients and servers will know how to fall back to a version of the protocol they both speak.
The catch is that this once-in-a-development-lifecycle change also creates a break in compatibility. While older drivers are able to talk to the new RethinkDB 2.3 server, the new 2.3 drivers only speak version 1.0 of the protocol, which means that clients using the newest driver can't speak to previous versions of the server.
If this sounds like it shouldn't affect you, think again. Modern software development platforms use package managers to download the various components that applications need to run. They are npm for Node, gem for Ruby, pip for Python, Maven for Java, and Go has it baked into its platform.
It makes development much easier - to add the RethinkDB driver to a Node project all you have to do is run npm install rethinkdb --save and you are ready to go. Or not. By default, these package managers download the latest version of the package, which is sensible.
With the release of RethinkDB 2.3 though, all those repositories have had their RethinkDB drivers updated, so if you ask to install a driver without qualifying what version you want, you'll get the 2.3 version. Then, when you go to connect to a Compose RethinkDB installation you get the protocol incompatibility message:
ERROR: Received an unsupported protocol version. This port is for RethinkDB queries. Does your client driver version not match the server?
Of course, package managers do allow you to set the version, or range of versions, you want to go with your package. When you find yourself in this situation, you need to uninstall the latest driver, find out what the most recent previous driver you can download is – it'll be version 2.2.something – and install that.
THE SOLUTIONS...
NODE.JS
To install the correct driver to talk to Compose servers run:
npm uninstall rethinkdb
npm install rethinkdb@2.2.3
You can add a dependency in your package.json file so that it only uses a version prior to 2.3.0 like this:
{ ""dependencies"" : { ""rethinkdb"" : ""<2.3.0"" } }
RUBY
For Ruby, the quick way to set up the driver is to run:
gem uninstall rethinkdb
gem install rethinkdb -v 2.2.0.4
You can get a list of available versions by running gem list -ra rethinkdb. You can specify in your application's Gemfile that you want any driver up to but not including 2.3.0 and later by adding:
gem 'rethinkdb', '< 2.3.0'
PYTHON
For Python applications, you need to run:
pip uninstall rethinkdb
pip install rethinkdb==2.2.0.post6
If you want an idea of what versions are available in future, run pip install rethinkdb==noversion and pip will fail to find version ""noversion"" and list all the other available versions. If your Python program has a pip requirements file, add this to require a pre-2.3.0 driver:
rethinkdb >=2.2.0,<2.3.0
There is, we are told, an undocumented flag in the 2.3.0 driver which lets it talk to older RethinkDB databases, designed mainly for testing. It's probably best to ignore that though, as it will involve modifying your code to use an unsupported path, and why do that when you can just set versions as part of the build.
JAVA
There's no world of command line package management for Java; it tends to be all declared in build configuration files for the various tools. With Maven, as an example, you may have this in your pom.xml:
<dependency>
  <groupId>com.rethinkdb</groupId>
  <artifactId>rethinkdb-driver</artifactId>
  <version>LATEST</version>
</dependency>
This would pull the latest version of the driver down. It's not that common a setting – Java developers often set version numbers in their pom.xml files – but it is how you can unwittingly be caught by this protocol change. Simply replace the version tag with:
<version>2.2-beta-6</version>
This, of course, pins the application to that version and we're done.
GO
The previous drivers are all official drivers, but our favourite unofficial driver is GoRethink. If you use v1 of the driver you won't need to do anything as that just supports the 0.4 protocol.
If you upgrade – which means specifically importing the v2 package – then that does use the 1.0 protocol by default, but even there, as the CHANGELOG notes, you can set the HandshakeVersion to 0.4 when connecting to enable access to older servers.","TL;DR: The latest RethinkDB drivers don't work with previous versions of RethinkDB. Take steps to ""pin"" your drivers to a compatible version.",Compose and RethinkDB 2.3's drivers,Live,187 519,"","In the domain of data science, solving problems and answering questions through data analysis is standard practice. Often, data scientists construct a model to predict outcomes or discover underlying patterns, with the goal of gaining insights. Organizations can then use these insights to take actions that ideally improve future outcomes.",Foundational Methodology for Data Science,Live,188 522,"PRACTICAL BUSINESS PYTHON Taking care of business, one python script at a time. Sun 30 November 2014
COMMON EXCEL TASKS DEMONSTRATED IN PANDAS Posted by Chris Moffitt in articles
INTRODUCTION
The purpose of this article is to show some common Excel tasks and how you would execute similar tasks in pandas. Some of the examples are somewhat trivial but I think it is important to show the simple as well as the more complex functions you can find elsewhere. As an added bonus, I'm going to do some fuzzy string matching to show a little twist to the process and show how pandas can utilize the full python system of modules to do something simply in python that would be complex in Excel. Make sense? Let's get started.
ADDING A SUM TO A ROW
The first task I'll cover is summing some columns to add a total column. We will start by importing our excel data into a pandas dataframe.
import pandas as pd
import numpy as np
df = pd.read_excel(""excel-comp-data.xlsx"")
df.head()
account name street city state postal-code Jan Feb Mar
0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 10000 62000 35000
1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 95000 45000 35000
2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 91000 120000 35000
3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 45000 120000 10000
4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 162000 120000 35000
We want to add a total column to show total sales for Jan, Feb and Mar. This is straightforward in Excel and in pandas.
For Excel, I have added the formula sum(G2:I2) in column J. Here is what it looks like in Excel: Next, here is how we do it in pandas: df[""total""]=df[""Jan""]+df[""Feb""]+df[""Mar""]df.head() account name street city state postal-code Jan Feb Mar total 0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 10000 62000 35000 107000 1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 95000 45000 35000 175000 2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 91000 120000 35000 246000 3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 45000 120000 10000 175000 4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 162000 120000 35000 317000Next, let’s get some totals and other values for each month. Here is what we are trying to do as shown in Excel: As you can see, we added a SUM(G2:G16) in row 17 in each of the columns to get totals by month. Performing column level analysis is easy in pandas. Here are a couple of examples. df[""Jan""].sum(),df[""Jan""].mean(),df[""Jan""].min(),df[""Jan""].max() (1462000, 97466.666666666672, 10000, 162000) Now, we want to add a total by month and grand total. This is where pandas and Excel diverge a little. It is very simple to add totals in cells in Excel for each month. Because pandas need to maintain the integrity of the entire DataFrame, there are a couple more steps. First, create a sum for the month and total columns. sum_row=df[[""Jan"",""Feb"",""Mar"",""total""]].sum()sum_row Jan 1462000 Feb 1507000 Mar 717000 total 3686000 dtype: int64 This is fairly intuitive however, if you want to add totals as a row, you need to do some minor manipulations. We need to transpose the data and convert the Series to a DataFrame so that it is easier to concat onto our existing data. The T function allows us to switch the data from being row-based to column-based. df_sum=pd.DataFrame(data=sum_row).Tdf_sum Jan Feb Mar total 0 1462000 1507000 717000 3686000The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing. df_sum=df_sum.reindex(columns=df.columns)df_sum account name street city state postal-code Jan Feb Mar total 0 NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000Now that we have a nicely formatted DataFrame, we can add it to our existing one using append . df_final=df.append(df_sum,ignore_index=True)df_final.tail() account name street city state postal-code Jan Feb Mar total 11 231907 Hahn-Moore 18115 Olivine Throughway Norbertomouth NorthDakota 31415 150000 10000 162000 322000 12 242368 Frami, Anderson and Donnelly 182 Bertie Road East Davian Iowa 72686 162000 120000 35000 317000 13 268755 Walsh-Haley 2624 Beatty Parkways Goodwinmouth RhodeIsland 31919 55000 120000 35000 210000 14 273274 McDermott PLC 8917 Bergstrom Meadow Kathryneborough Delaware 27933 150000 120000 70000 340000 15 NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000ADDITIONAL DATA TRANSFORMS For another example, let’s try to add a state abbreviation to the data set. From an Excel perspective the easiest way is probably to add a new column, do a vlookup on the state name and fill in the abbreviation. 
I did this and here is a snapshot of what the results looks like: You’ll notice that after performing the vlookup, there are some values that are not coming through correctly. That’s because we misspelled some of the states. Handling this in Excel would be really challenging (on big data sets). Fortunately with pandas we have the full power of the python ecosystem at our disposal. In thinking about how to solve this type of messy data problem, I thought about trying to do some fuzzy text matching to determine the correct value. Fortunately someone else has done a lot of work in this are. The fuzzy wuzzy library has some pretty useful functions for this type of situation. Make sure to get it and install it first. The other piece of code we need is a state name to abbreviation mapping. Instead of trying to type it myself, a little googling found this code . Get started by importing the appropriate fuzzywuzzy functions and define our state map dictionary. fromfuzzywuzzyimportfuzzfromfuzzywuzzyimportprocessstate_to_code={""VERMONT"":""VT"",""GEORGIA"":""GA"",""IOWA"":""IA"",""Armed Forces Pacific"":""AP"",""GUAM"":""GU"",""KANSAS"":""KS"",""FLORIDA"":""FL"",""AMERICAN SAMOA"":""AS"",""NORTH CAROLINA"":""NC"",""HAWAII"":""HI"",""NEW YORK"":""NY"",""CALIFORNIA"":""CA"",""ALABAMA"":""AL"",""IDAHO"":""ID"",""FEDERATED STATES OF MICRONESIA"":""FM"",""Armed Forces Americas"":""AA"",""DELAWARE"":""DE"",""ALASKA"":""AK"",""ILLINOIS"":""IL"",""Armed Forces Africa"":""AE"",""SOUTH DAKOTA"":""SD"",""CONNECTICUT"":""CT"",""MONTANA"":""MT"",""MASSACHUSETTS"":""MA"",""PUERTO RICO"":""PR"",""Armed Forces Canada"":""AE"",""NEW HAMPSHIRE"":""NH"",""MARYLAND"":""MD"",""NEW MEXICO"":""NM"",""MISSISSIPPI"":""MS"",""TENNESSEE"":""TN"",""PALAU"":""PW"",""COLORADO"":""CO"",""Armed Forces Middle East"":""AE"",""NEW JERSEY"":""NJ"",""UTAH"":""UT"",""MICHIGAN"":""MI"",""WEST VIRGINIA"":""WV"",""WASHINGTON"":""WA"",""MINNESOTA"":""MN"",""OREGON"":""OR"",""VIRGINIA"":""VA"",""VIRGIN ISLANDS"":""VI"",""MARSHALL ISLANDS"":""MH"",""WYOMING"":""WY"",""OHIO"":""OH"",""SOUTH CAROLINA"":""SC"",""INDIANA"":""IN"",""NEVADA"":""NV"",""LOUISIANA"":""LA"",""NORTHERN MARIANA ISLANDS"":""MP"",""NEBRASKA"":""NE"",""ARIZONA"":""AZ"",""WISCONSIN"":""WI"",""NORTH DAKOTA"":""ND"",""Armed Forces Europe"":""AE"",""PENNSYLVANIA"":""PA"",""OKLAHOMA"":""OK"",""KENTUCKY"":""KY"",""RHODE ISLAND"":""RI"",""DISTRICT OF COLUMBIA"":""DC"",""ARKANSAS"":""AR"",""MISSOURI"":""MO"",""TEXAS"":""TX"",""MAINE"":""ME""} Here are some example of how the fuzzy text matching function works. process.extractOne(""Minnesotta"",choices=state_to_code.keys()) ('MINNESOTA', 95) process.extractOne(""AlaBAMMazzz"",choices=state_to_code.keys(),score_cutoff=80) Now that we know how this works, we create our function to take the state column and convert it to a valid abbreviation. We use the 80 score_cutoff for this data. You can play with it to see what number works for your data. You’ll notice that we either return a valid abbreviation or an np.nan so that we have some valid values in the field. 
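The function defined next lost its line breaks when this page was captured, so here it is restated with its whitespace (and the imports it relies on) restored; it assumes the state_to_code dictionary defined above.
from fuzzywuzzy import process
import numpy as np

def convert_state(row):
    # find the closest state name; only accept matches scoring 80 or better
    abbrev = process.extractOne(row['state'],
                                choices=state_to_code.keys(),
                                score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan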
defconvert_state(row):abbrev=process.extractOne(row[""state""],choices=state_to_code.keys(),score_cutoff=80)ifabbrev:returnstate_to_code[abbrev[0]]returnnp.nan Add the column in the location we want and fill it with NaN values df_final.insert(6,""abbrev"",np.nan)df_final.head() account name street city state postal-code abbrev Jan Feb Mar total 0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 NaN 10000 62000 35000 107000 1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 NaN 95000 45000 35000 175000 2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 NaN 91000 120000 35000 246000 3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 NaN 45000 120000 10000 175000 4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 NaN 162000 120000 35000 317000We use apply to add the abbreviations into the approriate column. df_final['abbrev']=df_final.apply(convert_state,axis=1)df_final.tail() account name street city state postal-code abbrev Jan Feb Mar total 11 231907 Hahn-Moore 18115 Olivine Throughway Norbertomouth NorthDakota 31415 ND 150000 10000 162000 322000 12 242368 Frami, Anderson and Donnelly 182 Bertie Road East Davian Iowa 72686 IA 162000 120000 35000 317000 13 268755 Walsh-Haley 2624 Beatty Parkways Goodwinmouth RhodeIsland 31919 RI 55000 120000 35000 210000 14 273274 McDermott PLC 8917 Bergstrom Meadow Kathryneborough Delaware 27933 DE 150000 120000 70000 340000 15 NaN NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000I think this is pretty cool. We have developed a very simple process to intelligently clean up this data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had 15,000? You would have to do something manual in Excel to clean this up. SUBTOTALS For the final section of this article, let’s get some subtotals by state. In Excel, we would use the subtotal tool to do this for us. The output would look like this: Creating a subtotal in pandas, is accomplished using groupby df_sub=df_final[[""abbrev"",""Jan"",""Feb"",""Mar"",""total""]].groupby('abbrev').sum()df_sub Jan Feb Mar total abbrev AR 150000 120000 35000 305000 CA 162000 120000 35000 317000 DE 150000 120000 70000 340000 IA 253000 240000 70000 563000 ID 70000 120000 35000 225000 ME 45000 120000 10000 175000 MS 62000 120000 70000 252000 NC 95000 45000 35000 175000 ND 150000 10000 162000 322000 PA 70000 95000 35000 200000 RI 200000 215000 70000 485000 TN 45000 120000 55000 220000 TX 10000 62000 35000 107000Next, we want to format the data as currency by using applymap to all the values in the data frame. defmoney(x):return""${:,.0f}"".format(x)formatted_df=df_sub.applymap(money)formatted_df Jan Feb Mar total abbrev AR $150,000 $120,000 $35,000 $305,000 CA $162,000 $120,000 $35,000 $317,000 DE $150,000 $120,000 $70,000 $340,000 IA $253,000 $240,000 $70,000 $563,000 ID $70,000 $120,000 $35,000 $225,000 ME $45,000 $120,000 $10,000 $175,000 MS $62,000 $120,000 $70,000 $252,000 NC $95,000 $45,000 $35,000 $175,000 ND $150,000 $10,000 $162,000 $322,000 PA $70,000 $95,000 $35,000 $200,000 RI $200,000 $215,000 $70,000 $485,000 TN $45,000 $120,000 $55,000 $220,000 TX $10,000 $62,000 $35,000 $107,000The formatting looks good, now we can get the totals like we did earlier. sum_row=df_sub[[""Jan"",""Feb"",""Mar"",""total""]].sum()sum_row Jan 1462000 Feb 1507000 Mar 717000 total 3686000 dtype: int64 Convert the values to columns and format it. 
df_sub_sum=pd.DataFrame(data=sum_row).Tdf_sub_sum=df_sub_sum.applymap(money)df_sub_sum Jan Feb Mar total 0 $1,462,000 $1,507,000 $717,000 $3,686,000Finally, add the total value to the DataFrame. final_table=formatted_df.append(df_sub_sum)final_table Jan Feb Mar total AR $150,000 $120,000 $35,000 $305,000 CA $162,000 $120,000 $35,000 $317,000 DE $150,000 $120,000 $70,000 $340,000 IA $253,000 $240,000 $70,000 $563,000 ID $70,000 $120,000 $35,000 $225,000 ME $45,000 $120,000 $10,000 $175,000 MS $62,000 $120,000 $70,000 $252,000 NC $95,000 $45,000 $35,000 $175,000 ND $150,000 $10,000 $162,000 $322,000 PA $70,000 $95,000 $35,000 $200,000 RI $200,000 $215,000 $70,000 $485,000 TN $45,000 $120,000 $55,000 $220,000 TX $10,000 $62,000 $35,000 $107,000 0 $1,462,000 $1,507,000 $717,000 $3,686,000You’ll notice that the index is ‘0’ for the total line. We want to change that using rename . final_table=final_table.rename(index={0:""Total""})final_table Jan Feb Mar total AR $150,000 $120,000 $35,000 $305,000 CA $162,000 $120,000 $35,000 $317,000 DE $150,000 $120,000 $70,000 $340,000 IA $253,000 $240,000 $70,000 $563,000 ID $70,000 $120,000 $35,000 $225,000 ME $45,000 $120,000 $10,000 $175,000 MS $62,000 $120,000 $70,000 $252,000 NC $95,000 $45,000 $35,000 $175,000 ND $150,000 $10,000 $162,000 $322,000 PA $70,000 $95,000 $35,000 $200,000 RI $200,000 $215,000 $70,000 $485,000 TN $45,000 $120,000 $55,000 $220,000 TX $10,000 $62,000 $35,000 $107,000 Total $1,462,000 $1,507,000 $717,000 $3,686,000CONCLUSION By now, most people know that pandas can do a lot of complex manipulations on data - similar to Excel. As I have been learning about pandas, I still find myself trying to remember how to do things that I know how to do in Excel but not in pandas. I realize that this comparison may not be exactly fair - they are different tools. However, I hope to reach people that know Excel and want to learn what alternatives are out there for their data processing needs. I hope these examples will help others feel confident that they can replace a lot of their crufty Excel data manipulations with pandas. I found this exercise helpful to cement these ideas in my mind. I hope it works for you as well. If you have other Excel tasks that you would like to learn how to do in pandas, let me know via the comments below and I will try to help. 
",Common excel tasks in pandas part,Common Excel Tasks Demonstrated in Pandas,Live,189 528,"HORIZONTAL SCALING ARRIVES ON COMPOSE ENTERPRISE Published Apr 25, 2017. compose scaling mongodb
Today, Compose is bringing horizontal scaling to more databases on our Enterprise platform. MongoDB, Elasticsearch and ScyllaDB deployments join Compose's Redis as databases with horizontal scaling options on Compose Enterprise. For MongoDB, that means that MongoDB users will be able to add shards to their MongoDB deployments to spread their database load across systems. Collections can be split across shards and each can handle queries on its local data independent of other shards. For Elasticsearch and ScyllaDB users, they will have the option to add database nodes to their cluster and replicate their data across more hosts. By doing this, they increase redundancy in their configuration and allow more nodes to handle read loads (Elasticsearch) and read/write loads (ScyllaDB).
We're making this flexibility available to Compose Enterprise customers who need to do this particular form of scaling. Most users of Compose won't need horizontal scaling and can continue to use Compose's powerful vertical auto-scaling system, which adds resources to your database deployment precisely when they are needed.
DELIVERING HORIZONTAL SCALING
When we developed the horizontal scaling options for Compose we found, from working with customers, that there were many variables to take account of. So many that we also decided that we would make this what we are calling a guided feature. The horizontal scaling technology is built into the Compose platform, but we are only activating it for Compose Enterprise customers who have consulted with support on how well their needs fit with horizontally scaling their deployments. We'll guide them through the factors that will affect their deployment. Once support has confirmed a good fit, we will activate the feature for them to use as they want.
BEYOND ENTERPRISE
We are constantly refining the Compose platform and the user experience and will be revisiting how we deliver horizontal scaling regularly. For the immediate future, it's an exclusive feature available to Compose Enterprise users.
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.
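As a closing aside for readers unfamiliar with what splitting a collection across shards looks like at the database level, here is a rough sketch using standard MongoDB administration commands via pymongo. This is generic MongoDB usage, not a Compose-specific procedure, and the connection string, database, collection and shard key names are all invented.
from pymongo import MongoClient

# illustrative connection string; point this at your cluster's mongos router
client = MongoClient('mongodb://localhost:27017')

# enable sharding for a database, then shard one of its collections
client.admin.command('enableSharding', 'telemetry')
client.admin.command('shardCollection', 'telemetry.readings',
                     key={'device_id': 'hashed'})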
","Today, Compose is bringing horizontal scaling to more databases on our Enterprise platform.",Horizontal Scaling arrives on Compose Enterprise,Live,190 533,"THIS WEEK IN DATA SCIENCE (MARCH 28, 2017) Posted on March 28, 2017 by Janice Darling
Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful!
INTERESTING DATA SCIENCE ARTICLES AND NEWS
* Data Analytics for Societal Good – An account of an instance of Data Analytics for Societal Good.
* What Is Data Science, and What Does a Data Scientist Do? – A simple definition of the roles, experiences, qualifications etc. of the term Data Scientist.
* In Defense of Simplicity, A Data Visualization Journey – Discussing the field of Data Visualization.
* How does machine learning work? – Extract from the IBM booklet “How it works – Machine Learning”.
* Getting Started with Deep Learning – Different approaches to getting started with deep learning from a framework perspective.
* Interview questions for data scientists – Advice for recruiters and candidates for data science job interviews.
* IBM launches blockchain as a service for the enterprise – How IBM is enabling developers to quickly build and host secure blockchain networks via the IBM Cloud.
* Data Science vs. Data Analytics – Why Does It Matter? – Discussion of the difference between the terms Data Science and Data Analytics.
* Sentiment Analysis of Warren Buffett's Letters to Shareholders – Code and visualization of the results of Sentiment Analysis on Warren Buffett's Letters to Shareholders.
* Galvanize will teach students how to use IBM Watson APIs with new machine learning course – IBM will partner with Galvanize to familiarize students with IBM's suite of Watson APIs.
* How Data Science Can Help You Not to be Blindsided in Decision-Making – How Data Science can affect every business function.
* The Future of Machine Learning in Finance – Discussion of the future of Machine Learning in Finance.
* The Best Resources for Learning D3.js – A list of resources to learn the Javascript library d3.js.
* Understanding the power of real-time geospatial analytics – How the Geospatial Analytics service in IBM Bluemix can monitor moving devices from the Internet of Things.
* The Top 12 Tips for Data Visualization – Tips for creating simple yet effective data visualization.
* IBM Watson Health can now detect head trauma – IBM Watson Health partners with MedyMatch Technology, an Israel-based startup that uses advanced cognitive analytics and artificial intelligence to deliver medical solutions.
UPCOMING DATA SCIENCE EVENTS
* Introduction to Python with Data Analysis (Hands-On) – March 30, 2017 @ 6:00 pm – 9:00 pm
FEATURED COURSES FROM BDU
* Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
* Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
* Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course.
* Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google's library to apply deep learning to different data types in order to solve real world problems.
COOL DATA SCIENCE VIDEOS
* Machine Learning With Python – Supervised Learning K Nearest Neighbors – An introduction to the K Nearest Neighbors Algorithm.
* Machine Learning With Python – Supervised Learning Decision Trees – An overview of Decision Trees.
* Machine Learning With Python – Supervised Learning Random Forests – A brief discussion of Random Forests and their applications.
",Here's this week's news in Data Science and Big Data.,"This Week in Data Science (March 28, 2017)",Live,191 535,"SENSOR SENSIBILITY AT HULL DIGITAL Glynn Bird / February 12, 2016
C4Di (Centre for Digital Innovation) is a newly opened Digital Hub based in Kingston-upon-Hull. It hosts office space for established businesses and startups together with hot desks for individuals and companies. There are game developers, drone builders, music distributors, marketing agencies, and Kickstarter projects all hosted in a building with a 60Gbps, symmetric internet connection; that's 1Gbps per desk! I was invited to speak at the latest Hull Digital Meetup, which hosts gatherings for developers and entrepreneurs at regular intervals. Tonight's talk was entitled IoT – Sensor Sensibility, a title that I'm more proud of than I should be.
It described the buzzword that is the Internet of Things, the insane amount of investment that is pouring into IoT-related startups, and how the technology works: how MQTT is used to transmit data from sensors to the cloud and how the same protocol can be used to close the feedback loop. The talk also touched on using an Offline-First approach, storing data on the local device and syncing to the cloud later using Apache CouchDB & IBM Cloudant. Finally, the talk ran through some of the hardware that's available, from Raspberry Pis to SensorTags. Thanks to Jon Moss for hosting me at the fabulous C4Di headquarters. Here are the slides from the talk: IoT Sensor Sensibility – Hull Digital – C4Di – Feb 2016 from Glynn Bird.","My IoT talk in Hull included an Offline-First approach. Store data on a local device and sync to the cloud later using Apache CouchDB and IBM Cloudant.",Sensor Sensibility at Hull Digital,Live,192 536,"Greg Filla, Product manager & Data scientist — Data Science Experience and Watson Machine Learning. May 22
SPARK 2.1 AND JOB MONITORING AVAILABLE IN DSX
Today we are announcing support for Apache® Spark™ 2.1 and enhanced Spark job monitoring in the IBM Data Science Experience.
SPARK 2.1
The latest official release of Spark comes with plenty of new features, such as expanded structured streaming support (welcome Kafka 0.10 :-)), new algorithms available in SparkR, plus 1200 bug fixes to help make everything run smoothly. If you are interested in seeing the full list of changes from Spark 2.0 to 2.1, check out the Spark 2.1 Release Announcement.
SPARK JOB MONITORING
Do you ever kick off a Spark job in a Jupyter notebook and wonder if it's making any progress? Today, we are announcing that Python and Scala DSX notebooks now generate progress bars for Spark jobs. Let's see it in action: In addition to showing or hiding the progress bars at the cell level, you also have the option to hide all progress bar output as shown in the following example: Another way you can track activity on your Spark cluster is by using the Spark History Server. This can be accessed inside DSX notebooks by navigating to the environment tab. You can try out these features and more by creating a free account for Data Science Experience.
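For readers who want something concrete to run, a notebook cell like the following is enough to kick off a small Spark job and watch the progress bar appear. This is a hedged sketch: it assumes the notebook's preconfigured SparkContext is available as sc, and the numbers are arbitrary.
# a trivial Spark job: the progress bar is shown while its stages run
rdd = sc.parallelize(range(1000000), 8)
total = rdd.map(lambda x: x * x).sum()
print(total)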
-------------------------------------------------------------------------------- Originally published at datascience.ibm.com on May 22, 2017.",Today we are announcing support for Apache® Spark™ 2.1 and enhanced Spark job monitoring in the IBM Data Science Experience. The latest official release of Spark comes with plenty of new features…,Spark 2.1 and Job Monitoring Available in DSX,Live,193 538,"PODCASTS DATA SCIENCE FOR REAL-TIME STREAMING ANALYTICS April 18, 2017 | 11:38 OVERVIEW Listen to this podcast where Roger Rea, Senior Offering Manager for IBM Streams, shares his thoughts on how data scientists can create real-time applications using IBM Streams. Find more information about IBM Streams Topics: Analytics, Big Data Use Cases, Data Scientists Tags: streaming analytics, real-time analytics, Streams, data science
",Listen to this podcast where Roger Rea, Senior Offering Manager for IBM Streams, shares his thoughts on how data scientists can create real-time applications using IBM Streams.,Data science for real-time streaming analytics,Live,194 544,"mark simmonds, Program Director, IBM Analytics Development, Snowboarder and Archer. Jul 24 -------------------------------------------------------------------------------- ARTIFICIAL INTELLIGENCE, ETHICALLY SPEAKING (Image: Wikimedia Commons license) “I'm Sorry Dave — I'm afraid I can't do that.” Many readers will recognize that line from Stanley Kubrick's “2001 — A Space Odyssey”, a film in which the onboard computer, a HAL 9000, perceives an astronaut to be a threat to its “existence” and refuses to open the airlock to allow the crew member back into the ship. Other films like “Ex-Machina”, “i-Robot”, and “Terminator” sow similar fears of Artificial Intelligence systems with cognitive capabilities taking control from humans, rendering us defenseless.
Of course, there are also films that focus of on the positive aspects of AI such as “Bicentennial Man”. My view is that AI systems are increasingly necessary to augment what we do in our everyday lives — whether that means… … turning devices on or off, intelligently learning when and where to do so, … repeating mundane tasks, … giving us additional insights into human existence, or … guiding us toward better decisions … and beyond. So, why all the fear? Partly because there is so much misinformation and hype — and some people just like to sell fear, uncertainly and doubt (F.U.D.). And it’s true that there will always be people who seek to exploit technology to do bad things — the dark side v the light side (Star Wars fans). Nonetheless, hype is a valuable part of the technology lifecycle. It allows us to consider use cases (sometimes extreme) that were not initially considered relevant. What’s clear is that machine learning in all its forms is here to stay. It has established its place in the world and particularly in business — from detecting and identifying trends and patterns faster and more often than humans alone could ever achieve (while learning and become progressively smarter as they go) — to helping predict outcomes and taking action to prevent fraud — to slashing the time it takes to design advanced cancer treatment programs and health programs (see figure 1) — to anticipating terror attacks — to recognizing business opportunities that might last only a moment — to ridding processes of personal bias and prejudice. I believe that machine learning and AI systems have the potential to make our world a safer and better place. Figure 1 : Healthcare embraces AI and machine learningEven so, the ethical side of machine learning is increasingly called into question. The potential of machine learning and its application to all things AI means we need rules and controls — not to prevent progress but to help manage and control how and when progress occurs. Let’s walk through some scenarios. MACHINE LEARNING APP ENVY What if machine learning algorithms are pitched against each other to win a battle, say a game of chess or other simulation? Not a big deal. An outcome could be defined as not losing to an opponent — establishing a win or at least a draw. But what if these machine learning systems were used in a war situation against each other? Human life, entire civilizations, and life itself are the stakes. Winning is this case could be defined as seeking an acceptable outcome while minimizing losses. That’s why we humans need to be careful to avoid delegating 100% authority to an AI system in such situations. ENDING LIFE VS. SAVING LIFE There’s general agreement that humans should have the final say where human life is concerned, but does that allow us to play “God” if an AI system demonstrates it can preserve life even though a human may believe it is best to end a life? While most humans would seek to preserve life, greed, personal bias, hate, jealousy can often be powerful dark forces that can be used to serve judgement. It is important that any decision involving an AI system must have an audit trail clearly showing a path to the outcome. After all, AI systems can learn from these outcomes also. NON-HUMAN LIFE FORMS Moving beyond humans, nothing stops us from applying AI to animal behavior. We have performed enough animal psychology over the years to think we understand animals. Would an AI system be better at training an animal? 
Would it be ethical in man’s perceived superiority and domination over all other species to subject those species to AI? Again, we must consider under what circumstances AI can be used to make decisions over other life forms. CONSCIENCE AND COMPASSION Today, limited by what we know of life, physics and computing, AI systems are just computer models and simulations of human behavior. Could a network of AI systems have a conscience — even though it may be simulated? My personal feelings are unique to my life experiences so what makes me happy or reduces me to tears is different from other humans. Emotions are chemical reactions. AI systems are not. But what if AI systems could apply cognitive actions and outcomes to a bank of human chemicals in a controlled environment to learn about emotion? It is conceivable that an AI system could therefore develop a conscience and even compassion. BIG RELIGION Big Religion is a phrase I hear more and more as Big Data became an established term. It means looking at scriptures and religions with other sources of data, events and the tools of science. It scares a lot of people for the challenge it might pose to their belief systems. Some may fear that is also challenges the power and control associated with some religious establishments. Nonetheless this is happening today and can’t be stopped. Humans inevitably seek to more deeply understand the universe and world around us, challenging ourselves about what we perceive as the truth beyond our faith. SPANNING CULTURAL DIVIDES AND VALUE SYSTEMS Diversity makes the world a fascinating place. It’s one of the reasons many of us decide to vacation in different parts of the world to experience other cultures, food, traditions, languages. In doing so we learn more about history, different belief and value systems. I wonder whether AI systems built within different cultures with different value systems will behave differently with those of other cultures. Consider a global AI system that encompasses/embraces all of this diversity and difference. What might be the global impact on world leaders? WE ARE ONE With recent advances is nanotechnology, it’s possible for nanobots to enter our bodies — even our bloodstreams — to attack viruses and potentially repair damaged bones and tissue. There may be a time where we can use nanotechnology to fight obesity or vainly to enhance our looks, our physical performance. If these nanobots exist forever in our bodies, do we become part human and part something else? There is a lot of research happening in the area. RESPONSIBILITY V ACCOUNTABILITY There are some things that humans must have the final say on. Checks and balances. How far are we prepared to go in delegating responsibility to the AI system? How well are the policies designed? Have some policies been designed or even adapted over time by machine learning? While machines could be responsible for sustaining or ending life, can they be held accountable — and if so, what are the legal implications? Today humans carry the burden of both responsibility and accountability. This aligns with our legal systems, but we can’t put an AI system on trial — we just don’t have the legal capacity to do that today. AI systems learn from human interactions and both the data we produce and the data it produces. Would that imply that many people would potentially be on trial should a legal case emerge involving AI systems? 
Could the AI system or its creators assert that the human legal system has no jurisdiction over it, or that the legal system even infringes the rights of the AI system? Ethics in this area are just not mature enough today to give us clear answers. But it's only a matter of time before we encounter such situations. Finally, we could ask whether it's ethical for AI systems to design and implement their own set of ethics. I guess my answer would be yes — provided humans remain involved and can override any final outcomes where decisions involving human life and welfare are concerned. SUMMARY AI systems already augment what we do and the decisions we make today. The human species will push the boundaries of machine learning, cognitive computing and AI systems beyond our current perceptions of their application, through positive and negative exploitation that will ultimately result in AI systems capable of achieving outcomes beyond our imaginations. The ethics will only emerge as cases arise that test our legal systems, our value systems and even our belief systems. Despite some of the F.U.D. we read, I believe that machine learning can help our world become a smarter, safer and better place for us and future generations — future generations of people and AI systems. For more information on AI, cognitive computing and IBM research click here.","My view is that AI systems are increasingly necessary to augment what we do in our everyday lives — whether that means… So, why all the fear?","Artificial Intelligence, Ethically Speaking – Inside Machine learning – Medium",Live,195 546,"CREATING AN AWS VPC AND SECURED COMPOSE MONGODB WITH TERRAFORM Published Mar 2, 2017. Connecting to Compose MongoDB from Amazon VPC? Using Terraform for orchestration? In this Write Stuff article, Yamil Asusta shows us how to create secure connections to Compose MongoDB using Terraform and Amazon VPC. Security is often overlooked when busy shipping products. As a result of that, thousands of databases are being held captive from their operators. The attack was possible because none of the security alternatives were implemented for their deployments. Luckily for us developers, Compose provides us with deployments that include security defaults which can be further expanded to reduce risk. In this post, I hope to explain some basic security practices to lock down access to a MongoDB deployment from VPC. AWS VPC Assuming we are starting from scratch, we need to spin up some infrastructure in which we can launch our servers. To do so, we will use one of my favorite tools, Terraform. Create a main.tf file and add the following: provider ""aws"" { region = ""us-east-1"" # feel free to adjust } This tells Terraform our target region for the next operations. CREATING A VPC Let's proceed with creating a VPC. For the purposes of this post, we will only launch 1 public subnet and 1 private subnet using Segment.io's Stack.
Add the following to the file: module ""vpc"" { source = ""github.com/segmentio/stack//vpc"" name = ""my-test-vpc"" environment = ""staging"" cidr = ""10.30.0.0/16"" internal_subnets = [""10.30.0.0/24""] external_subnets = [""10.30.100.0/24""] availability_zones = [""us-east-1a""] # ensure it matches the one for your provider } Note: Do not go to production with this setup since it will leave you prone to downtime in the scenario where the Availability Zone collapses. This ""vpc"" module will launch an Internet Gateway and attach it to the VPC, thus allowing instances launched in the public subnet to reach the internet (assuming the were assigned a public IP). Additionally, it launches the most important piece, a NAT server. The NAT is launched in a public subnet and is linked to a private subnet which in result, gives instances in the subnet access to the internet. The NAT is provisioned with an Elastic IP and all requests coming from the private subnet will have this IP (see where I'm going with this?). MAKING THE PRIVATE SUBNET AVAILABLE Now we have reachable subnet and one that isn't. How do we fix that? Let's create a bastion which will let us jump from our public subnet to our private ones. Add this to the file: module ""bastion"" { source = ""github.com/segmentio/stack//bastion"" region = ""us-east-1"" # make sure it matches the one for the provider environment = ""staging"" key_name = ""my awesome key"" # upload this in the AWS console vpc_id = ""${module.vpc.id}"" subnet_id = ""${module.vpc.external_subnets[0]}"" security_groups = ""${aws_security_group.bastion.id}"" } resource ""aws_security_group"" ""bastion"" { name = ""bastion"" description = ""Allow SSH traffic to bastion"" vpc_id = ""${module.vpc.id}"" ingress { from_port = 22 to_port = 22 protocol = ""tcp"" cidr_blocks = [""0.0.0.0/0""] } egress { from_port = 0 to_port = 0 protocol = ""-1"" cidr_blocks = [""0.0.0.0/0""] } lifecycle { create_before_destroy = true } } The security group of the bastion only allows SSH for inbound. We could further tighten it up but we are going to keep it simple for the sake of example. Let's launch an instance in the private subnet using the following: resource ""aws_instance"" ""instance"" { ami = ""ami-0b33d91d"" # Amazon Linux AMI key_name = ""my awesome key"" instance_type = ""t2.nano"" subnet_id = ""${module.vpc.internal_subnets[0]}"" vpc_security_group_ids = [""${aws_security_group.instance.id}""] associate_public_ip_address = false tags { Name = ""ComposeIPWhitelisted"" } } resource ""aws_security_group"" ""instance"" { name = ""instance"" description = ""Allow SSH traffic from bastion"" vpc_id = ""${module.vpc.id}"" ingress { from_port = 22 to_port = 22 protocol = ""tcp"" security_groups = [""${aws_security_group.bastion.id}""] # only the bastion SG can access me :) } egress { from_port = 0 to_port = 0 protocol = ""-1"" cidr_blocks = [""0.0.0.0/0""] } lifecycle { create_before_destroy = true } } Notice that the security group for the instance only allows traffic from the bastion's security group. Once we have this ready, let's add some outputs so we can get going. output ""bastion-ip"" { value = ""${module.bastion.external_ip}"" } output ""nat-ips"" { value = ""${module.vpc.internal_nat_ips}"" } output ""instance-ip"" { value = ""${aws_instance.instance.private_ip}"" } At this point, your main.tf must look similar to this one . 
Terraform time: $ terraform get # pulls dependencies $ terraform plan # this will show you the things to be created/destroyed in the next step $ terraform apply # applies the plan, effectively creating our infrastructure Once the apply is complete, we can SSH into our bastion using the resulting IP by running: $ ssh -A ubuntu@bastionIP # assuming we selected the same key pair, -A will forward our keys, allowing us to jump with them Within the bastion, SSH into our private instance by running: $ ssh ec2-user@instanceIP # ec2-user is the default user of the Amazon Linux AMI CONFIGURING MONGODB Go ahead and provision a MongoDB deployment from the Compose dashboard. Be sure to select Enable SSL access. By enabling this, Compose will provide us with SSL certificates, which will allow us to encrypt our data in transit. This prevents man-in-the-middle attacks. When the deployment is ready, we will be able to access the deployment dashboard. From here we need to do two things: 1. Create a user that we can later use to authenticate against the database. To do so, click on the Browser tab, select the admin database and click Add User. Make sure to remember the password as it will not be available from this point forward. 2. Obtain the SSL certificate we will use to connect to our database. In the Overview tab, there will be a section called ""SSL Certificate (Self-Signed)"". Its contents are hidden and you will be prompted for your password in order to make them visible. This will be available at all times for your convenience. Let's tie everything up now! Within our target host, install the MongoDB shell. If you kept the same AMI (Amazon Linux AMI) you can follow this guide. Additionally, create a file called cert.pem whose contents are the SSL certificate found in the dashboard. You should be able to connect to your MongoDB using this command now: $ mongo --ssl --sslCAFile cert.pem /admin -u -p The data we transmit will be encrypted when we use our certificate. Only one problem left: our MongoDB is still open for anyone to try to authenticate against. Let's fix it by using the IP Whitelist feature. Back in the dashboard, visit the Security tab. Under the section Whitelist TCP/HTTP IPs, select Add IP. When prompted, add the IP address value of the nat-ips output from Terraform. Once the feature is active, all connections that are not from Compose or our designated list will be dropped. Let's make a quick test! Try connecting to MongoDB one more time from our instance. It should work as intended. Now try accessing it from your local network and tell me how it goes ;)
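A footnote that is not part of the original walkthrough: if the service on the private instance talks to MongoDB from Python rather than the shell, a roughly equivalent connection with pymongo would look like the sketch below. The host, port and credentials are placeholders, and on older pymongo releases the options are spelled ssl and ssl_ca_certs rather than tls and tlsCAFile.

# Sketch only: connect to the Compose MongoDB deployment over TLS from Python.
# <host>, <port>, <user> and <password> are placeholders for your own deployment.
from pymongo import MongoClient

client = MongoClient(
    'mongodb://<user>:<password>@<host>:<port>/admin',
    tls=True,               # encrypt data in transit
    tlsCAFile='cert.pem'    # the self-signed certificate saved earlier
)

# This only succeeds when the request originates from a whitelisted IP,
# i.e. the NAT's Elastic IP when run from the private instance.
print(client.admin.command('ping'))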
Attribution: Pexels. This article is licensed with CC-BY-NC-SA 4.0 by Compose.",Compose provides us with deployments that include security defaults which can be further expanded to reduce risk. In this post, I hope to explain some basic security practices to lock down access to a MongoDB deployment from VPC.,Creating an AWS VPC and Secured Compose MongoDB with Terraform,Live,196 550,"CLOUDANT QUERY GROWS UP TO HANDLE AD HOC QUERIES By Glynn Bird, June 1, 2015. Cloudant's NoSQL Database-as-a-Service allows you to store JSON documents in the cloud using a simple HTTP API. Cloudant comes equipped with a number of indexes that allow you to query your data in several powerful ways: * Primary Index to retrieve documents by their id, which is the primary key * MapReduce to do secondary key lookups and online analytics * Cloudant Search for full-text, wildcard and faceted search * Cloudant GeoSpatial for complex polygon and 4D spatial queries * Cloudant Query, a declarative query language that incorporates a number of indexing capabilities. Cloudant Query is the best way to get started with querying Cloudant databases; a simple API call is used to define the list of fields to be indexed. Under the hood, Cloudant Query can leverage various indexes to provide a full breadth of querying capabilities. SAMPLE DATA In order to demonstrate the new features we need some sample data. The following database contains 9,000 movie documents in the following format: { ""_id"": ""71562"", ""_rev"": ""1-72726eda3b8b2973ef259dd0c7410a83"", ""title"": ""The Godfather: Part II"", ""year"": 1974, ""rating"": ""R"", ""runtime"": ""200 min"", ""genre"": [ ""Crime"", ""Drama"" ], ""director"": ""Francis Ford Coppola"", ""writer"": [ ""Francis Ford Coppola (screenplay)"", ""Mario Puzo (screenplay)"", ""Mario Puzo (based on the novel \""The Godfather\"")"" ], ""cast"": [ ""Al Pacino"", ""Robert Duvall"", ""Diane Keaton"", ""Robert De Niro"" ], ""poster"": ""http://ia.media-imdb.com/images/M/..._V1_SX300.jpg"", ""imdb"": { ""rating"": 9.1, ""votes"": 656, ""id"": ""tt0071562"" } } To use this data set: * Sign up for a Cloudant account * Replicate the database into your account. Choose Replication → New Replication and complete the form: Source Database: Remote database - https://examples.cloudant.com/query-movies ; Target Database: New local database - ""movies"" CREATING A CLOUDANT QUERY INDEX Once the data has replicated to your Cloudant account, we can instruct Cloudant to create an index from the Cloudant Dashboard by selecting the database and choosing Query → + → New Query Index. The form will be pre-filled with an index definition of: { ""index"": { ""fields"": [ ""foo"" ] }, ""type"": ""json"" } In our case, we are going to overwrite the sample with a text type index that automatically indexes all fields in all documents in the database. Replace the JSON text to have { ""index"": {}, ""type"": ""text"" } as shown in the screenshot below. Simply click ""Create Index"" to instruct Cloudant to index the movie data. The ""text"" index type is new in this iteration of Cloudant Query and by default indexes all the fields in your document.
We can supply the individual fields to be indexed (in the index object), but by supplying an empty object we are asking for everything to be indexed. The same instruction can be issued using the Cloudant API: curl -X POST https://user:pass@account.cloudant.com/movies/_index -d '{ ""index"": {}, ""type"": ""text"" }' -- substituting user, pass and account for your own personal Cloudant credentials. QUERYING A CLOUDANT QUERY INDEX Cloudant Query queries are JSON documents with the following top-level items: * selector - which subset of the data to return; the equivalent of the WHERE part of an SQL statement * fields - the fields to be returned; the equivalent of the SELECT part of an SQL statement * sort - how the result set is to be ordered; the equivalent of the ORDER BY part of an SQL statement * limit - how many results to return. For example, the SQL statement SELECT title, year FROM movies WHERE imdb.rating > 9.0 ORDER BY year ASC LIMIT 10 corresponds to the Cloudant Query { ""fields"": [""title"", ""year""], ""selector"": { ""imdb.rating"": { ""$gt"": 9.0 } }, ""sort"": [ { ""year:number"": ""asc"" } ], ""limit"": 10 }. At its simplest, a query looks like this: { ""selector"": { ""year"": 2012 } } The above query is looking for films where the year field is equal to 2012. Queries can be cut-and-pasted into the Cloudant Dashboard. Clicking ""Run Query"" posts the results in the right-hand panel. The Cloudant Query API can also be used to perform queries by POSTing to a database's _find endpoint: curl -X POST https://user:pass@account.cloudant.com/movies/_find -d '{ ""selector"": { ""year"": 2012 }, ""limit"": 10 }' CLOUDANT QUERY SELECTOR The selector part of the JSON query allows you to specify which subset of the database to return. Selectors can take several forms: one field: ""selector"": { ""year"": 2012 } multiple fields: ""selector"": { ""year"": 2012, ""rating"": ""R"" } condition operators ($gt, $lt, $eq, $ne ... see our docs for the full list):
see our docs for full list ):""selector"": { ""imdb.rating"": { ""$gt"": 9.0 } }free-text match (the $text operator matches any field in your document):""selector"": { ""$text"": ""Al Pacino"" }match arrays (exactly):""selector"": { ""genre"": [ ""Animation"", ""Comedy"" ] }match value is in array:""selector"": { ""genre"": { ""$in"": [""Horror""] } }match any values are in array:""selector"": { ""year"": { ""$in"": [2013,2015] } }match values are not in array""selector"": { ""year"": { ""$nin"": [2013,2015] } }the existence of fields:""selector"": { ""rating"": { ""$exists"": true } }We can combine the $and , $or and $not operators to produce complex queries:""selector"": { ""$and"" : [ { ""year"": { ""$lt"": 1990 } }, { ""imdb.rating"": { ""$gt"": 7.0 } }, { ""$text"": ""Marlon Brando"" } ]}""selector"": { ""$and"" : [ { ""year"": { ""$gt"": 1980 } }, { ""year"": { ""$lt"": 1990 } }, { ""$not"": { ""title"": ""Aliens"" } }, { ""$text"": ""Sigourney Weaver"" } ]}""selector"": { ""$or"" : [ { ""director"": ""George Lucas"" }, { ""director"": ""Steven Spielberg"" } ]}CLOUDANT QUERY FIELDSThe fields element can be used to instruct the Cloudant Query engine to only return asubset of the underlying documents e.g.{ ""selector"": { ""cast"": { ""$in"": [ ""Julia Roberts"" ] } }, ""fields"": [ ""title"", ""year"", ""imdb.rating"" ], ""limit"": 10}returns only partial documents e.g.{ ""title"": ""Flatliners"", ""year"": 1990, ""imdb"": { ""rating"": 6.5 }}CLOUDANT QUERY SORTIf a sort element is supplied, then the results set is sorted according to the suppliedarray e.g.{ ""selector"": { ""cast"" : { ""$in"" : [""Tom Hanks""] } }, ""sort"": [ { ""year:number"": ""desc"" } ] }With indexes where type=""text"", each field must be paired with the type of thatfield (number or string) to instruct Cloudant Query to treat it as a numericalor alphabetic sorting algorithm. Sort orders can be either ascending ( asc ) or descending ( desc ).Multi-dimensional sorts can be achieved by adding to the sort array:{ ""selector"": { ""cast"" : { ""$in"" : [""Tom Hanks""] } }, ""sort"": [ { ""year:number"": ""asc"" }, { ""title:string"": ""asc"" } ] }CLOUDANT QUERY PAGINATIONWhen using Cloudant Query's type=""text"" indexes, pagination is performed by: * page 1 - performing a query to get first page of search results * page 2 - repeating the query but adding the bookmark parameter received in the reply to the first requeste.g.we perform our first query:curl -X POST https://user:pass@account.cloudant.com/movies/_find -d '{ ""selector"": { ""year"": 2012 }, ""limit"": 10}'which gives a reply of:{ ""docs"":[ ... ], ""bookmark"": ""g2wAAAABaANkABxkYmNvcmVAZGIxLm""}To get the second page of results, we repeat the query and add the firstrequest's bookmark into our object:curl -X POST https://user:pass@account.cloudant.com/movies/_find -d '{ ""selector"": { ""year"": 2012 }, ""limit"": 10, ""bookmark"": ""g2wAAAABaANkABxkYmNvcmVAZGIxLm""}'The bookmark concept is the same mechanism used by Cloudant Search and providesa scalable way to paginate through large result sets.WHAT'S THE DIFFERENCE BETWEEN ""JSON"" AND ""TEXT"" INDEXES?Indexes based on type=""json"" become MapReduce-based materialized views under thehood. Their fixed key structure will only allow queries that match the keystructure. i.e., if we create a ""json"" index based on title , firstname and lastname , we can perform queries based on those three fields but not just lastname , for instance. 
Type=""json"" indexes are quicker to build and may be quicker forsingle-field lookups.Indexes based on type=""text"" become Lucene-based indexes under the hood and cananswer arbitrary queries based on any of the indexed fields in any order.Type=""text"" indexes are the easiest way to start with Cloudant Query as theyindex all fields by default allowing ad-hoc querying of a data set.WATCH CLOUDANT QUERY IN ACTIONThis video provides an overview of Cloudant Query.This video shows you how to build and query a Cloudant Query index.REFERENCES * For further information on Cloudant Query text indexes, please refer to our documentation . * The movie database is a subset of data from OMDB API and is published with permission under Creative Commons licence.Please enable JavaScript to view the comments powered by Disqus.SIGN UP FOR UPDATES!RECENT POSTS * Data Privacy and Governance Update * Cloudant Warehousing: New features and improvements * Announcing ISO 27001 Compliance for Cloudant, dashDB and BigInsights! * Understanding Mango View-Based Indexes vs. Search-Based Indexes * Introducing Monitoring Plugins for IBM Cloudant LocalBlog archive Follow @cloudantPRODUCT * Why DBaaS? * Features * Pricing * DBaaS ComparisonDOCS * Getting Started * API Reference * Libraries * GuidesFOR DEVELOPERS * FAQ * Sample AppsRESOURCES * Blog * Case Studies * Data Sheets * Training * Webinars * Whitepapers * Videos * EventsCOMPANY * About Us * Contact UsNEWS * In the Press * Press Releases * Awards * Terms Of Use * | * Privacy * | * ©IBM Corporation 2016","Cloudant Query is the best way to get started with querying Cloudant databases; a simple API call is used to define the list of fields to be indexed. Under the hood, Cloudant Query can leverage various indexes to provide a full breadth of querying capabilities.",Cloudant Query Grows Up to Handle Ad Hoc Queries,Live,197 553,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectTHE NEW SIMPLE DATA PIPEMike Broberg / February 24, 2016Today, we’re introducing a refactored and streamlined Simple Data Pipe , our open-source data movement project. While the workflow for piping data haschanged, the new architecture opens up more free options for data movement onto,or off of, the IBM cloud.WHY CHANGE THE PIPE?Services are changing rapidly on IBM’s Bluemix application platform . As these services evolve, we wanted to create a more modular Simple Data Pipethat could better deal with new features and brand new products.If you’re already using the Simple Data Pipe, don’t fear. We can still move datato dashDB , IBM’s cloud data warehouse. I’ll cover the mechanics of analytics workflowslater on. For now, let’s look at The Pipe’s new architecture and our motivationsbehind it.A SIMPLER DATA PIPE ARCHITECTUREIt’s all about getting data. The big problem the Simple Data Pipe solves hasalways been about sourcing data from disparate Web APIs. The Pipe captures thatdata in its native structure, and persists it in a database that’s flexibleenough to adapt to your plans for processing it.The new Simple Data Pipe no longer assumes that you plan to process data for aparticular use (analytics), in a particular place (dashDB). We’ve modularizedthe architecture of The Pipe by separating the step of landing data in Cloudant from the step of moving data to a different, more specialized place. 
Here’s an “annotated” architecturediagram:The new Simple Data Pipe lands data in CloudantInstead of automating the process of moving data from REST sources → Cloudant → dashDB , the new Simple Data Pipe is scoped more narrowly to REST sources → Cloudant and ends the process there. It’s a cleaner, more modular approach that webelieve better handles the rate of innovation in the Bluemix ecosystem and makesthe data pipe more useful to applications beyond analytics use-cases.What the Pipe has lost in push-button, end-to-end data movement, it has gainedin flexibility. Also, it still allows for future implementations that do move data end-to-end, whenever free APIs are available for analytics engineslike IBM’s Apache Spark service , warehouses like dashDB, and other tools.MORE OPTIONS FOR YOUR NEXT MOVEFor users who are focused on analytics use-cases, the new Simple Data Pipe canstill connect to dashDB, although that connection is no longer baked in. It’snow a separate step completed in Cloudant. While this roster will expand, hereis the current set of options for moving data out of Cloudant: * dashDB , via native Cloudant integration with dashDB. Finish movement using Cloudant’s web dashboard . * Apache Spark , via native Cloudant integration with Bluemix’s Spark service. Finish movement by calling the Cloudant connector in a Spark Scala Notebook . * Transporter , the open source ETL pipeline by Compose.io. Finish movement by configuring package info and associated JavaScript code. * DataWorks , enterprise-grade APIs for data shaping & movement. A paid service on Bluemix as of February 2016. Provision DataWorks on Bluemix first, before deploying the new Simple Data Pipe.When compared to the previous version of the Simple Data Pipe — aside from astreamlined architecture — we’ve removed The Pipe’s dependence on DataWorks.Connecting the DataWorks APIs to the data pipe is still an option, but byremoving this dependency, Cloudant can provide more options for data movement.Moving “Piped” data into dashDB via the Cloudant dashboardWHERE TO GET THE NEW PIPEThe same place as always on our developerWorks site . There you’ll find links to our GitHub repos and other instructions. In thecoming weeks we’ll be updating content to reflect the new Simple Data Pipe.We’ll also kick off a new series of tutorials that shows all the ways you canwork with the Data Pipe’s additional targets.Let’s get that data moving, y’all.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
",Introducing a refactored Pipe architecture for cloud data movement. Connect to REST data sources, and land data all in one place, in its native structure.,New Simple Data Pipe: Easier cloud data movement,Live,198 556,"DATALAYER: STORAGE WARS - THE ART GENOME PROJECT Published Nov 21, 2016. As you can see, DataLayer Conf was full of great talks and this next one is no exception. Daniel Doubrovkine, CEO of Artsy.net and 2016 Ruby prize award nominee, took the stage. Daniel presented Artsy.net's Art Genome Project, a classification system and technological framework that powers Artsy. The Art Genome Project maps the characteristics (Artsy.net calls them “genes”) that connect artists, artworks, architecture, and design objects across history. There are currently over 1,000 characteristics in The Art Genome Project, including art-historical movements, subject matter, and formal qualities. This is the story of the evolution of the data layer and nearest neighbor search technology, and the lessons learned, from MongoDB and PostgreSQL through Elasticsearch at Artsy.net. --------------------------------------------------------------------------------","Daniel Doubrovkine, CEO of Artsy.net and 2016 Ruby prize award nominee, took the stage. Daniel presented Artsy.net's Art Genome Project, a classification system and technological framework that powers Artsy.",DataLayer Conference: Storage Wars - The Art Genome Project,Live,199 557,"
AI REVOLUTIONIZES INDUSTRIES, NOT WORLD DOMINATION By John N - November 10, 2016 (Image: Tatiana Shepeleva | Shutterstock.com) Speaking at a programmer's conference this past summer, Bill Gates referred to AI as the “Holy Grail of computer science research”, illustrating how scientists from all fields of study are working to create intelligent machines as a tool for humanity. However, fears over creating smart machines that will enslave humanity persist. Is it possible to change the popular view of technology as inevitably evil by encouraging a more profound understanding of how AI works and its potential to assist us in our daily lives? AI seems to have an ever-increasing influence on how our world works, from Facebook's facial recognition software for image tagging to spellcheck. Other big technology companies like Google also use AI extensively, and the expectation is that AI will revolutionize industries and will continue to supplement work that humans no longer (have to) do themselves. However, because automation is beginning to eliminate thousands of human jobs each year and investments in AI research and start-ups are exploding, many fear the rise of Matrix-style machines. AI REVOLUTIONIZES INDUSTRIES Traditional computers, while powerful, lack the capacity to be self-aware or make independent decisions.
AI, in contrast, can make decisions on its own and even adapt to new rules without being prompted to do so. Therefore, AI already demonstrates some ability to understand the environment and become self-aware. With this realization, the assumption is that the machine will be more efficient at solving complex problems and doing large volumes of calculations as quickly as possible. For example, Google uses AI to deliver better search engine results and even experiment with self-driving cars. In addition, Facebook also uses AI to optimize customer experience on its social media sites. Amazon, for its part, uses intelligent robots in its warehouses to collect items for packaging. In the manufacturing industry, AI-driven machines can do everything from coordinating whole production lines to carrying out the smallest and most menial tasks. The evidence of broad adoption of AI is there, but for the average person, AI might appear a distant and abstract concept. However, we engage AI every time we ask Google to help us find a good fried chicken joint. ROBOT OVERLORD Despite the benefits of AI, it continues to be treated with suspicion due, in part, to Hollywood depictions of hyper-aggressive intelligent robots harvesting our bodies for energy. In the Terminator movies, for example, robot assassins are sent by Skynet, an AI defense network that seeks to exterminate the human race. Even when a company's goal is to use AI to improve the quality of human life, it must account for consumer suspicion. Distrust and anxiety make it harder to garner interest in many AI-powered technologies, perhaps because fear is often a product of a lack of understanding or a lack of information. “We won't stop needing technology until we run out of problems” – Tim O'Reilly. COULD SKYNET BECOME A REALITY? Technologies (like AI) are a tool, and tools are neither inherently good nor bad. Instead, their merit depends on how we use them. AI is a tool that, so far, seems to be making life easier and safer. For instance, the adoption of self-driving cars has the potential to save thousands of lives lost every year from traffic accidents – again depending on how the technology is used. Furthermore, AI can save the lives of patients through better, earlier diagnosis brought by seamless access to medical data. These AI applications are limited and highly specialized, however. Until someone desires to build a machine hellbent on world domination, it is unlikely that an AI would choose that path. According to Tim O'Reilly via Data Center Frontier, “we won't stop needing technology until we run out of problems”. As people continue to innovate and complicate their personal and professional lives, more simple tasks are being automated behind the scenes. As long as machines can complete these tasks for us, their use is inevitable. For now, it seems that Hollywood fears won't stop AI research and development. At the end of the day, understanding is our most effective weapon against the fear that accompanies ignorance. Source: Data Center Frontier
","From autocorrect to Google Maps, AI has already started picking up some of the slack. AI revolutionizes industries and our daily lives.","AI Revolutionizes Industries, not World Domination",Live,200 558,"DATALAYER CONFERENCE: KEYNOTE WITH MITCH PIRTLE, CAPITALONE Published Oct 25, 2016. This week we're introducing our first talk from DataLayer Conf, the Keynote with Mitch Pirtle from CapitalOne. If you've been in the open source world for a while, chances are you've seen Mitch around the Joomla or Postgres community, among countless others. So what's DataLayer? DataLayer is a Compose-sponsored conference that we held last month. It was great. We had speakers from CapitalOne, GitHub, Artsy, Meteor, Princeton, ZenDesk and more. Over the next several weeks, we're going to share the videos of the presentations from the conference so those of you who were unable to attend can still benefit. In this Keynote, Mitch discusses the current state of the data layer for enterprise, focusing on the polyglot experience. We are, according to Mitch, in a world of constantly changing and evolving technology stacks; and this means an ever-changing roster of languages and platforms that access data (including even the very ways in which we store and access data from these apps). Where do databases fit in this rapidly expanding picture of 'one tool, one task', especially at the scale of petabytes and zettabytes? -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.
","This week we're introducing our first talk from DataLayer Conf, the Keynote with Mitch Pirtle from CapitalOne.","DataLayer Conference: Keynote with Mitch Pirtle, CapitalOne",Live,201 559,"Margriet Groenendijk, Developer Advocate | IBM Watson Data Platform | Data Science | Climate and Weather | Geography. Aug 29, 2016 -------------------------------------------------------------------------------- ANALYZE OPEN DATA SETS USING PANDAS IN A PYTHON NOTEBOOK Open data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers of open data. They typically share things like environmental, economic, census, and health data sets. You can learn more about open data from The Open Data Institute or from Wikipedia. Two great places to start browsing are data.gov and data.gov.uk, where you can find all sorts of data sets. Other good sources are the World Bank, the FAO, Eurostat and the Bureau of Labor Statistics. If you're interested in a specific country or region, just do a quick Google search, and you'll likely uncover other sources as well. Open data can be a powerful analysis tool, especially when you connect multiple data sets to derive new insights. This tutorial features a notebook that helps you get started with analysis using pandas. Pandas is one of my favorite data analysis packages. It's very flexible and includes tools that make it easy to load, index, classify, and group data. In this tutorial, you will learn how to work with a DataFrame in 2 basic steps: 1. Load data from open data sets into a Python notebook in Data Science Experience. 2. Work with a Python notebook on Data Science Experience (join data frames, clean, check, and analyze the data using simple statistical tools). DATA & ANALYTICS ON DATA SCIENCE EXPERIENCE Data Science Experience features a selection of open data sets that you can download and use any way you want. It's easy to get an account, start a notebook, and grab some data: 1. Sign in to Data Science Experience (or sign up for a free trial). 2. Open the sample notebook called Analyze open data sets with pandas DataFrames. To open the sample notebook, click here (or type its name in the Search field on the home page of Data Science Experience and select the card for the notebook), then click the button on the top of the preview page that opens. Select a project and Spark service and click Create Notebook. The sample notebook opens for you to work with. 3. Find the first data set and get its access key URL. 4. From the Data Science Experience home page, search for “life expectancy”. 5.
Click the card with the title Life expectancy at birth by country in total years . 6. Click the Manage Access Keys button. 7. Click Request a New Access Key . 8. Copy the access key URL, and click Close . You’ll use this link in a minute to load data into the Python notebook. Tip: If you don’t want to run the commands yourself, you can also just open the notebook in your browser and follow along: https://apsportal.ibm.com/exchange/public/entry/view/47ed96c50374ccd15f93ef262c1af63bLOAD DATA INTO A DATAFRAME Paste the access key URL you copied from the Life Expectancy data set into the following code (replacing the string). Then run the following code to load the data in a data frame. This code keeps 3 columns and renames them. import pandas as pd import numpy as np # life expectancy at birth in years life = pd.read_csv("""",usecols=['Country or Area','Year','Value']) life.columns = ['country','year','life'] life.head() Life expectancy figures might be more meaningful if we combine them with other open data sets from Data Science Experience. Let’s start by loading the data set Total Population by country. To do so, find the data set on the DSX home page, request an access key for it, and replace with your access key URL in the following code. Then run the code. # population population = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) population.columns = ['country', 'year','population'] print ""Nr of countries in life:"", np.size(np.unique(life['country'])) print ""Nr of countries in population:"", np.size(np.unique(population['country'])) Nr of countries in life: 246 Nr of countries in population: 277 JOINING DATA FRAMES These two data sets don’t fit together perfectly. For instance, one lists more countries than the other. When we join the two data frames we’re sure to introduce nulls or NaNs into the new data frame. We’ll use the pandas merge function to handle this problem. This function includes many options . In the following code, how='outer' makes sure we keep all data from life and population . on=['country','year'] specifies which columns to perform the merge on. df = pd.merge(life, population, how='outer', sort=True, on=['country','year']) df[400:405] We can add more data to the data frame in a similar way. 
For each data set in the following list, find the data set on the DSX home page, request an access key URL, and copy the the URL into the code (again replacing the string with the corresponding access key URL): * Population below national poverty line, total, percentage * Primary school completion rate % of relevant age group by country * Total employment, by economic activity (Thousands) * Births attended by skilled health staff (% of total) by country * Measles immunization % children 12–23 months by country # poverty (%) poverty = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) poverty.columns = ['country', 'year','poverty'] df = pd.merge(df, poverty, how='outer', sort=True, on=['country','year']) # school completion (%) school = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) school.columns = ['country', 'year','school'] df = pd.merge(df, school, how='outer', sort=True, on=['country','year']) # employment employmentin = pd.read_csv("""",usecols=['Country or Area','Year','Value','Sex','Subclassification']) employment = employmentin.loc[(employmentin.Sex=='Total men and women') & (employmentin.Subclassification=='Total.')] employment = employment.drop('Sex', 1) employment = employment.drop('Subclassification', 1) employment.columns = ['country', 'year','employment'] df = pd.merge(df, employment, how='outer', sort=True, on=['country','year']) # births attended by skilled staff (%) births = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) births.columns = ['country', 'year','births'] df = pd.merge(df, births, how='outer', sort=True, on=['country','year']) # measles immunization (%) measles = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) measles.columns = ['country', 'year','measles'] df = pd.merge(df, measles, how='outer', sort=True, on=['country','year']) df.head() The resulting table looks kind of strange, as it contains incorrect values, like numbers in the country column and text in the year column. You can manually remove these errors from the data frame. Also, we can now create a multi-index with country and year. df2=df.drop(df.index[0:40]) df2 = df2.set_index(['country','year']) df2.head(10) If you are curious about other variables, you can keep adding data sets from Data Science Experience to this data frame. Be aware that not all data is equally formatted and might need some clean-up before you add it. Use the code samples you just read about, and make sure you keep checking results with a quick look at each of your tables when you load or change them with commands like df2.head() . CHECK THE DATA You can run a first check of the data with describe() , which calculates some basic statistics for each of the columns in the dataframe. It gives you the number of values (count), the mean , the standard deviation (std), the min and max, and some percentiles . df2.describe() DATA ANALYSIS At this point, we have enough sample data to work with. Let’s start by finding the correlation between different variables. First we’ll create a scatter plot, and relate the values for two variables of each row. In our code, we also customize the look by defining the font and figure size and colors of the points with matplotlib. 
import matplotlib.pyplot as plt %matplotlib inline plt.rcParams['font.size']=11 plt.rcParams['figure.figsize']=[8.0, 3.5] fig, axes=plt.subplots(nrows=1, ncols=2) df2.plot(kind='scatter', x='life', y='population', ax=axes[0], color='Blue') df2.plot(kind='scatter', x='life', y='school', ax=axes[1], color='Red') plt.tight_layout() The figure on the left shows that increased life expectancy leads to higher population. The figure on the right shows that the life expectancy increases with the percentage of school completion. But the percentage ranges from 0 to 200, which is odd for a percentage. You can remove the outliers by setting values outside a specified range to NaN, for example df2[df2.school>100]=float('NaN') . Even better would be to check where these values in the original data came from. In some cases, a range like this could indicate an error in your code somewhere. In this case, the values are correct; see the description of the school completion data. We don’t have data for all the exact same years, so we’ll group by country (be aware that we lose some information by doing so). Also, because the variables are percentages, we’ll convert our employment figures to percent. We probably no longer need the population column, so let's drop it. Then we create scatter plots from the data frame using scatter_matrix , which creates plots for all variables and also adds a histogram for each. from pandas.tools.plotting import scatter_matrix # group by country grouped = df2.groupby(level=0) dfgroup = grouped.mean() # employment in % of total population dfgroup['employment']=(dfgroup['employment']*1000.)/dfgroup['population']*100 dfgroup=dfgroup.drop('population',1) scatter_matrix(dfgroup,figsize=(12, 12), diagonal='kde') You can see that the data is now in a pretty good state. There are no large outliers. We can even start to see some relationships: life expectancy increases with schooling, employment, safe births, and measles vaccination. You are deriving insights from the data and can now build a statistical model — for instance, have a look at an ordinary least squares regression ( OLS ) from StatsModels . SUMMARY In this tutorial, you learned how to use open data from Data Science Experience in a Python notebook. You saw how to load, clean and explore data using pandas. As you can see from this example, data analysis entails lots of trial and error. This experimentation can be challenging, but is also a lot of fun! -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on August 30, 2016.","Open data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers…",Analyze open data sets using pandas in a Python notebook,Live,202 563,"Glynn Bird Developer Advocate @ IBM Watson Data Platform. Views are my own etc.
Aug 21 -------------------------------------------------------------------------------- SERVERLESS AUTOCOMPLETE WHICH WAY OF DEPLOYING AN AUTOCOMPLETE SERVICE IS RIGHT FOR YOU? Autocomplete is everywhere. As you type into a web form, the page offers you completions that match. An autocomplete service consists of three components: 1. The data, a list of strings defining the accepted values for a web form. 2. The code that navigates the list, comparing the typed input with accepted completions. 3. The front end that renders the matches under the web form, allowing an answer to be picked. The three basic components of any autocomplete service: data, code, and a front end.Now, I’ll consider three ways you might deploy autocomplete in practice. CLIENT-SIDE AUTOCOMPLETE When the data size is small (say, less than 100 options), then it makes sense to bundle the data and the code into the front end itself. When the web page is loaded, the entire list of options arrives at the web browser and the autocomplete logic executes locally. Client-side autocomplete: bundle it all in the browser.This approach gives you the fastest performance, but it’s only suitable for small data sets. CLIENT-SERVER AUTOCOMPLETE For larger data sizes, it becomes impractical to have the web page download the entire data set. Instead, the web page makes an HTTP call to a server-side process which queries a database and returns the matching answers: Client-server autocomplete: a server-side process and a database server work in tandem to get the browser its data.For this process to be quick enough to respond as a user types, the server needs to be geographically close to the client and connected to a fast database — typically an in-memory store like Redis . I wrote last year about a Simple Autocomplete Service that creates multiple autocomplete API microservices for you using Bluemix and Redis. But there is a third way. SERVERLESS AUTOCOMPLETE Instead of deploying server-side code that runs 24x7 waiting for autocomplete requests to arrive, “serverless” platforms like Apache OpenWhisk™ allow your micoservices to be deployed on a pay-as-you-go basis. The more computing capacity you use, the more you pay: from zero to lots. With this minimalist approach, your autocomplete service can bundle both the data and the code into an OpenWhisk “action”, so you don’t need to have a separate database: Serverless autocomplete: The browser’s request triggers a new serverless function with bundled data. The code runs on-demand, with no servers to maintain.Bundling the data and the code in the same serverless package makes for faster performance with fewer moving parts and automatic scaling. BUILDING A SERVERLESS AUTOCOMPLETE SERVICE Removing the database from your application architecture means you’ll have to implement the indexing and lookup functions yourself. To reduce the repetition, I’ve built a utility that builds an OpenWhisk action for you. First, you’ll need Node.js and npm installed on your machine, together with the OpenWhisk wsk utility paired with your IBM Bluemix account. Then simply install the serverless-autocomplete package: npm install -g serverless-autocomplete Take a text file of strings that you want to use and run acsetup with the path of the file: acsetup names.txt The acsetup command configures an autocomplete service for you, and provides usage examples.The acsetup utility indexes your data, bundles it with an OpenWhisk autocomplete algorithm written in Node.js, and sends it to your OpenWhisk account. 
Here's what it returns: * The URL of your service. * An example curl statement. * An HTML snippet that you can use in your own web page — simply save it as an HTML file and open it in your web browser. You can create as many autocomplete services as you like: acsetup uspresidents.txt acsetup soccerplayers.txt acsetup gameofthrones.txt Each service will have its own URL and is ready to use immediately. If you need to change the data, simply update the text file and re-run acsetup . Happy searching!","Autocomplete is everywhere. As you type into a web form, the page offers you completions that match. An autocomplete service consists of three components: When the data size is small (say, less than…",Serverless Autocomplete – IBM Watson Data Lab – Medium,Live,203 565,"DATALAYER EXPOSED: CHARITY MAJORS ON OBSERVABILITY & THE GLORIOUS FUTURE Published Jun 5, 2017 We're bringing you video of all the sessions from this year's DataLayer conference, starting with the opening keynote from Charity Majors on Observability. Dive in now and start your own virtual DataLayer. Last month, Compose descended on Austin to host the second annual DataLayer Conference, a conference devoted to the space where apps meet data. The talks were unbelievable so we decided they needed to be shared with the world, not just those who were able to join us in Texas. We kicked the conference off with our morning keynote address with Charity Majors. Charity gave us a look at the complexity of today's infrastructure, from distributed systems and microservices, to automation and orchestration, to containers, schedulers, and persistence layers, and discussed what the next generation of observability needs to look like. Moving forward, it's going to be important to engineer your systems to be understandable, explorable, and self-explanatory. To do that, you're going to need tooling that manages that complexity, and some of the ways things have traditionally been done are just going to have to go. It was a great start to the day and she gave those of us in the audience quite a bit to think about. Watch her talk and let us know what you think using the hashtag #DataLayerConf. Be sure to check back every Monday for the next installment of DataLayer. -------------------------------------------------------------------------------- We're in the planning stages for DataLayer 2018 right now so, if you have an idea for a talk, start fleshing that out. We'll have a CFP, followed by a blind submission review, and then select our speakers, who we'll fly to DataLayer to present. Sounds fun, right? Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article?
Head over to Thom Crowe ’s author page and keep reading.RELATED ARTICLES May 15, 2017DATALAYER DESCENDS ON AUSTIN After months of planning, it's finally here: DataLayer Conference. On Wednesday, we're hosting our second annual conference f… Thom Crowe Apr 13, 2017GETTING THE BEST CONFERENCE SPEAKERS WITH BLIND SUBMISSIONS When we decided to launch our conference last year, we knew we wanted the best speakers and topics. Here's how we ensured we… Thom Crowe Mar 23, 2017ANNOUNCING DATALAYER CONF 2017'S ALL-STAR LINEUP DataLayer Conf 2017 is coming to Austin, Texas on May 17th to the Alamo Draft House on Lamar and we couldn't be more excited.… Thom Crowe Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","We're bringing you video of all the sessions from this year's DataLayer conference, starting with the opening keynote from Charity Majors on Observability.",Charity Majors on Observability & The Glorious Future,Live,204 568,"ERIC JANG Technology, A.I., Careers SUNDAY, AUGUST 7, 2016 A BEGINNER'S GUIDE TO VARIATIONAL METHODS: MEAN-FIELD APPROXIMATION Variational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function). This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques). This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound , is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post). TABLE OF CONTENTS 1. Preliminaries and Notation 2. Problem formulation 3. Variational Lower Bound for Mean-field Approximation 4. Forward KL vs. Reverse KL 5. Connections to Deep Learning PRELIMINARIES AND NOTATION This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here's a refresher if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well, so it's helpful to be really precise with notation in this post: * Uppercase $X$ denotes a random variable * Uppercase $P(X)$ denotes the probability distribution over that variable * Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process. * Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$. * $p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at a particular value $x$. Many academic papers use the terms ""variables"", ""distributions"", ""densities"", and even ""models"" interchangeably. 
This is not necessarily wrong per se, since $X$, $P(X)$, and $p(X)$ all imply each other via a one-to-one correspondence. However, it's confusing to mix these words together because their types are different (it doesn't make sense to sample a function, nor does it make sense to integrate a distribution). We model systems as a collection of random variables, where some variables ($X$) are ""observable"", while other variables ($Z$) are ""hidden"". We can draw this relationship via the following graph: The edge drawn from $Z$ to $X$ relates the two variables together via the conditional distribution $P(X|Z)$. Here's a more concrete example: $X$ might represent the ""raw pixel values of an image"", while $Z$ is a binary variable such that $Z=1$ ""if $X$ is an image of a cat"". $X = $ $P(Z=1) = 1$ (definitely a cat) $X= $ $P(Z=1) = 0$ (definitely not a cat) $X = $ $P(Z=1) = 0.1$ (sort of cat-like) Bayes' Theorem gives us a general relationship between any pair of random variables: $$p(Z|X) = \frac{p(X|Z)p(Z)}{p(X)}$$ The various pieces of this are associated with common names: $p(Z|X)$ is the posterior probability : ""given the image, what is the probability that this is of a cat?"" If we can sample from $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not. $p(X|Z)$ is the likelihood : ""given a value of $Z$ this computes how ""probable"" this image $X$ is under that category ({""is-a-cat"" / ""is-not-a-cat""}). If we can sample from $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [1] , [2] . $p(Z)$ is the prior probability . This captures any prior information we know about $Z$ - for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$. HIDDEN VARIABLES AS PRIORS This is an aside for interested readers. Skip to the next section to continue with the tutorial. The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like. We can re-write Bayes' Theorem by swapping the terms: $$\frac{p(Z|X)p(X)}{p(Z)} = p(X|Z)$$ The ""posterior"" in question is now $P(X|Z)$. Hidden variables can be interpreted from a Bayesian Statistics framework as prior beliefs attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of the Gaussian distribution. The distribution over parameters $P(Z)$ is then a prior distribution to $P(X)$. You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be ""mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$"". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly. You can even ""add"" variables to your system. The prior itself might be dependent on other random variables via $P(Z|\theta)$, which have prior distributions of their own $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, it's priors all the way down . 
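Before moving on to the problem formulation, here is Bayes' Theorem from the cat example evaluated numerically, using the prior $p(Z=1)=\frac{1}{3}$ stated above; the likelihood values 0.8 and 0.1 for a particular image $x$ are assumed purely for illustration:
$$p(Z=1|X=x) = \frac{p(x|Z=1)\,p(Z=1)}{p(x|Z=1)\,p(Z=1) + p(x|Z=0)\,p(Z=0)} = \frac{0.8 \cdot \frac{1}{3}}{0.8 \cdot \frac{1}{3} + 0.1 \cdot \frac{2}{3}} = 0.8$$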
PROBLEM FORMULATION The key problem we are interested in is posterior inference , or computing functions on the hidden variable $Z$. Some canonical examples of posterior inference: * Given this surveillance footage $X$, did the suspect show up in it? * Given this twitter feed $X$, is the author depressed? * Given historical stock prices $X_{1:t-1}$, what will $X_t$ be? We usually assume that we know how to compute functions on likelihood function $P(X|Z)$ and priors $P(Z)$. The problem is, for complicated tasks like above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like MCMC , but these are slow to converge. VARIATIONAL LOWER BOUND FOR MEAN-FIELD APPROXIMATION The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible. This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization. What does it mean for distributions to be ""close""? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence to as the distance metric between two distributions. $$KL(Q_\phi(Z|X)||P(Z|X)) = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z|x)}}$$ Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log(2)}$ bits) required to ""distort"" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$. By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original $KL$ expression, and then distribute: $$ \begin{align} KL(Q||P) & = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)p(x)}{p(z,x)}} && \text{(1)} \\ & = \sum_{z \in Z}{q_\phi(z|x)\big(\log{\frac{q_\phi(z|x)}{p(z,x)}} + \log{p(x)}\big)} \\ & = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\sum_{z}{\log{p(x)}q_\phi(z|x)}\Big) \\ & = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\log{p(x)}\sum_{z}{q_\phi(z|x)}\Big) && \text{note: $\sum_{z}{q(z)} = 1 $} \\ & = \log{p(x)} + \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) \\ \end{align} $$ To minimize $KL(Q||P)$ with respect to variational parameters $\phi$, we just have to minimize $\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}$, since $\log{p(x)}$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$. 
$$ \begin{align} \sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} & = \mathbb{E}_{z \sim Q_\phi(Z|X)}\big[\log{\frac{q_\phi(z|x)}{p(z,x)}}\big]\\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x,z)} \big] \\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - (\log{p(x|z)} + \log(p(z))) \big] && \text{(via $\log{p(x,z)=p(x|z)p(z)}$) }\\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x|z)} - \log(p(z))) \big] \\ \end{align} \\ $$ Minimizing this is equivalent to maximizing the negation of this function: $$ \begin{align} \text{maximize } \mathcal{L} & = -\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} \\ & = \mathbb{E}_Q\big[ -\log{q_\phi(z|x)} + \log{p(x|z)} + \log(p(z))) \big] \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] && \text{(2)} \\ \end{align} $$ In literature, $\mathcal{L}$ is known as the variational lower bound , and is computationally tractable if we can evaluate $p(x|z), p(z), q(z|x)$. We can further re-arrange terms in a way that yields an intuitive formula: $$ \begin{align*} \mathcal{L} & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} \big] + \sum_{Q}{q(z|x)\log{\frac{p(z)}{ q_\phi(z|x)}}} && \text{Definition of expectation} \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) && \text{Definition of KL divergence} && \text{(3)} \end{align*} $$ If sampling $z \sim Q(Z|X)$ is an ""encoding"" process that converts an observation $x$ to latent code $z$, then sampling $x \sim Q(X|Z)$ is a ""decoding"" process that reconstructs the observation from $z$. It follows that $\mathcal{L}$ is the sum of the expected ""decoding"" likelihood (how good our variational distribution can decode a sample of $Z$ back to a sample of $X$), plus the KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then prior $Z$ is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1. Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have: $$ \begin{align*} KL(Q||P) & = \log p(x) - \mathcal{L} \\ \log p(x) & = \mathcal{L} + KL(Q||P) && \text{(4)} \end{align*} $$ The meaning of Eq. (4), in plain language, is that $p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$, plus an error term $KL(Q||P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$. Since $KL(Q||P) \geq 0$, $\log p(x)$ must be greater than $\mathcal{L}$. Therefore $\mathcal{L}$ is a lower bound for $\log p(x)$. $\mathcal{L}$ is also referred to as evidence lower bound (ELBO), via the alternate formulation: $$ \mathcal{L} = \log p(x) - KL(Q(Z|X)||P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) $$ Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in $\log p(x)$. FORWARD KL VS. REVERSE KL KL divergence is not a symmetric distance function, i.e. $KL(P||Q) \neq KL(Q||P)$ (except when $Q \equiv P$) The first is known as the ""forward KL"", while the latter is ""reverse KL"". So why do we use Reverse KL? This is because the resulting derivation would require us to know how to compute $p(Z|X)$, which is what we'd like to do in the first place. I really like Kevin Murphy's explanation in the PML textbook , which I shall attempt to re-phrase here: Let's consider the forward-KL first. 
As we saw from the above derivations, we can write KL as the expectation of a ""penalty"" function $\log \frac{p(z)}{q(z)}$ over a weighing function $p(z)$. $$ \begin{align*} KL(P||Q) & = \sum_z p(z) \log \frac{p(z)}{q(z)} \\ & = \mathbb{E}_{p(z)}{\big[\log \frac{p(z)}{q(z)}\big]}\\ \end{align*} $$ The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$. This means that the forward-KL will be large wherever $Q(Z)$ fails to ""cover up"" $P(Z)$. Therefore, the forward-KL is minimized when we ensure that $q(z) > 0$ wherever $p(z)> 0$. The optimized variational distribution $Q(Z)$ is known as ""zero-avoiding"" (density avoids zero when $p(Z)$ is zero). Minimizing the Reverse-KL has exactly the opposite behavior: $$ \begin{align*} KL(Q||P) & = \sum_z q(z) \log \frac{q(z)}{p(z)} \\ & = \mathbb{E}_{p(z)}{\big[\log \frac{q(z)}{p(z)}\big]} \end{align*} $$ If $p(Z) = 0$, we must ensure that the weighting function $q(Z) = 0$ wherever denominator $p(Z) = 0$, otherwise the KL blows up. This is known as ""zero-forcing"": So in summary, minimizing forward-KL ""stretches"" your variational distribution $Q(Z)$ to cover over the entire $P(Z)$ like a tarp, while minimizing reverse-KL ""squeezes"" the $Q(Z)$ under $P(Z)$. It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$). CONNECTIONS TO DEEP LEARNING Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler: 1. Deep learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data. 2. Variational Bayes give us a framework with which we can re-write statistical inference problems as optimization problems. Combining Deep learning and VB Methods allow us to perform inference on extremely complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower-bound derived in this post! Thanks for reading, and stay tuned! Posted by Eric at 11:50 PM Email This BlogThis! Share to Twitter Share to Facebook Share to Pinterest Labels: AI , Statistics17 COMMENTS: 1. Incognito August 8, 2016 at 11:34 AMThere should be a minus in equation (3) for E[log p(x|z)] i.e. E[ -log p(x|z)] otherwise your definition of KL-divergence isn't consistent. Ankur. Reply Delete Replies 1. Eric August 8, 2016 at 2:15 PMThanks for your sharp eyes! I added the minus in front of the KL term. Delete angusturner27 June 4, 2017 at 3:42 AMDo you mind explaining where that negative comes from? I was anticipating a plus... Delete 2. Reply 2. Vladislavs Dovgalecs August 8, 2016 at 11:58 PMThanks for the great post, Eric! Do you plan (or have a link to) to write a simple tutorial to illustrate the VB in practice? Reply Delete 3. John Barness August 10, 2016 at 5:42 AMThe post is worth reading. Reply Delete 4. Fahim Lee August 11, 2016 at 5:04 AMThis comment has been removed by a blog administrator. Reply Delete 5. Emery Goossens August 12, 2016 at 11:06 AMThis tutorial is fantastic! I believe the phrase ""must be strictly greater than"" should omit ""strictly"" seeing as equality could hold according to your definition. 
Reply Delete Replies 1. Eric August 14, 2016 at 7:12 PMThat's correct! Thank you :) Delete 2. Reply 6. David N. Olson August 15, 2016 at 12:07 AMThis comment has been removed by a blog administrator. Reply Delete 7. skim October 18, 2016 at 1:01 PMGiven the title of your post, it's worth giving some motivation behind the name ""mean-field approximation"". From a statistical physics point of view, ""mean-field"" refers to the relaxation of a difficult optimization problem to a simpler one which ignores second-order effects. For example, in the context of graphical models, one can approximate the partition function of a Markov random field via maximization of the Gibbs free energy (i.e., log partition function minus relative entropy) over the set of product measures, which is significantly more tractable than global optimization over the space of all probability measures (see, e.g., M. Mezard and A. Montanari, Sect 4.4.2). From an algorithmic point of view, ""mean-field"" refers to the naive mean field algorithm for computing marginals of a Markov random field. Recall that the fixed points of the naive mean field algorithm are optimizers of the mean-field approximation to the Gibbs variational problem. This approach is ""mean"" in that it is the average/expectation/LLN version of the Gibbs sampler, hence ignoring second-order (stochastic) effects (see, e.g., M. Wainwright and M. Jordan, (2.14) and (2.15)). Reply Delete Replies 1. Eric November 6, 2016 at 10:47 PMI didn't know that! Thank you for sharing this. I hope that interested readers will scroll down and find your comment. Delete 2. Reply 8. Aafiya Designer February 24, 2017 at 2:42 PMOn the off chance that you have not exploited surveillance cameras to ensure your property, please consider to begin utilizing them. best home surveillance system Reply Delete 9. sutony April 25, 2017 at 10:09 PMI read a few blogs/articles/slides about variational autoencoders, and I personally think this is the best one. The key ideas are pointed out clearly. The technical terms(e.g., ELBO) are well explained, too. Thanks so much. Reply Delete 10. Magdiel Jiménez Guarneros May 3, 2017 at 4:04 PMHi, can you explain me the relation of the sum over q(z) equal to 1 in equation (1)?. Thanks, I don't catch it. Reply Delete Replies 1. SunFish7 May 7, 2017 at 2:52 AMProbabilities sum to 1. i.e. Given a probability distribution q over Z, summing q(z) over all possible z in Z must give 1. Delete 2. Reply 11. SunFish7 May 7, 2017 at 2:53 AMThanks for this, it is a key resource for our reading group discussion on VAE today https://github.com/p-i-/machinelearning-IRC-freenode/blob/master/ReadingGroup/README.md Reply Delete 12. mathnathan May 12, 2017 at 1:01 PMI believe the last formula for reverse KL should be an expectation over q, not over p. Great post. Thanks for your effort. Reply Delete Add comment Load more... Newer Post Older Post Home Subscribe to: Post Comments (Atom)BLOG ARCHIVE * ► 2017 (1) * ► January (1) * ▼ 2016 (11) * ► November (1) * ► September (2) * ▼ August (1) * A Beginner's Guide to Variational Methods: Mean-Fi... * ► July (3) * ► June (4) Not for reproduction. Simple theme. Powered by Blogger .",Variational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to r...,A Beginner's Guide to Variational Methods,Live,205 570,,Watch how to convert XML data to CSV format to load into dashDB. 
This video shows a tool called Convert XML to CSV found here: http://www.convertcsv.com/xml-to-csv.htm,Load XML data into dashDB,Live,206 574,"COMPOSE TIPS: DATES AND DATING IN MONGODB Published May 2, 2017 Working with dates in MongoDB can be surprisingly nuanced, and knowing how dates are stored can make avoiding pitfalls much easier. Read on as we examine the inner workings of MongoDB dates and show how to choose the right date type for your needs. At Compose Tips, we like to address the issues that can leave even experienced developers scratching their heads. We'll kick off this series by taking a look at how MongoDB stores dates. WHAT'S IN A DATE? Dates in MongoDB have a few different representations, and getting the right one can mean the difference between being able to effectively search your data by date range using aggregations and being forced to manage your dates on the client side of your application. Let's take a look at the different ways that MongoDB can store dates. Internally, MongoDB can store dates as either Strings or as 64-bit integers . If you intend to do any operations using the MongoDB query or aggregate functions, or if you want to index your data by date, you'll likely want to store your dates as integers. If you're using the built-in ""Date"" data type, or a date wrapped in the ISODate() function, you're also storing your date as an integer. If you're just looking for a simple way to display dates to a user and aren't concerned with performing operations on those dates, then storing the date as a String will allow you to use the output from a MongoDB query directly, without the need to convert a Date value into a String for display. This can be handy for platforms without a convenient or easy Date wrapper, or if you don't want to spend time processing the date on the client side of your application. Let's walk through each of the potential ways a Date can be represented in MongoDB and discuss the pros and cons of each. MILLISECONDS SINCE THE EPOCH One standard way that many databases store dates is as a count of milliseconds since the Epoch, with 0 representing January 1, 1970 at 00:00:00 GMT. This is how dates are stored internally in most programming languages. Storing dates as milliseconds since the Epoch makes comparing dates to each other a simple numeric comparison. Developers can also easily modify dates, including adding time frames (such as adding 1 day to a date) by computing the number of milliseconds in a day. While milliseconds are easy to manipulate programmatically, they're difficult for programmers to conceptualize. It's difficult to tell even what decade a date is in just by looking at the millisecond count, so it needs to be converted into a readable String before most developers will be able to display a date represented in this format. ISODATE() If you've tried to save a JavaScript Date object into MongoDB, you might've noticed that MongoDB automatically wrapped your date with a peculiar function: ISODate() . ISODate(""2012-12-19T06:01:17.171Z"") ISODate() is a helper function that's built into MongoDB and wraps the native JavaScript Date object. When you use the ISODate() constructor from the Mongo shell, it actually returns a JavaScript Date object. So why bother with ISODate() ? ISODate() provides a convenient way to represent a date in MongoDB as a String visually, while still allowing the full use of date queries and indexing.
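As a minimal sketch of what this looks like in the mongo shell (the events collection and its fields here are hypothetical):
// The shell displays the stored 64-bit date wrapped in ISODate().
db.events.insert({ name: 'signup', createdAt: new Date() });
// Because the value is a real date rather than a String, range queries, sorting, and indexing all work.
db.events.find({ createdAt: { $gte: new Date('2017-01-01') } }).sort({ createdAt: -1 });
db.events.createIndex({ createdAt: 1 });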
By wrapping the ISO date String in a function, the developer can inspect date objects quickly and visually without having to convert from a Unix timestamp to a time String. We can see this by comparing the following date using the ISODate() constructor: ISODate(""2012-12-19T06:01:17.171Z"") To the corresponding JavaScript Date constructor: Date(1355897837000) The ISODate() constructor is clearly easier to read at-a-glance for developers. One other major benefit is that, while there are many ways to represent dates, the ISODate() uses the standardized ISO format. Your clients don't have to do any guesswork to figure out what format they'll need to store dates in your system. Probably the biggest downside is that the ISODate() will convert your date to ISO format and, should you need a different date format on the client-side of your applications, you'll have to convert that date on the client side. This can be a concern when processing a lot of records that need to have the dates in a specific format, or when real-time processing of date stamps is a concern. STRING FORMAT The final format we'll take a look at is String format, which stores a date as a simple String in a human-readable format. Dates stored in String format are very easy to display and don't require any processing to use in visual displays. Also, assuming the date matches a standard format, the date stored in String format will be relatively easy to convert to a date on any platform. When a date is stored in String format, it can sometimes be difficult to determine what the actual format of the date is in the String. The following example illustrates this issue: does the following date represent the 1st of February 2017 or the 2nd of January 2017? 2017-02-01 This ambiguity can cause major issues if the date is parsed incorrectly. Developers using loosely-typed languages like JavaScript will sometimes accidentally store a time String rather than a date, so sometimes the presence of a String in the date field can indicate a logic error. WRAPPING UP MongoDB dates can initially cause some frustration for developers just starting out, and understanding the different ways that a date can be stored in MongoDB can help to ease that frustration. In a future article, we'll cover how to manipulate and compute with dates in MongoDB. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by Fabrizio Verrecchia John O'Connor is a code junky, educator, and amateur dad that loves letting the smoke out of gadgets, turning caffeine into code, and writing about it all. Love this article? Head over to John O'Connor ’s author page and keep reading.RELATED ARTICLES Apr 28, 2017NEWSBITS - MYSQL, ELASTICSEARCH, MONGODB, ETCD, COCKROACHDB, SQL SERVER, CRICKET AND JUICE NewBits for the week ending 28th April - MySQL 8.0.1's preview demos better replication, Elasticsearch, MongoDB and etcd get… Dj Walker-Morgan Apr 26, 2017FINDING DUPLICATE DOCUMENTS IN MONGODB Need to find duplicate documents in your MongoDB database? This article will show you how to find duplicate documents in your… Abdullah Alger Apr 25, 2017HORIZONTAL SCALING ARRIVES ON COMPOSE ENTERPRISE Today, Compose is bringing horizontal scaling to more databases on our Enterprise platform. 
MongoDB, Elasticsearch and Scylla… Jason McCay Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","Working with dates in MongoDB can be surprisingly nuanced. Here, we examine the inner workings of MongoDB dates and show how to choose the right date type for your needs.",Compose Tips: Dates and Dating in MongoDB,Live,207 575,"UNDERSTANDING MANGO VIEW-BASED INDEXES VS. SEARCH-BASED INDEXESBy Tony SunNovember 18, 2015MANGO: DECLARATIVE QUERYING FOR APACHE COUCHDB™Mango allows users to declaratively define and query Apache CouchDB indexes.(It's also the open source library that powers Cloudant Query.) For anintroduction to its features, please refer to this post: https://cloudant.com/blog/introducing-cloudant-query/Recently, Cloudant open-sourced its Apache Lucene™-based full-text searchcapabilities for CouchDB as well: https://cloudant.com/blog/open-sourcing-cloudant-search/Mango leverages Lucene not only to perform text search, but also to enablead-hoc querying capabilities: https://cloudant.com/blog/cloudant-query-grows-up-to-handle-ad-hoc-queries/Users can now use either the original CouchDB view-based indexes or the newsearch-based indexes. In this post, I'll compare the two index types to giveusers an idea of when to use each (""json"" or view-based vs. ""text"" orsearch-based).WHEN JSON SYNTAX GETS TRICKYView-based indexes are most efficient for large datasets, but containlimitations on how a user could query an index. For example, given an indexdefined as:{ ""index"": { ""fields"": [""foo"", ""bar""] }, ""name"" : ""foo-index"", ""type"" : ""json""}The following query would fail:{ ""selector"": {""$or"": [{""foo"": ""val1""}, {""bar"": ""val2""}]}}{""error"":""no_usable_index"",""reason"":""There is no index available for this selector.""}To understand the limitation above, users must realize that the underlying indexis still a CouchDB view-based index. The values of the fields are used to compose the keys in theindex. When performing a query, a selector is transformed into a start_key and end_key range search against the index.To satisfy the $or query above, Mango would have to scan the index twice, once for ""foo"" , then another for ""bar"" and perform merging logic. This can get extremely complicated as queries becomemore complex.In order to bypass this limitation, users need to add a ""sub-query"" that willallow the query engine to scan the index once and return results. The rest ofthe query will then be used as an in-memory filter.To continue with the example above, the query must become:{ ""selector"": {""_id"": {""$gt"" : null}, ""$or"": [{""company"": ""x""}, {""twitter"": ""ba""}]}}The above query essentially does a full index scan to return all the documentsand then applies the rest of the $or query as a filter on those documents.CLEANER SYNTAX WITH SEARCH INDEX TYPESMango search-based indexes resolve this issue by using Lucene indexes. To seehow to create those indexes, refer to the cloudant-query-grows-up blog linked to above. Users no longer have to add these ""sub-queries"" toperform operations such as $or , $in , or $elemMatch .A user might be tempted to always use search-based indexes due to their ad-hocquery capabilities. 
Mango's view-based indexes, however, will perform better inscenarios where query patterns are well known and the user is already familiarwith their data model. Imagine these view-based indexes as a more pleasantabstraction of the traditional CouchDB Map-Reduce view indexing system.If users don't know in advance what queries will be executed, then Mangosearch-based indexes are the way to go. This flexibility, however, comes withits own tradeoffs.Underneath the covers, Mango search indexes create a single default field thatcatalogs every field in the document. Moreover, individual elements in an arrayare also indexed and enumerated in this field. This comprehensive approachallows the user to perform a full-text search via the $text operator. The behavior is turned on automatically when the user creates asearch-based index. So for large databases, index build times can be long. Thesystem provides an option to disable search-based index builds, but disabling italso turns off the full-text search feature.Users who want full ad-hoc capabilities can index the entire database withsearch-based indexes. It's important to note that this approach is differentthan the default field mentioned above. The default field is a single field thathas all the values in the document stored in that one field. When a user indexesthe entire database, all the fields in the document will have their ownrespective values stored in the index. Again, indexing the entire database willcreate long index build times.Given a document such as:{ ""first_name"" : ""john"", ""last_name"" : ""doe""}... with BOTH text search enabled AND the entire database indexed will have anindex that has:""default_field"" - ""john"", ""doe""""first_name"" - ""john""""last_name"" - ""doe""Users who don't want to index their entire database can specify fieldsindividually. Suppose a user only wants to index ""first_name"" . Then the index would look like:""default_field"" - ""john"", ""doe""""first_name"" - ""john""A query that searches for ""last_name"" would then throw an ""index not found"" error.Finally, a user can turn off ""default_field"" and only index specific fields:""first_name"" - ""john""But this would limit the ad-hoc capabilities of search-based indexes, and theuser should use a view-based index instead.ARRAYSArrays can also be confusing for first-time users of Mango. Subtle differencesalso exist for arrays when using view-based indexes vs. search-based indexes.ARRAYS: JSONCurrently, view-based indexes cannot index individual array elements with onefield definition. Given an array such as:""array_field"": [10, 20, 30]If the view-based index is defined as:{ ""index"": { ""fields"": [""array_field""] }, ""name"" : ""array-index"", ""type"" : ""json""}Users can query against the index to match the array exactly:{ ""selector"": {""array_field"" : [10, 20, 30]}}However, the user cannot access an individual array element. Note that mangouses dot-notation to access the individual elements, i.e., my_array.0 , my_array.1 , etc.In the example above, if a user tried:{ ""selector"": {""array_field.0"" : 10}}... 
then the user would get: {""error"":""no_usable_index"",""reason"":""There is no index available for this selector.""} Users would have to specifically index each element in the array to access the individual elements. For example: { ""index"": { ""fields"": [""array_field.0"", ""array_field.1"", ""array_field.2""] }, ""name"" : ""array-index"", ""type"" : ""json""} However, if a user did not specify individual elements — and indexed the array as a whole — he or she can still perform operations such as $in on the array. For example: {""selector"": {""_id"": {""$gt"": null},""array_field"": {""$in"": [10]}}} The reason this works is, again, because Mango performs the above $in operation as a filtering mechanism against all the documents. As we saw in the conclusion of the previous section on JSON syntax, the performance tradeoff with the query above is that it, essentially, performs a full index scan and then applies a filter. ARRAYS: TEXT With Mango search-based indexes, the user can query the index however he or she likes with one index definition: { ""index"": { ""fields"": [{""name"": ""array_field.[]"", ""type"": ""number""}] }, ""name"" : ""array-index"", ""type"" : ""text""} This not only indexes the entire array, but also individual elements in the array. Users can then ad-hoc query the array. WHAT'S NEXT? Hopefully this post helps clarify Mango view-based indexes vs. search-based indexes. In order to enable search-based indexes, currently, users must first enable text search in their CouchDB distribution. For instructions on recompiling the current release of CouchDB to use the new search features, read this article by fellow Apache CouchDB project committer Robert Kowalski: https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/ If recompiling seems like too much work, don't worry! Lucene text search, along with Mango's declarative query system, will be included in the upcoming 2.0 release of Apache CouchDB. For updates, follow the project on Twitter at: https://twitter.com/couchdb ... or join one of the many excellent mailing lists: http://couchdb.apache.org/#mailing-lists © ""Apache"", ""CouchDB"", ""Lucene"", ""Apache CouchDB"", ""Apache Lucene"", and the CouchDB and Lucene logos are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.","Users can now use either the original CouchDB view-based indexes or the new search-based indexes to query Cloudant and CouchDB. In this post, I'll compare the two index types to give users an idea of when to use each (""json"" or view-based vs. ""text"" or search-based).",Understanding Mango View-Based Indexes vs.
Search-Based Indexes in Cloudant and CouchDB,Live,208 580,"This video goes through a demo/workshop on how to build a Java EE app that uses Cloudant and Watson to suggest employee recommendations. The app also uses jQuery, Angular, and Bootstrap on the frontend. This app was internally developed by IBM employees at a 48-hour hackathon. The source code is available for the app at http://ibm.biz/talent-manager. The complete source code is available for the app at http://ibm.biz/talent-manager-complete (don't cheat!). For feedback please contact @jsloyer on Twitter (http://twitter.com/jsloyer). Use Case Behind the App: Meet Ivy. She's a talent manager at a growing tech startup. She's having trouble finding the right candidate based on: * technical skills * personal compatibility I wish I could clone my developer, Emory Wren -- having two guys like Emory working here would be amazing. But that's not possible. So what's the next best thing? Talent Hotspot: a web application that allows you to search for candidates from a pool of applicants based on how closely they resemble one of your current employees. Talent Hotspot uses Watson's User Modeling API service to analyze a potential candidate's personality based on their answers to a questionnaire (completed upon application to Ivy's company). The application can issue queries such as ""Find me a Developer like Craig Smith"", then search through all possible candidates and return a ranked list of candidates sorted by highest-to-lowest percentage of personality resemblance. From here, searches can be refined by including technical skills: ""Find me a Developer like Craig Smith, who knows Java, C and Python""","This video goes through a demo/workshop on how to build a Java EE app that uses Cloudant and Watson to suggest employee recommendations. The app also uses jQuery, Angular, and Bootstrap on the frontend. This app was internally developed by IBM employees at a 48-hour hackathon.",Building a Java EE webapp on IBM Bluemix Using Watson and Cloudant,Live,209 582,"THIS WEEK IN DATA SCIENCE (FEBRUARY 14, 2017) Posted on February 14, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * IBM’s Watson and Bluemix step up: Free cloud-based push to boost tech skills – IBM aims to develop the tech skills of millions of young Africans for free using Watson AI and Bluemix cloud. * Job trends for R and Python – A look at the recent job trends for R, Python and SAS. * IBM launches cognitive computing hardware unit: Enter the Watson, Power 9 stack – IBM uses research and application to speed up training for Watson, neural networks and machine learning. * – A current list of some useful APIs for Machine Learning and Prediction.
* New IoT Cybersecurity Alliance formed by AT&T, IBM, others – IBM partners to form alliance to address concerns around IoT and solve its security challenges. * 6 ways Business Intelligence is going to change in 2017 – How businesses will utilize data and its advantages. * ​IoT devices will outnumber the world’s population this year for the first time – A prediction on the growth in number of IoT devices for the next three years. * What is a Data Scientist? – A broad definition for the term Data Scientist. * IBM’s big data meetup program approaches a significant milestone – IBM’s community of big data developers approaches 100,000 members. * 5 Career Paths in Big Data and Data Science, Explained – Resources to sharpen skills required for 5 different paths in Data Science and Analytics. * 6 Top Big Data and Data Science Trends 2017 – Predictions about Big Data and Data Science as the world becomes more dependent on Data. * Top R Packages for Machine Learning – A ranking of the top Machine Learning Packages for R. * Understanding data ownership in the data lake – Answers to questions dealing with the ownership of data and the importance of these questions. * Infographic: The 4 Types of Data Science Problems Companies Face. – The difficulties surrounding solutions to Data Science Problems. * Making the Most of Big Data Requires Effective Training in Data Science – A discussion of the type of training required to create effective data scientists. * City enlists IBM’s Watson to fix outdated 311 system – How NYC will use IBM’s Watson to handle 311 calls. FEATURED COURSES FROM BDU * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. UPCOMING DATA SCIENCE EVENTS * IBM Event: Big Data and Analytics Summit – February 14, 2017 @ 7:15 am – 4:45 pm, Toronto Marriott Downtown Eaton Centre Hotel 525 Bay St. Toronto Ontario. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data.,"This Week in Data Science (February 14, 2017)",Live,210 588,"THINKY AND RETHINKDB Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jul 28, 2016If you're looking for alternatives to writing queries and modeling your data using ReQL (the RethinkDB query language), you might want to consider looking at Thinky , an open-source ORM (Object Relational Mapper) designed for RethinkDB. Thinky provides a number of features for defining schema, querying, and adding relationships to your models. The ORM uses the same syntax as RethinkDB's Node.js driver , which makes it a great alternative to developing applications using ReQL. In this article, we will primarily concentrate on creating schemas and defining relations between your models. 
THE SCHEMA If you’re already familiar with building schemas in Mongoose , an ODM (Object Document Mapper) for MongoDB, then Thinky will look familiar since the creation of field names, validations, and queries are similar. If you are not familiar with Mongoose, then read our articles introducing you to it here and our article covering the latest version: Mongoose 4 . Overall, Thinky and Mongoose work on similar principles, which is to provide you with an efficient and object-oriented way to model your data. The difference between the two is that an ODM like Mongoose concerns itself with the structure of data within documents or tables, while an ORM like Thinky goes further by modeling the relationships between them. Thinky is a lightweight Node.js ORM that uses an alternative version of RethinkDB’s Node.js driver, rethinkdbdash , on the backend that has the added bonus of connection pools. While the ORM is not as fully featured as Mongoose, it enforces schema validations and creates indexes and tables automatically out of the box. One of the most useful features, however, is its four predefined relation methods that help you create relations between models, which we will discuss in more detail later. To give you a small example of the similarities between Mongoose and Thinky, let’s look at a schema in Thinky and Mongoose in the context of creating characters and houses from the popular series of novels and TV series “Game of Thrones"". THINKY SCHEMA var thinky = require(""thinky""); var type = thinky.type; var r = thinky.r; var Character = thinky.createModel(""Character"", { id: type.string(), // or String name: type.string(), createdAt: type.date().default(r.now()) // using RethinkDB’s r command through rethinkdbdash // to set the date on the server during creation }); var House = thinky.createModel(""House"", { id: type.string(), houseName: type.string(), characterId: type.string() }); MONGOOSE SCHEMA var mongoose = require(""mongoose""); var Schema = mongoose.Schema(); var CharacterSchema = Schema({ _id: String, // or {type: String} name: String, createdAt: {type: Date, default: Date.now} // uses JavaScript’s Date.now() method }); var HouseSchema = Schema({ _id: String, houseName: String, characterId: String }); var Character = mongoose.model(""Character"", CharacterSchema); var House = mongoose.model(""House"", HouseSchema); Looking at the example above, the advantage of using Thinky is that it defines our schemas, creates a model and assigns it a name, while creating the necessary tables within RethinkDB simultaneously. Whereas with Mongoose, you must define the schema then define the model. Thinky's documentation provides some good schema examples with the different field type options (String, Number, Date, Array, Boolean, etc.) and chainable methods that you can use to define fields further. RELATIONSHIPS AND JOINING DOCUMENTS An appealing feature of Thinky is its ability to help you assign relations between your models. The documentation for creating relations is not entirely clear; therefore, you might have to take a deep dive into its github issues for clarification. This section intends to explain how to define relationships between models and highlights some of the peculiarities and pitfalls you may encounter when using them. We will also provide you with a brief look at how they are interpreted in RethinkDB. RethinkDB's Node.js driver natively allows us to define many-to-many and one-to-many relationships by using the eqJoins and zip ReQL commands. 
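Returning briefly to the schema definitions above, those chainable type methods are worth a quick illustration. The following is a hedged sketch only: the validator names used here (required, min, max, enum, default) are the ones commonly shown in Thinky's documentation, and the Character model shape is reused from the examples above.
var Character = thinky.createModel('Character', {
  id: type.string(),
  name: type.string().min(1).max(100).required(),              // must be present, length-bounded
  status: type.string().enum(['alive', 'dead', 'unknown']).default('alive'),
  createdAt: type.date().default(r.now())                      // set server-side at insert time
});
If a document fails one of these checks, Thinky rejects the save rather than writing invalid data, which is the same role Mongoose validators play on the MongoDB side.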
For a brief overview of joining tables in RethinkDB using the eqJoin command, we've provided an overview with examples here . In general, the eqJoins and zip commands will take two tables, join them on foreign and primary keys ( eqJoin ), and merge them together returning you the joined documents. The syntax for this query would be as follows: r.table(""House"").eqJoin(""characterId"", r.table(""Character"")).zip() However, leveraging the power of Thinky, we are provided with four predefined relation methods ( hasOne , hasMany , belongsTo , and hasManyAndBelongsTo ) that write all of the ReQL commands for us using RethinkDB’s joins capabilities behind the scenes. All we need to do is provide Thinky with the primary and foreign keys so that it knows where the joining should occur. To define the relationships between our character and house models, all we have to write is the following: Character.hasOne(House, ""house"", ""id"", ""characterId""); House.belongsTo(Character, ""character"", ""characterId"", ""id""); Using the relation methods, we are able to choose the tables (or models) where the relations shall occur ( Character and House ). Then, we create a custom field name for that relationship ( house and character ) and provide the primary and foreign keys from the tables where each should be joined. If we need to add secondary indices, Thinky also makes this painless, since it does all the heavy lifting for you. The method ensureIndex , which under the hood wraps RethinkDB's indexCreate and indexWait commands together, checks to see if the index you defined exists and if it doesn't, creates it for you. Adding an index in our data just needs the following: Character.ensureIndex(""name""); House.ensureIndex(""houseName""); Now that we have prepared our models, relations, and indices, we are ready to start inserting data into our database. The only information that we must include is the name of the character and the name of the house they belong to. { ""createdAt"": ""2016-07-26T05:20:31.381Z"", ""id"": ""dc47cc90-f629-499b-8c45-efb1e987717c"", ""name"": ""Robert Baratheon"" } { ""houseName"": ""Baratheon"", ""id"": ""774b00a8-193b-468f-9784-919d56337baf"", ""characterId"": ""dc47cc90-f629-499b-8c45-efb1e987717c"" } Since our hasOne relationship points to characterId as the foreign key field in our House table, Thinky will automatically populate that field with the id of the appropriate character from the Character table when both tables are saved using the saveAll() method. In order to save these documents without running the risk of not inserting foreign keys, we must use the saveAll() method on our character and then pass in an object with the name of hasOne relationship we defined previously. character.house = house; character.saveAll({house: true}).then(function(data) {...}); Therefore, character.house joins the Character and House tables together. We assign it to house which will be the key where our house document will be stored when it is returned. Within the saveAll() method, we insert the name the document we want to join that was defined in our hasOne relation ( house ). Then, we set it to true in order to tell Thinky to save the house and the character tables together. After we've inserted our documents, we can use the getJoin query method to return the joined documents. 
Character.getJoin({house: true}).run().then(function(result) { console.log(result); }); In our query we call the getJoin method, which is also given an object with the name of the field we created in the hasOne relationship. It is also set to true so that Thinky knows which table to join to provide you the correct data. When we run this query, we get the following result: { ""createdAt"": ""2016-07-26T05:20:31.381Z"", ""house"": { ""houseName"": ""Baratheon"", ""id"": ""774b00a8-193b-468f-9784-919d56337baf"", ""characterId"": ""dc47cc90-f629-499b-8c45-efb1e987717c"" }, ""id"": ""dc47cc90-f629-499b-8c45-efb1e987717c"", ""name"": ""Robert Baratheon"" } If we ran the same query using RethinkDB's eqJoin command, it will produce a similar result. The only difference is that in the result above, the key house includes nested data from our joined table, whereas with eqJoin the data wouldn't be nested. RethinkDB's eqJoin command might provide a better solution than Thinky's getJoin method if you combine eqJoin with the zip and without commands. These commands will merge your documents together rather than storing them as nested data. (Refer to our article on RethinkDB joins for a more in depth discussion.) JOINING MULTIPLE DOCUMENTS Sometimes you have tables that have documents containing the same foreign key id. Using a hasMany relation is the optimal solution, which has the same syntax as the hasOne relation. For our use case, we might consider that some characters belong to two houses (i.e. Jon Snow) and what hasMany will do is modify our house object into an array of objects. Exclaimer: If you have two documents with the same foreign key and a hasOne relationship, Thinky will throw an error stating that you have more than one document with the same foreign key, so make sure that you have a hasMany relation defined beforehand. So, in the House table we might have two houses with the same characterId that refer to Jon Snow as in the following: { ""houseName"": ""Targaryen"", ""id"": ""c129274e-44f6-4224-9c30-e7112f596121"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" }, { ""houseName"": ""Stark"", ""id"": ""4c459d81-3162-4bc7-acca-4bb5467ccd13"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" } Querying our database for Jon Snow, using the same query as we executed above, will produce the following: { ""createdAt"": ""2016-07-26T05:20:31.381Z"", ""house"": [ { ""houseName"": ""Stark"", ""id"": ""4c459d81-3162-4bc7-acca-4bb5467ccd13"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" }, { ""houseName"": ""Targaryen"", ""id"": ""c129274e-44f6-4224-9c30-e7112f596121"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" } ], ""id"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"", ""name"": ""Jon Snow"" } Thus, we are given an array of houses with Jon Snow's id as the characterId in the house array, which produces a nice result that will allow us to manipulate the data further. We did not have to change any of our queries, or our code, to implement the hasMany relation. This is nice when you want to write an application fast and without too many obscurities. GET THINKY So, we've looked at some of the ways we can model, query, create relations between tables, and store data using Thinky. While it does not have all the capabilities that Mongoose has for MongoDB, the author is actively adding new features in order to increase its functionality and usability. 
Overall, it provides you with a few shortcuts to create tables and relations between them, which reduces the amount of code you'd have to write if you decided to only use ReQL. Also, it makes your code readable and helps produce consistent results. Image by Kalen Emsley Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Abdullah Alger is a content creator at Compose. Moved from academia to the forefront of cloud technology. Love this article? Head over to Abdullah Alger’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Thinky is an open-source ORM designed for RethinkDB. Here, we show how to create schemas and define relations between your models.",Thinky and RethinkDB,Live,211 594,"THE POTENCY OF IDEMPOTENT WITH RABBITMQ AND MONGODB UPSERT Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Oct 24, 2016Designing with possible failure in mind is a good strategy when building with distributed systems. The cloud is a distributed system. Latency, network failures, service end points changing and such are to be expected. If you don't expect them, well, be prepared for the unhappy Twitter responses or pager alert in the middle of the night when your application goes down. One of the easiest strategies to help mitigate potential failures is to design your data model for idempotent functions, which can be correctly called over and over again, and to rely on a messaging system to deliver data at least once. It's easy to do this with RabbitMQ and a ready made statement for idempotent data like Mongo's upsert . You Use Idempotent Data Models All the TimeWhen you make a deposit at your bank or transact with some large e-commerce companies, the data contained in those events is modeled for idempotent functions. This type of data model tends to be nothing more than a virtual ledger versus a non-idempotent model that relies on update in place semantics. In the ledger model a history is kept and a tally can be computed from the individual entries. In the non-ledger model there is no history just a balance field. Bank accounts, or even accounting in general, are great examples of this: In the ledger model the same piece of data can be processed more than once without that being an error. In the above example, if the Exxon withdrawal was inserted over and over again in the ledger model there wouldn't be any difference in the state assuming that the entry had some kind of key that identified it as being the same. Update the entry with the same data and rerun the tally to get the same balance as before. No harm other than some wasted compute cycles. In the non-ledger model though, every update other than the first would be an error. It would rerun the withdraw transaction steps and over debit the account. At Least OnceRabbitMQ is one of many messaging systems that provide at least once semantics if configured to do so. By requiring the consumer of a message from a queue to acknowledge when it has finished processing the message, one can guarantee at least once processing. If there is a failure between the consumer's finishing processing prior to acknowledging then it will be rerun again which is the more than once scenario. 
Systems that are not designed to handle this can create errors. The above non-ledger style account above would be a case in point. Using pika, a Python RabbitMQ driver, we can review configuring the queue to ack when done. The below connects to Rabbit, consumes from the transaction queue, and the acknowledges a message: import json import pika conn = pika.BlockingConnection(pika.ConnectionParameters( host='aws-us-east-1-portal12.dblayer.com', port=15518, virtual_host='tangible-rabbitmq-66', ssl=True, credentials=pika.credentials.PlainCredentials(""hays"", ""thumper"", True))) channel = conn.channel() for frame, props, body in channel.consume('transaction'): msg = json.loads(body) # Do Mongo Upsert Here! (See Below) channel.basic_ack(frame.delivery_tag) In the above you can see the basic_ack . We ack when message processing is complete. The thing that is also handled here is that if this doesn't finish processing then the original message will still be in the transaction queue without having been ack ed. That means in certain scenarios this entire code would be run again. In the non-ledger system this could be a problem with double processing. Idempotent solves this. Mongo's upsert as an Idempotent FunctionSo, one can insert data and let the datastore generate a key such as Mongo's ObjectId or using a SERIAL / SEQUENCE from PostgreSQL. This can be a big problem in the more than once scenarios like above where you could end up with the following: [ { id: 1, action: ""WITHDRAW"", account: 123, value: 45.0, at: ""2016-10-11T17:04:00-06:00"" }, { id: 2, action: ""WITHDRAW"", account: 123, value: 45.0, at: ""2016-10-11T17:04:00-06:00"" } ] Obviously, this would be bad for our account since it would be withdrawing an extra $45. The easiest solution is to treat the WITHDRAW as an entity before it ever makes it to the datastore or to any code where a failure could create multiple copies of the data. Keying the data with a UUID before publishing to a queue is a simple solution: { id: ""37320056-9d42-492a-a216-03bc5beea0ce"", action: ""WITHDRAW"", account: 123, value: 45.0, at: ""2016-10-11T17:04:00-06:00"" } Since the id is part of the original record we can assert it as the unique key and just upsert it to our store. We could do it more than once too as long as the processing doesn't create any other side effects. Extending the Python example from above, we use the pymongo MongoDB driver to key and publish just such data: ... import pymongo import ssl mongo = pymongo.MongoClient(""mongodb://idem:potent@aws-us-east-1-portal.19.dblayer.com:15513/idempotent"", ssl=True, ssl_ca_certs='i_am_a_ca.pem') accounts = mongo.idempotent.accounts ... for frame, props, body in channel.consume('transaction'): msg = json.loads(body) account = accounts.replace_one({'_id': msg['_id']}, msg, True) channel.basic_ack(frame.delivery_tag) ... In the above we've connected to a Mongo database named idempotent with a collection named accounts via SSL with our self-signed certificate which was copied from the Compose deployments Overview page. The new addition here is the accounts.replace_one function call. With the True parameter, it will insert the msg if not found. Otherwise it updates it. These together are upsert . It's all that is needed to safely process retries which might rarely happen from our consuming queue code. This is idempotent. An Added BonusEmbracing this publisher, queue, and consumer processing allows for splitting an application at a good boundary. 
Front end responses can proceed quickly and back end processing can be pushed to an asynchronous worker which relies on a queue: Less latency and more speed up front with acceptable speed and buffered processing on the back equals less total resources which is good from a dollars perspective. To parse that, it makes sense to handle front end HTTP requests optimistically then to asynchronously handle the heavy lifting on a back end process that can even be buffered during peak usage. Insert Queue, Respond ImmediateThere are many scenarios where having an asynchronous worker handle the heavy, stateful lifting makes good sense. Trading a transactional style commit for a simple queue insert can be a really good tradeoff for some workloads since your web serving code can respond without waiting. Let's see what it takes using Python's Flask web framework to create a simple API and use the pika driver to publish from the front end to RabbitMQ: from flask import Flask from flask import request from flask import jsonify import pika import json import uuid app = Flask(__name__) conn = pika.BlockingConnection(pika.ConnectionParameters( host='aws-us-east-1-portal12.dblayer.com', port=15518, virtual_host='tangible-rabbitmq-66', ssl=True, credentials=pika.credentials.PlainCredentials(""hays"", ""thumper"", True))) channel = conn.channel() channel.queue_declare(queue='transaction') @app.route(""/transaction"", methods=['POST']) def transaction(): msg = request.json msg[""_id""] = str(uuid.uuid4()) channel.basic_publish(exchange='', routing_key='transaction', body= json.dumps(msg), properties=pika.BasicProperties( delivery_mode = 2 )) return jsonify(msg) if __name__ == ""__main__"": app.run() The above connects to Rabbit, declares a queue, and creates an HTTP endpoint that accepts a JSON POST request. For each POST , it parses the request and generates a unique key before it publishes the data to a persistent, on disk queue. By keying the entity here instead of at the database layer, it protects the identity of the data and allows the later upsert to be idempotent. TradeoffsThere is a lot to be gained by building your apps with messaging: separating concerns, asynchronous and possibly parallel processing on the back end, the ability to use multiple languages, and on the list can go. There are some tradeoffs in an approach such as added complexity and running a new server to handle messaging. The beauty of Compose is that we can handle running that RabbitMQ server for you so you can go about the business of building scalable applications. Image by: Chris Pastrick Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Hays Hutton writes code and then writes about it. Love this article? Head over to Hays Hutton’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. 
© 2016 Compose","One of the easiest strategies to help mitigate potential failures is to design your data model for idempotent functions, which can be correctly called over and over again, and to rely on a messaging system to deliver data at least once.",The Potency of Idempotent with RabbitMQ and MongoDB Upsert,Live,212 601,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * Armand Ruiz Blocked Unblock Follow Following Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own Jun 23, 2016 -------------------------------------------------------------------------------- MODELING ENERGY USAGE IN NEW YORK CITY On June 6 we introduced the **IBM Data Science Experience** to the world at the Spark Maker Event that took place in Galvanize. We demonstrated the Experience with a real use case developed in partnership with BlocPower. BlocPower is a startup based in New York City. Its technology and finance platform develops clean energy projects in American inner cities. IBM Data Science Experience helped BlocPower perform a comprehensive energy audit of each property to determine the correct mix of high-efficiency technology to reduce each customer’s energy consumption. Tooraj Arvajeh, Chief Engineering Officer at BlocPower, explained how IBM Data Science Experience made this process simpler. “BlocPower operation is diverse from outreach and targeting, origination of investment-grade clean energy projects to financing projects through our crowdfunding marketplace. Data is the underlying tool of our operation and IBM’s Data Science Experience will facilitate a closer integration across it and help our business scale up faster. “GOALS OF THE DEMO: - Easily import data into a notebook from object storage to quickly start analyzing data and creating predictive models. - Model energy usage of buildings in kWh. - Identify buildings that consume energy inefficiently. - Create a project and collaborate with other data scientists. - Create an easy-to-use application to make the outcome of the models consumable by any user. To do that, we used tools that data scientists love today that are integrated into the IBM Data Science Experience: Jupyter notebooks connected to Apache Spark, RStudio, Shiny, and GitHub. These are the steps that we followed: 1- GitHub + Jupyter notebooks = ❤ When starting a new project, the data scientist can choose to start from scratch or to leverage someone else’s work. In this case, we showcase the Import from URL capability to import an existing notebook from GitHub and start working on it right away. There are more than 200k public Jupyter notebooks out there that you can use! 2- Load and clean data To analyze data in a Jupyter notebook, first load the data. Many libraries and commands can do that, but it’s not always obvious which one to use. One of the add-ons to Jupyter notebooks is the capability to access data files stored in object storage or available through data connections and in one click to add the code needed to load the data into the notebook. Once the data is loaded, the next step is to clean it. We created a library called Sparkling.Data , which can scale to big data, to help the data scientist perform this task. 
3- Data Exploration After cleaning the data, we used Matplotlib , the best tool available for data visualization in Python, to explore the correlations between energy usage and building characteristics such as age, number of stories, square footage, amount of plugged equipment, and domestic and heating gas consumption. By analyzing variable relationships, the data scientist can, for example, determine the best model to use and which variables have more predictive power. 4- Create a Prediction Model Our goal is to create a model that predicts the energy consumption in kWh of different buildings based on characteristics such as square feet, age, number of stories, and so on. We model energy usage with a linear regression using the algorithm included in scikit-learn , one of the best Python libraries for machine learning. Before running the linear regression, we used the MaxAbsScaler function from scikit-learn to scale the data. To visualize the fit of this model, we use a scatter plot of the observed vs. the predicted values. The resulting R-squared value was approximately 0.72. 5- Classify buildings by efficiency We used the popular **K-means** algorithm to cluster buildings in NYC based on four dimensions that indicate energy efficiency: gas use for heating, gas use for domestic purposes, electricity use for plugged equipment, and electricity use for air conditioning. In the next matplotlib plot, we colored our buildings by using the K-means labels with K=4 and using two out of the four dimensions. This visualization, and other visualizations not shown here, helped us reduce the four clusters to two. These two clusters of buildings were interpreted as the efficient and the inefficient groups of buildings. 6- Flexdashboard and Shiny in RStudio RStudio just published on CRAN a new R package called Flexdashboard . This great package enables creating dashboards very easily, and you can include Shiny code to make dashboards very interactive. A dashboard can be shared with anyone by simply sending the URL. The dashboard is divided into 4 sections: - Data Exploration : A map of buildings colored by their electricity consumption. When a building is selected, a bar plot indicates how this building is doing with respect to the average energy efficiency measured in four dimensions. - Clustering : A map of buildings classified as efficient or inefficient. - Prediction : Scoring of the linear regression model built in the notebook to predict the energy usage in kWh and annual cost of electricity for the buildings. On the left side are sliders for selecting the properties of the building to score the model. - Raw Data : We use the Data.Tables package to display the data set with search and sorting capabilities. Link to the Shiny Application You can check out the 10-minute demo of IBM Data Science Experience here: We created a GitHub repository with all of the material and instructions needed to run this demo, too. Enjoy! Link to GitHub repo * Data Science * Machine Learning * IBM 4 2 Blocked Unblock Follow FollowingARMAND RUIZ Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * 4 * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. 
Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates",On June 6 we introduced the **IBM Data Science Experience** to the world at the Spark Maker Event that took place in Galvanize. We demonstrated the Experience with a real use case developed in…,Modeling energy usage in New York City,Live,213 602,"COMPOSE'S 2016 - ALL ABOUT THE DATABASE Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 22, 2016What a year it's been for Compose. We... * Brought MongoDB 3.2 , PostgreSQL 9.5 , ElasticSearch 2.4.0 and Redis 3.2 to the Compose system * Built features around them to make working with them even easier * Grew to keep up with demand for our production-ready databases and talking about those challenges and the the tools we use * Introduced Compose for MySQL and ScyllaDB in beta to the Compose platform * Promoted RabbitMQ out of beta * Introduced Compose databases as a service on IBM Bluemix * Added Google's Cloud Platform as a deployment option for Compose databases * Announced and shipped the Compose Enterprise option for private database clusters * Integrated FIDO universal 2nd-factor authentication into the Compose UI * Held our first DataLayer conference And if that's not enough, read on to find out what happened with your favorite databases at Compose. MONGODB 2016 started with MongoDB 3.2 on the beta of Compose's new MongoDB+SSL deployments. Those beta deployments became the default MongoDB configuration on the Compose platform after a month and came complete with support for WiredTiger and Oplog access . The addition of new import capabilities to MongoDB to help people migrate came later in the year. MongoDB 3.2 itself offered some exciting new features such as partial indexes, validation , improved aggregation and lookup which were covered in the Compose Articles blog. There was also coverage of other subjects like connection pooling , connecting with Go , using Meteor 1.4 with Compose , MongoDB data transfers with NiFi , geospatial queries and using Node-RED with MongoDB for prototyping apps . GraphQL is up and coming and we looked at using it with MongoDB early in the year, thanks to a Write Stuff author. POSTGRESQL The arrival of PostgreSQL 9.5.2 in April got us asking, and answering, a regular question - "" Could PostgreSQL 9.5 be your next JSON database? "". PostgreSQL 9.5 included some great new features: row level security and group by options (following on from 2015's look at upsert in 9.5 ). PostgreSQL deployments saw many enhancements: cross database queries extensions , performance and extension views for your Compose console, and a SQL query data browser to let you browse PostgreSQL's data from your browser. In Compose Articles, we explored PostgreSQL features like full-text search indexing and per-connection write consistency . There were also looks at some alternative ways of accessing your PostgreSQL data such as using PostgREST to create a RESTful API from your schema, hugsql with clojure for an SQL-centric approach and a look at PostGraphQL for quickly bringing GraphQL to your database. 
Compose's in-house analytics ""Metrics Maven"" started a new and on-going series, and as you might expect with us using PostgreSQL in-house there were a lot of useful articles about breaking down data using PostgreSQL: * Window Frames in PostgreSQL * Calculating a Moving Average in PostgreSQL * Making Data Pretty in PostgreSQL * Creating Pivot Tables in PostgreSQL using Crosstab * Beyond Average: A Look at Mean in PostgreSQL * Meet in the Middle Median in PostgreSQL ELASTICSEARCH The recent update to Elasticsearch 2.4.0 on Compose is designed to keep things fresh as we have worked on making Elasticsearch more accessible with a new data browser . Meanwhile, in Compose Articles, there's been a look at how to set up Kibana locally to explore your Elasticsearch deployments. Elasticsearch on Compose also got simplified and more effective security with our new Let's Encrypt based TLS Certificates . Compose's Elasticsearch coverage included a wide range of subjects in 2016. There was a four-part series on Elasticsearch and Perl on how to connect and monitor an ElasticSearch Cluster , how to do indexing , advanced index options , and how to use the querying and search features of the Elasticsearch.pm perl module. For the more modern developer, they now could learn how to leverage their Node.js skills with our Getting started with Elasticsearch and Node.js five part series which worked with real data ( 2 , 3 , 4 and 5 ). Compose's Metrics Maven also added to our library of Elasticsearch articles by looking at how scoring works , a mini-series on increasing Elasticsearch relevance , and a deep dive into how to use query string queries effectively. REDIS Redis on Compose saw many improvements over the year. We started exposing more controls for it like the KEA switch so you could tune it for your needs , and we let you secure it with SSH tunneling from the Compose Redis console. Our developers also added a data browser , made migrations easier with new Redis importing , let you see slow logs , tuned the autoscaling resolution and improved our cache/storage Redis modes , boosting the cache performance for many users. Updating to Redis 3.2 brought a number of new features to the database itself; one article looked specifically at the Redis Geo API , which offers applications new ways to understand the geographic proximity of things - ideal for mobile applications. Another looked at Lua scripting , an older feature which you needed to know about so we could talk about Redis 3.2's new Lua debugging . Compose Articles also looked at the most popular Redis drivers and using Redis PubSub and web sockets . RETHINKDB RethinkDB always impressed us with its solid engineering, and the release of RethinkDB 2.2.4 and 2.3.2 brought features like user authentication and native SSL to the database, and a new proxy to Compose RethinkDB deployments. Compose articles showed different ways to connect to RethinkDB, such as from Elixir applications and by using the new Java driver over SSL . Other articles demonstrated strategies for preventing data loss by configuring RethinkDB replicas, how to aggregate data from multiple tables using RethinkDB Joins , using new aggregation features like fold , and using the Thinky ORM for a more object-centric way of accessing RethinkDB. Unfortunately, the company behind RethinkDB, RethinkDB Inc, had to shut down; the open-sourced engineering legacy and active planning to seed a new open source community mean Compose will continue to support RethinkDB . 
RABBITMQ With the promotion of Compose RabbitMQ out of beta , we also updated it to version 3.6.5, introducing new features like Lazy Queues. RabbitMQ was another platform which got the new Let's Encrypt based SSL/TLS certificates for AMQPS connections . Compose Articles looked at how to use MongoDB and RabbitMQ together to create resilient applications and using RabbitMQ in microservices . ETCD Compose etcd entered beta in 2015 and hasn't left yet. Building a dynamic configuration service , a Write Stuff article, showed how to apply etcd in practice. We also showed you how to use etcdtool to backup and restore your etcd data and gave you a behind-the-scenes peek at how our engineering team tracked down a problem with performance degradation and fixed it. MYSQL This year, two new beta databases debuted on Compose - one of them was Compose for MySQL . This MySQL deployment is built on top MySQL 5.7.15 and makes use of InnoDB-based group replication to help deliver the high-availability and reliability that we incorporate in all out databases. Compose for MySQL is a recent arrival and, like other Compose databases, offers one-click deployments a cluster, SSL connections and a range of drivers for applications . SCYLLADB The beta release of ScyllaDB on Compose brings an Apache Cassandra compatible database to Compose for the first time. ScyllaDB presented at DataLayer and our own Nick Stott talked at Scylla Summit . Scylla CTO Avi Kivity also chatted with Compose Articles about what makes Scylla faster and lighter than Cassandra. Compose launched with Scylla 1.3 and the blog covered getting started with simple connections . We then followed that up with our better connected Scylla 1.4 release which lets you connect to all the Scylla nodes in your three node clusters. 2017, HERE WE COME There's so much in the pipeline for 2017 at Compose and we know it'll make your data life so much better. Image by Nitish Meena Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2016 Compose",What a year it's been for Compose. Here are the highlights ...,Compose's 2016 — All about the database,Live,214 605,"The dplyr and tidyr packages are built to save you time when you wrangle data. Together, they provide a complete system for reshaping, transforming, and combining data sets.","The dplyr and tidyr packages are built to save you time when you wrangle data. 
Together, they provide a complete system for reshaping, transforming, and combining data sets.",Data Wrangling with dplyr and tidyr Cheat Sheet,Live,215 606,"CONNECTING POUCHDB TO CLOUDANT ON IBM BLUEMIX cloudant node.js nosql pouchDB Raymond Camden / April 29, 2015. Republished from Raymond Camden's Blog -------------------------------------------------------------------------------- So, as always, I tend to feel I'm a bit late to things. Earlier today my coworker Andy Trice was talking to me about PouchDB. PouchDB is a client-side database solution that works in all the major browsers (and Node.js) and intelligently picks the best storage system available. It is even smart enough to recognize that while Safari supports IDB, it doesn't make sense to use it and switches to WebSQL. It has a relatively simple API and best of all – it has incredibly simple sync built in. I tend to work with client-side databases with just the vanilla JavaScript APIs available to them, but honestly, after an hour or so of using PouchDB I can't see going back. (And yes, I know other solutions exist too – and I'm going to explore this area more.) Probably the slickest aspect is the sync. If you have a CouchDB server setup, you can set up automatic sync between all the database instances in seconds. For my testing, I decided to use IBM Bluemix. This blog post assumes you're following the PouchDB Getting Started guide. First, add the Cloudant NoSQL DB service to your Bluemix app. After you have added the service and restaged your app, select it, and then hit the Launch button. This fires up the Cloudant administrator where you can do – well – pretty much everything related to setting up your database. But to work with that guide at PouchDB, select Databases and then “Add New Database”, then enter todos to match the guide. OK, you're almost done. You then want to enable CORS for your Cloudant install. In the Cloudant admin, click Account and then CORS. Enable it, and then select what origin domains you want. For now, it may be easier to just allow all domains. Woot! OK, one more step. When using PouchDB and sync, they expect you to supply a connection URL. You can get this back in your Bluemix console. Select the “Show Credentials” link to expand the connection data and then copy the “url” portion. And voila – that's it. If you open your test in multiple browsers, you'll see everything sync perfectly. Remember you can also use PouchDB in Node.js, which, coincidentally, you can also host up on Bluemix, so yeah, that works out well too.
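To make that last step concrete, here is a hedged sketch of the sync call the Getting Started guide builds up to; the remote URL below is a placeholder standing in for the url value copied from the Bluemix credentials, not a real endpoint.
// local, in-browser database
var db = new PouchDB('todos');
// remote Cloudant database, using the credentials URL from Bluemix (placeholder shown)
var remoteCouch = 'https://username:password@username.cloudant.com/todos';
// keep the two in sync continuously, retrying if the connection drops
db.sync(remoteCouch, { live: true, retry: true })
  .on('change', function (info) {
    // a good place to re-render the todo list
  })
  .on('error', function (err) {
    console.log('sync error', err);
  });
With live sync in place, edits made in one browser show up in the others as soon as they reconnect.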
","A step-by-step guide to configuring Cloudant on Bluemix so that it is able to replicate to and from PouchDB - an in-browser, CouchDB-compatible database. PouchDB and Cloudant allow offline-first apps to be developed, allowing your app's users the ability to save their data even when not connected to the internet and syncing the data at a later date.",Connecting PouchDB to Cloudant on IBM Bluemix,Live,216 607,"SEVEN DATABASES IN SEVEN DAYS – DAY 1: RETHINKDB Lorna Mitchell and Matt Collins / July 28, 2016This post is part of a series of posts created by the two newest members of our Developer Advocate team here at IBM Cloud Data Services. In honour of the book Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, we challenged Lorna and Matt to take a new database from our portfolio every day, get it set up and working, and write a blog post about their experiences. Each post reflects the story of their day with a new database. We'll update our seven-days GitHub repo with example code as the series progresses. Meet RethinkDB and its mascot, The Thinker. * Database type: highly scalable JSON storage with real-time data feeds * Best tool for: situations where it's important to quickly update when data changes OVERVIEW RethinkDB is a database that aims to provide a performant and scalable storage solution that pleases both development and operations people. Inside, it's a document database using JSON format, is distributed by nature, and includes a user-friendly admin console for managing it.
So far, nothing particularly special but RethinkDB has a couple of tricks up its sleeve: unusually for document databases it supports joins, and it also allows you to retain a connection to a query so if any further results arrive they will instantly be pushed to the client over the same connection. RethinkDB is open source, so you can run this anywhere, although for these examples we'll make use of the cloud and grab a RethinkDB from Compose. This article covers how to get started setting up RethinkDB and connecting your application to it. We've put together a quick example using a hypothetical issue tracker and paying particular attention to the data feed updates that RethinkDB offers. GETTING SET UP Start by setting up a RethinkDB instance on Compose, which will then take a moment to deploy to the appropriate cloud (choice of AWS, SoftLayer or DigitalOcean) for you and then send you to the console. Once the database has been created, we'll start by logging into the Admin UI. CONNECTING TO THE ADMIN UI In the Compose overview for this database, you'll find some connection information – a username (defaults to admin), an Authentication Credential (a fancy way of saying password!), and a few connection strings. One of these strings is for the Admin UI. Pop that into the address bar for your browser of choice, and you should be prompted for the username and password you discovered above. Enter them into the prompt and you should see the quite lovely looking RethinkDB Admin UI in front of you! “Shine on you crazy admin” —Rethink Floyd From here we can see how our database is doing, run queries, create tables, manage indexes, and so on. This is interesting and useful, but even more interesting and useful is how we can programmatically access our database, which we'll cover in the next section. CONNECTING FROM NODE.JS There are official libraries for Node.js, Python and Ruby, and there are many more community-contributed offerings that seem to work well. So most applications will be able to easily take advantage of RethinkDB's features. There are a few pieces of information that we want to grab from the connection details screen: * Authentication Credential: this is a token that you need to click to show, and then copy * Certificate: further down the page, there's a self-signed SSL cert that you should save somewhere. Ours is in a file called cert and you'll see it referenced in our application shortly. For this example, we used Node.js, and put all the initial configuration and setup into a file named config.js, which we included in all our other scripts (example code on GitHub). Here's that file:
const fs = require('fs');
// read the self-signed certificate saved from the Compose console
const cert = new Buffer(fs.readFileSync('./cert', 'utf8'));
const connection = {
  host: 'aws-us-east-1-portal.17.dblayer.com',
  port: 11557,
  user: 'admin',
  password: 'SAHPgKzuOeFj7qu8ZaXCDjPNz4LPrCpfWEyquasjrA',
  ssl: {
    ca: cert
  }
};
module.exports = {
  connection: connection
};
Take a look at the Connection Settings screen again, and specifically at the “RethinkDB Proxy Connection strings”: the password is the Authentication Credential that you acquired earlier. Now we can test the connection by attempting to create a database — if we can successfully do this, then we know everything is working well.
Here's our create_db.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // create our DB
  r.dbCreate('issues').run(conn, function(err, data) {
    if (err) throw err;
    console.log('DB created');
  });
});
Pro-tip: Remember to include the self-signed SSL cert that Compose gives you. If you don't yet have Node.js, we recommend Homebrew for OS X. Then just brew install node. Treehouse also has some nice instructions. This code simply includes the config file we created earlier, creates a connection to the database, and outputs a log message if it is successful. At this point, we can start to use this connection to perform other operations. DESIGN YOUR DATABASE As an example, we'll consider a simple sort of bug tracker application, just allowing us to add issues and keep track of their status and so on. First, we'll create a table to store the issues. RethinkDB has a nice, easy web interface which you can use to create tables. You may also want to do that programmatically, so let's start by looking at the code we used to create the issues table. Here's our create_table.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // create our table
  r.db('issues').tableCreate('issues').run(conn, function(err, data) {
    if (err) throw err;
    console.log('Table created');
  });
});
Check in the admin interface to see your new database listed and verify that everything worked as expected. You should see your new table (but it's still empty). IMPORTING DATA Since RethinkDB is JSON-based, it's pretty happy to ingest JSON data of any kind, which is nice! There's some detailed documentation on importing data, but we generated some sample data using http://json-generator.com and simply used that to quickly give ourselves something to work with. Importing data from our application is quite simple. Here's a snippet from our application, with the data to import saved into a file named seed_data.json in the same directory. Here's create_data.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// seed data
const seed = require('./seed_data.json');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // seed our table with some data
  r.db('issues').table('issues').insert(seed).run(conn, function(err, data) {
    if (err) throw err;
    console.log('Seed data added');
  });
});
This is a great way to get started quickly with some data in the issues table, and it means we can move along to the fun parts: querying the data and then seeing later changes also arrive instantly. FETCHING DATA AND RECEIVING UPDATES RethinkDB has its own query language called ReQL (for the very quickest of starts, there's even an SQL to ReQL cheatsheet). Let's look at a very simple query. It fetches all records from our issues table, but here's where it gets interesting: this script will then remain connected, and output further records when new data appears.
First the code that queries the database and outputs information for each issue (fetch_all_data.js):
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// seed data
const seed = require('./seed_data.json');
// helper function to format the output
const format = require('./format_issue.js').output;
// async
const async = require('async');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  var actions = {
    current: function(callback) {
      // Get every issue already in the table
      r.db('issues').table('issues').run(conn, function(err, cursor) {
        if (err) throw err;
        cursor.each(function(err, issue) {
          console.log(format(issue));
        }, callback);
      });
    }
  };
  async.series(actions, function() {
    // Get every new issue as it arrives, using a changefeed
    r.db('issues').table('issues').changes().run(conn, function(err, cursor) {
      if (err) throw err;
      cursor.each(function(err, change) {
        console.log(format(change.new_val));
      });
    });
  });
});
Take a look at the output of this script (this is just the last few lines): ebd578b5-fde3-4318-bb9e-e2aaf7b43b21 ut anim sunt voluptate ex reprehenderit STATUS: closed ================================ b22d4484-2a00-472d-b5c1-20af894ed056 est sint labore tempor veniam sit STATUS: wontfix ================================ ef344e27-e809-44cd-8395-1c93490c546e in officia Lorem in pariatur labore STATUS: reopened We can leave this running in the terminal and from another window, use a script that just inserts one new row that would appear in our dataset. Below is a quick script to do that; it cheats and steals an existing row of data and repurposes it. And now, we give you create_new_row.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// some helper modules
const _ = require('underscore');
const argv = require('optimist').argv;
// seed data: take one existing row at random and reuse it
const row = _.shuffle(require('./seed_data.json'))[0];
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // drop the id so RethinkDB generates a fresh one, then insert the repurposed row
  delete row.id;
  r.db('issues').table('issues').insert(row).run(conn, function(err, data) {
    if (err) throw err;
    console.log('New row added');
    conn.close();
  });
});
With the new row in place, take a look at what's going on in the output of our original fetch-all-the-data script: ebd578b5-fde3-4318-bb9e-e2aaf7b43b21 ut anim sunt voluptate ex reprehenderit STATUS: closed ================================ b22d4484-2a00-472d-b5c1-20af894ed056 est sint labore tempor veniam sit STATUS: wontfix ================================ ef344e27-e809-44cd-8395-1c93490c546e in officia Lorem in pariatur labore STATUS: reopened ================================ 3fc73e89-8da1-4bce-91a3-31ae897ab7b6 Lorem nisi proident ea commodo nulla STATUS: reopened CONCLUSION This ability to keep queries running and instantly ship updates when the data changes is a key feature of RethinkDB. It makes this tool a great choice for anything which needs to update in response to data, either changing prices on a ticker or notifying other users of a web-based tool that someone else made changes. RethinkDB can be used by any number of server-side languages and is available whether you want to run it on your own hardware or deploy it as-a-service.
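As one further hedged sketch (not part of the original workshop code), the same changefeed approach also works on a filtered query, which is handy when a dashboard only cares about a subset of the table; the status value below is an assumed example rather than one taken from the seed data.
// connect as before, then watch only the open issues
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  r.db('issues').table('issues')
    .filter({ status: 'open' })   // assumed status value; any ReQL filter works here
    .changes()
    .run(conn, function(err, cursor) {
      if (err) throw err;
      cursor.each(function(err, change) {
        console.log('open issue changed:', change.new_val);
      });
    });
});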
","Looking to learn the basics of cloud databases? In this series, we show them running on Compose and intro programmatic access. First up: RethinkDB.",Seven Databases in Seven Days – Day 1: RethinkDB,Live,217 615,"WEB PICKS (WEEK OF 28 DECEMBER 2016) Posted on January 3, 2017Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources . * The NIPS (Neural Information Processing Systems) 2016 conference is just past, and many people are reflecting on the many great works presenting there. See NIPS 2016 Highlights – Sebastian Ruder , Some general take aways from #NIPS2016 , 50 things I learned at NIPS 2016 , Post NIPS Reflections , All the available code repos for the NIPS 2016's top papers for what people are saying, as well as Le Cun's slides . * The great AI awakening How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself. * In the race to build the best AI, there's already one clear winner As Google, Facebook, Microsoft, and Baidu take turns leapfrogging each other in artificial intelligence innovation, one company stands to profit from any outcome: Nvidia. * The World's Largest Hedge Fund Is Building an Algorithmic Model From its Employees' Brains Bridgewater wants day-to-day management—hiring, firing, decision-making—to be guided by software that doles out instructions. * Crime Prediction software joins Dubai Police Force In addition to its fleet of supercars, the Dubai Police are now enlisting the help of Crime Prediction software. * What I learned creating one chart with 24 tools Finding the best tool means thinking hard about your goals and needs. * The Most Boring/Valuable Data Science Advice “I'm going to make this quick. You do a carefully thought through analysis. You present it to all the movers and shakers at your company. Everyone loves it.
Six months later someone asks you a question you didn’t cover so you need to reproduce your analysis…” * The major advancements in Deep Learning in 2016 “In this article, we will go through the advancements we think have contributed the most (or have the potential) to move the field forward and how organizations and the community are making sure that these powerful technologies are going to be used in a way that is beneficial for all.” * US starts asking foreign travelers for their social media info Homeland Security approved the controversial proposal a few days ago. * Wall Street wants algorithms that trade based on Trump’s tweets Trump’s volatility is a market opportunity. * Tourists Vs Locals: 20 Cities Based On Where People Take Photos Tourists and locals experience cities in strikingly different ways. Great maps! * Tool AI’s want to be Agent AI’s “Tool AIs limited purely to inferential tasks will be less intelligent, efficient, and economically valuable than independent reinforcement-learning AIs learning actions over computation / data / training / architecture / hyperparameters / external-resource use.” * Building Jarvis Wondering how Zuckerberg creates an AI? “My personal challenge for 2016 was to build a simple AI to run my home — like Jarvis in Iron Man.” * A non-comprehensive list of awesome things other people did in 2016 Some people always manage to stick an ungodly amount of work in a year! * Finding MLB Anomalies with CADE “Over the Summer, while an intern at Elder Research, I learned about a very intuitive anomaly detection algorithm called CADE, or Classifier-Adjusted Density Estimation. The algorithm seemed very simple, so I wanted to try and implement it myself and try to find anomalous players in the MLB.” * A Guide to Solving Social Problems with Machine Learning “We have learned that some of the most important challenges fall within the cracks between the discipline that builds algorithms (computer science) and the disciplines that typically work on solving policy problems (such as economics and statistics). As a result, few of these key challenges are even on anyone’s radar screen.” * A Visual and Interactive Guide to the Basics of Neural Networks Simple explanation with great interactive visualizations. * Top 10 Python libraries of 2016 “Again, we try to avoid most established choices such as Django, Flask, etc. that are kind of standard nowadays.” * Hamiltonian Monte Carlo explained MCMC (Markov chain Monte Carlo) is a family of methods that are applied in computational physics and chemistry and also widely used in bayesian machine learning. * Data science and critical thinking (pdf) Some great stats and thoughts in this presentation! * Speed up your code with multidplyr “There’s nothing more frustrating than waiting for long-running R scripts to iteratively run. I’ve recently come across a new-ish package for parallel processing that plays nicely with the tidyverse: multidplyr.” * Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling “We study the problem of 3D object generation. We propose a novel framework, namely 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets.” * China invents the digital totalitarian state Big data, meet big brother. 
* How we learn how you learn “In this post, we'll take a look at the science behind the Duolingo skill strength meter, which we published in an Association of Computational Linguistics article earlier this year….” * Machine learning model to production (presentation) As explained by Georg Heiler. * Anomaly Detection at Scale (presentation) Jeff Henrikson presents at the first annual O'Reilly Security Conference, in New York City, 2016.","Interesting data science links from around the web, collected in Data Science Briefings, the DataMiningApps newsletter. ",Web Picks (week of 28 December 2016),Live,218 616,"SCALING OFFLINE FIRST WITH ENVOY At Offline Camp, fellow IBM Developer Advocate Bradley Holt gave a Passion Talk on Cloudant Envoy. As I am more involved in this project than he is, Bradley asked me to write this summary of Cloudant Envoy in his place. The “one database per user” design pattern makes things very easy for an Offline First application developer. Simply create a database on the mobile device and one in the cloud and get your app to read and write from its local copy. When there is an internet connection, data can be synced between the device and the cloud. CouchDB 2.0 and IBM Cloudant are built to scale massively on the server side and each mobile device only needs to store a single user's data. We can use PouchDB for web apps and Cloudant Sync for native mobile apps on the client side and use the CouchDB replication protocol to sync without loss of data. (Figure: one database per user) The problem comes as the number of users increases: backup, reporting and change control become problematic when there are hundreds / thousands / millions of individual databases — one for each user. (Figure: database proliferation) Earlier this year, faced with this scaling problem, some IBMers armed with a flip chart, some Sharpies and a code editor, set about building something to address the scalability problems with this approach. (Figure: Envoy — from scribbles to code) Envoy is a Node.js micro-service that sits between the mobile devices and the Cloudant or CouchDB 2.0 cluster in the cloud, acting as a CouchDB replication target. It proxies the replication requests between the client and server replicas, subtly changing the documents on the way through and storing the data in a single, server-side database.
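To make the replication protocol mentioned above concrete, here is a minimal sketch, not from the original post, of asking a CouchDB-compatible server to replicate a local database to an Envoy-style target over its standard HTTP API. The URLs, database names and credentials are placeholders; in the article itself the clients doing the replicating are PouchDB and Cloudant Sync rather than a server-side request.

# Minimal sketch: trigger a one-off replication against a CouchDB-compatible
# replication target (such as Envoy) via the standard /_replicate endpoint.
# All URLs, database names and credentials below are placeholders.
import requests

COUCH_URL = "https://myusername:mypassword@myhost.example.com"

replication = {
    "source": "local-device-db",                             # database to copy from
    "target": "https://user:pass@envoy.example.com/envoy",   # Envoy acts as the target
    "continuous": False                                      # one-shot sync; True for live sync
}

resp = requests.post(COUCH_URL + "/_replicate", json=replication)
resp.raise_for_status()
print(resp.json())   # e.g. {"ok": true, ...}

Because Envoy speaks the replication protocol, anything that can replicate to CouchDB can treat it as an ordinary target; the per-user segregation happens behind that endpoint.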
(Figure: many databases replicating to a single database via Envoy) Each mobile device still has one database per user but Envoy seamlessly stores the server side data in one database — or in two databases if you count the database of users too. Having a single store of data in the cloud makes querying, backing-up and managing the data set a breeze. We think Envoy has potential, but it's early days for the project and we're looking for folks in the Offline First community to try it out, provide feedback with comments & suggestions and hopefully contribute to the codebase. It's published under the Apache-2.0 license so we'd be more than happy for folks to get involved. In future posts to the IBM Cloud Data Services Developer Center blog, we'll delve into some of the technical details but for now I'll leave you with some links: * https://github.com/cloudant-labs/envoy * https://www.npmjs.com/package/cloudant-envoy If you have any questions then you can leave comments here or ping me in the Offline First Slack community. Thanks to Bradley Holt and Maureen McElaney.","At Offline Camp, fellow IBM Developer Advocate Bradley Holt gave a Passion Talk on Cloudant Envoy. As I am more involved in this project than he is, Bradle…",Scaling Offline First with Envoy — Offline Camp,Live,219 620,"A TOUR OF THE REDIS STARS Published Oct 18, 2016 On the Redis site is a page that lists Redis clients for various languages. It's very extensive, covering clients that work with languages as diverse as emacs lisp, GNU Prolog, Haskell and C#. Throughout the list, some clients have a star next to them and these are the current recommended clients. In this tour of the starred clients, the ones that are recommended, we're going to list them in order of language popularity using the Redmonk programming language index for June 2016. Before we dive into the tour though, you may be wondering why having so many different clients for Redis is important. Redis can be thought of, in many cases, as the database glue that can hold many applications together. While disk-based databases are good as sources of reference, Redis shines in being a source of state and transient data for many systems. In a lot of cases, it's used as a cache, but that's just sharing state out to many other clients. What makes Redis work is that it has so many drivers there's no application that can be considered out of the running when working with Redis. The recommended clients are the cream of the crop, the ones that have proven themselves to be stable, mature and well maintained. Before we start though, it's worth noting there are two major styles of driver: minimalist drivers and what we call the idiomatic drivers. The minimalist drivers provide the framework to send Redis command strings and arguments and decode the Redis response. The developer using a minimalist driver will have the Redis commands documentation to hand. The idiomatic drivers instead map the Redis command set to a richer API which exposes the Redis commands in a way that's native to the language. So let's begin the tour with...
JAVASCRIPT/NODE.JS (1) - NODE-REDIS AND IOREDIS JavaScript tops the Redmonk rankings, though you'll only find JavaScript in its Node.js form in the client list. Node.js is popular with Redis client developers though; there are ten listed on the client list and two of them are recommended. Node-redis is an idiomatic driver and claims complete coverage of the Redis command set with an entirely asynchronous set of calls. That means callbacks all round for processing results though it can be promisfied with bluebird for less indented, more predictable code flow. It also has support for server events for managing the connection and subscriber events for managing pub/sub subscriptions. Handy tricks: a built-in redis.print command you can use instead of a callback to just print results. ioredis is another idiomatic and extensive Redis client with a similar set of features to node-redis, and more. For example, it works with Redis sentinels and clusters out of the box and supports ES6 Map and Hash types. Its support for Lua scripting includes a defineCommand call to simplify the process of uploading and storing Lua in the Redis server. You may wonder why two clients are recommended, and it appears so do the developers who are currently, but not rapidly, working on consolidating the features of node-redis and ioredis into a single library. Which leaves the question, which to choose currently. We'd lean towards ioredis purely because it's a more recently developed codebase. JAVA (2) - JEDIS, LETTUCE AND REDDISON Java, halfway house to the motto ""there's more than one way to do things"", has three recommended drivers, Jedis, lettuce and Redisson... Jedis is your ""small, lightweight and fast"" idiomatic Redis driver. The single Jedis instance isn't thread-safe but is usable. For thread safety, you need to create a statically stored JedisPool and fill it with Jedis instances. There's no asynchronous support but there is support for sharding over multiple Redis servers. Lettuce does claim to be thread-safe and able to service multiple threads with one connection as long as an app doesn't block the connection. It includes support for asynchronous and reactive APIs to deal with those blocking commands. It's a very idiomatic driver with a vast hierarchy of classes representing commands and results. Redisson is probably one of the most interesting of the Java clients. It sets out to create distributed data structures and services that are backed by Redis. This means you can create a Map or Set locally that is synchronized with a Redis server without marshalling the data in and out of appropriate Java objects. With a rich set of integrations, support for many services and codecs and an Apache license, it's one to look at if you want a higher level interface. Find more at the git repository . PHP (3) - PHPREDIS AND PREDIS Phpredis is a C-based extension for PHP, while Predis is a pure PHP client. Both are recommended and actively maintained. Phpredis offers better performance but usually can't be installed on hosts where the user has no shell access. Predis, as a pure PHP client doesn't have that issue, but doesn't offer the very high performance that phpredis could offer. That said, many applications don't need that high a level of performance. PYTHON (4) - REDIS-PY Pythonic access to Redis is but a ""pip install redis"" away with redis-py . 
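As a quick illustration that is not part of the original article, typical redis-py usage looks like the following sketch; the host, port and key names are placeholders.

# Minimal redis-py sketch: connect, then exercise a few core commands.
# Host, port and key names are placeholders for illustration only.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

r.set("greeting", "hello")          # SET greeting hello
print(r.get("greeting"))            # b'hello' (bytes by default)

r.incr("page:views")                # INCR page:views
r.hset("user:1001", "name", "Ada")  # HSET user:1001 name Ada
print(r.hgetall("user:1001"))       # {b'name': b'Ada'}

# Pipelines batch several commands into a single round trip.
pipe = r.pipeline()
pipe.incr("page:views").expire("page:views", 3600)
print(pipe.execute())               # e.g. [2, True]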
It's notable for its extensive Readme.rst file which makes you aware of all the deviations from the Redis commands and is explicit over thread-safety issues and other pitfalls you could run into with the library. C# (5) - SERVICESTACK.REDIS AND STACKEXCHANGE.REDIS ServiceStack.Redis is a recommended driver, but that could well change as the most recent version, v4, is now a commercial product with a free tier and after hitting 6000 Redis requests in an hour ( or other limitations ), it will start generating exceptions requiring an upgrade. If you are in the market for a commercially supported Redis driver for C#, check it out, otherwise, your next stop is ... StackExchange.Redis was developed by StackExchange as a ""logical successor"" to an earlier driver called BookSleeve , StackExchange.Redis is an MIT-licensed driver which includes support for clusters, shared connections and coverage of the full redis feature set. With regular updates and a range of programming models, it's the more recommendable recommended driver for C#. RUBY (5) - REDIS-RB Redis-rb is exactly what you expect from the Ruby community - a one-to-one idiomatic mapping of Redis functionality which maintains Ruby's idioms and pragmatism. Redis-rb is all Ruby using Ruby's socket library for connections, but can also be setup to use the hiredis C driver (see the next entry) for better performance with large objects. C (9) - HIREDIS If you positively have to code in C and want the best performance from your driver, then hiredis is the foundation you are going to want to build on. Surprisingly, it's still yet to release a 1.0.0 version and the last release was a year ago. That said, as minimalist driver, it abstracts Redis communications as a redisCommand() call passing the actual command in a string so Redis command additions don't require changes to hiredis. PERL (13) - REDIS There are a lot of Perl Redis clients with different objectives, as befits the language built around there being ""more than one way to do it"", but there's only one recommended Perl client and that's Redis . If you dig into the docs , you'll find a client that idiomatically exposes the Redis API, up to but not including Redis 3.2 features. There are also modules to tie Redis Hashes and Lists into Perl Hashes and Arrays and offer sentinel support. GO (15) - RADIX AND REDIGO (AND GO-REDIS) Go is a rapidly moving ecosystem and there's an interesting mix of drivers out there. Radix is an example of the minimalist style of driver, with a non-thread-safe Redis connection which can be made safer with its own pool, sentinel and cluster implementations. Redigo is the other recommended driver, and that also offers a minimalist driver with similar features - external projects that offer the sentinel and cluster clients support. Oddly, there's no recommended idiomatic driver, so allow us to informally recommend go-redis . It's an actively developed driver with cluster and sentinel support and has interesting additional features like rate limiting and distributed locking. HASKELL (16) - HEDIS For Haskell developers, there's only one recommended and actively maintained Redis client and that's hedis . The documentation has it as a full idiomatic driver for the Redis 2.6 command set though there are at least some commands from later Redis versions implemented. It also exposes its low level API giving the user the flexibility of a minimalist driver. CLOJURE (20) - CARMINE Clojure developers have one choice in Redis client support and that's carmine . 
It's another rich idiomatic driver with support for 2.6 and later features and adds its own capabilities such as distributed locks, raw binary handling and easy message queues. Redis commands are exposed as Clojure functions, and - here's the neat part - generated by using the official Redis command reference so it's always up to date and documented. We run out of Redmonk ratings - they reasonably stop at 21 places (after ties), but there are still more recommended clients. Switching to alphabetical order we have: CRYSTAL (-) AND CRYSTAL-REDIS For Crystal developers, the crystal-redis package is the only option we know of. It has an idiomatic style API which appears to be up to but not including Redis 3.0. DART (-) AND DARTREDISCLIENT The DartRedisClient seems to have stalled in development. As a Redis client for Dart , Google's JavaScript alternative, the library reached a version 0.1 last year and there have been no commits since. That said, the 0.1 version offers an idiomatic API which returns Futures for async/non-blocking functionality. ERLANG (-) AND EREDIS Erlang developers are recommended Eredis which is a minimalist non-blocking library, with support for pipelining and auto-reconnection, but no support for sentinels or clustering. LUA (-) AND REDIS-LUA The Redis-lua library has support for commands up to, and including, Redis 2.6 in an idiomatic API but hasn't been updated since 2014. RUST (-) AND REDIS-RS The Rust libraray for Redis, Redis-rs , is being actively developed and strikes a half-way house between idiomatic and minimalist - there is some high-level functionality but it's only for commonly used features, but developers are free to fall back to using the low-level API to construct any Redis commands they wish. There are also features limited by what's currently implemented in the languge - these are detailed in the documentation . SCALA (-) AND SCALA-REDIS The scala-redis library is actively being developed and its more recent work has brought support for Redis 3.2's GEO commands among other things. It works with native Scala types and is not a wrapper around a Java client. It's a blocking client but has a pool and asynchronous futures built on top of that. And it's idiomatic. Scala developers are not short of Redis client alternatives, but scala-redis seems to cover most core requirements. AND THERE THE TOUR ENDS Hopefully, you'll come away with a good feeling about the range of languages covered by the Redis community's driver work. From strictly minimalist drivers that cover the protocol with a thin veneer of essential code, to rich idiomatic libraries designed to make Redis a natural fit to the language in use, there's a lot of ground that is covered. Remember, we've only touched all too briefly on the recommended drivers and there's a whole lot more that are not recommended but worth investigating. We'll be taking a deep dive on some of these Redis libraries in the future. We invite anyone who is knowledgable about a Redis driver to check out our Write Stuff page where you can earn cash and database credits. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. 
Image by ESO.org Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets.",A run-down of Redis drivers for the most popular programming languages.,A tour of the Redis stars,Live,220 623,"DATA ANALYTICS HOW SMART CATALOGS CAN TURN THE BIG DATA FLOOD INTO AN OCEAN OF OPPORTUNITY August 1, 2017 | Written by: Jay Limburn Categorized: Data Analytics One of the earliest documented catalogs was compiled at the great library of Alexandria in the third century BC, to help scholars manage, understand and access its vast collection of literature. While that cataloging process represented a massive undertaking for the Alexandrian librarians, it pales in comparison to the task of wrangling the volume and variety of data that modern organizations generate. Nowadays, data is often described as an organization's most valuable asset, but unless users can easily sift through data artifacts to find the information they need, the value of that data may remain unrealized. Catalogs can solve this problem by providing an indexed set of information about the organization's data, storing metadata that describes all assets and providing a reference to where they can be found or accessed. It's not just the size and complexity of the data that makes cataloging a tough challenge: organizations also need to be able to perform increasingly complicated operations on that data at high speed, and even in real-time. As a result, technology leaders must continually find better ways to solve today's version of the same cataloging challenges faced in Alexandria all those years ago. ENTER IBM IBM's aim with Watson Data Platform is to make data accessible for anyone who uses it. An integral part of Watson Data Platform will be a new intelligent asset catalog, IBM Data Manager, a solution underpinned by a central repository of metadata describing all the information managed by the platform. Unlike many other catalog solutions on the market, the intelligent asset catalog will also offer full end-to-end capabilities around data lifecycle and governance.
Because all the elements of Watson Data Platform can utilize the same catalog, users will be able to share data with their colleagues more easily, regardless of what the data is, where it is stored, or how they intend to use it. In this way, the intelligent asset catalog will unlock the value held within that data across user groups—helping organizations use this key asset to its full potential. BREAKING DOWN SILOS With Watson Data Platform, data engineers, data scientists and other knowledge workers throughout an enterprise can search for, share and leverage assets (including datasets, files, connections, notebooks, data flows, models and more). Assets can be accessed using the Data Science Experience web user interface to analyze data, To collaborate with colleagues, users can put assets into a Project that acts as a shared sandbox where the whole team can access and utilize them. Once their work is complete, they can submit any resulting content to the catalog for further reuse by other people and groups across the organization. Rich metadata about each asset makes it easy for knowledge workers to find and access relevant resources. Along with data files, the catalog can also include connections to databases and other data sources, both on- and off-premises, giving users a full 360-degree view to all information relevant to their business, regardless of where or how it is stored. MANAGING DATA OVER TIME It’s important to look at data as an evolving asset, rather than something that stays fixed over time. To help manage and trace this evolution, IBM Data Manager will keep a complete track of which users have added or modified each asset, so that it is always clear who is responsible for any changes. SMART CATALOG CAPABILITIES FOR BIG DATA MANAGEMENT The concept of catalogs may be simple, but when they’re being used to make sense of huge amounts of constantly changing data, smart capabilities make all the difference. Here are some of the key smart catalog functionalities that we see as integral to tackling the big data challenge, and that we will be aiming to include in upcoming releases of IBM Data Manager. DATA AND ASSET TYPE AWARENESS When a user chooses to preview or view an asset of a particular type, the data and asset type awareness feature will automatically launch the data in the best viewer—such as a shaper for a dataset, or a canvas for a data flow. This will save time and boost productivity for users, optimizing discovery and making it easier to work with a variety of data types without switching tools. INTELLIGENT SEARCH AND EXPLORATION By combining metadata, machine learning-based algorithms and user interaction data, it is possible to fine-tune search results over time. Presenting users with the most relevant data for their purpose will increase usefulness of the solution the more it is used. SOCIAL CURATION Effective use of data throughout your organization is a two-way street: when users discover a useful dataset, it’s important for them to help others find it too. Users can be encouraged to engage by taking advantage of curation features, enabling them to tag, rank and comment on assets within the catalog. By augmenting the metadata for each asset, this can help the catalog’s intelligent search algorithms guide users to the assets that are most relevant to their needs. DATA LINEAGE If data is incomplete or inaccurate, utilizing it can cause more problems than it solves. 
On the other hand, if data is accurate but users do not trust it, they might not use it when it could make a real difference. In either scenario, data lineage can help. Data lineage captures the complete history of an asset in the catalog: from its original source, through all the operations and transformations it has undergone, to its current state. By exploring this lineage, users can be confident they know where assets have come from, how those assets have evolved, and whether they can be trusted. MONITORING Taking a step back to a higher-level view, monitoring features will help users keep track of overall usage of the catalog. Real-time dashboards help chief data officers and other data professionals monitor how data is being used, and identify ways to increase its usage in different areas of the organization. METADATA DISCOVERY We have already mentioned that data needs to be seen as an evolving asset—which means our catalogs must evolve with it. We plan to make it easy for users to augment assets with metadata manually; in the future, it may also be possible to integrate algorithms that can discover assets and capture their metadata automatically. DATA GOVERNANCE For many organizations, keeping data secure while ensuring access for authorized users is one of the most significant information management challenges. You can mitigate this challenge with rule-based access control and automatic enforcement of data governance policies. APIS Finally, the catalog will enable access to all these capabilities and more through a set of well-defined, RESTful APIs. IBM is committed to offering application developers easy access to additional components of Watson Data Platform, such as persistence stores and data sets. We hope that they can use our services to extend their current suite of data and analytics tools, to innovate and create smart new ways of working with data. In our next post, we'll discuss the challenges around data governance, and explore how IBM Data Manager can help you make light work of addressing them.","When used to make sense of huge amounts of constantly changing data, smart catalog capabilities can make all the difference.",How smart catalogs can turn the big data flood into an ocean of opportunity,Live,221 627,"Glynn Bird, Developer Advocate @ IBM Watson Data Platform. Views are my own etc.
AUTHENTICATION FOR CLOUDANT ENVOY APPS, PART III ADDING TWITTER AUTHENTICATION For those familiar with the Apache CouchDB ecosystem, Cloudant Envoy is a microservice that serves out your static application and behaves as a replication target for your one-database-per-user application. Simply build an application that writes data locally using PouchDB or Cloudant Sync, and Envoy will ensure that each user's data is stored in a single Cloudant database, with each user's data carefully segregated. For more background on Cloudant Envoy, I have a write-up over on Offline Camp: Scaling Offline First with Envoy (medium.com). So far I've been looking at Envoy apps that have been generating their own users. Save a document in the envoyusers database, and Envoy will use that information for subsequent authentication requests. But what if you want users to sign up with Facebook/Google/Twitter/etc? How can Envoy integrate with social media's federated login? In previous blog posts, I showed you how to add Facebook authentication to an Envoy app and then how to make the app Offline-First: Authentication for Cloudant Envoy Apps, Part I: Adding Facebook Authentication (medium.com); Authentication for Cloudant Envoy Apps, Part II: Make Your App Offline First (medium.com). For this post, we'll focus on Twitter, and I'll show you how to use Twitter as an authentication option. Let's go! PASSPORT TO THE RESCUE (AGAIN!) The PassportJS project (http://passportjs.org/) solves 95% of the problem for us, which we also demonstrated in Part 1 of this series. It has several modules, each handling authentication for a third-party partner. Although Envoy doesn't use Passport out of the box, you can create an Envoy app that does. Here's how. CREATE A TWITTER APP Visit the Twitter Application Management page and create a new Twitter app to handle authentication for you. It need only have read-only access to your users' profiles; we aren't going to be tweeting on behalf of your app's users. Once the app is created, two keys will be generated: * Consumer Key (API Key) * Consumer Secret (API Secret) Make a note of these values as we'll need to inject them into your code. CREATE AN ENVOY APP Let's create an Envoy app. These are the same steps we followed in Part 1 of this series on adding Facebook Authentication. No need to recreate this app if you're still working from the same sample application. In a new directory, type npm init and follow the on-screen prompts. This will create a template package.json file for you. Then we can add the modules we're going to need for this project: npm install --save cloudant-envoy Create some static content: mkdir public echo ""<h1>Hello World</h1>
"" > public/index.html The layout of an Envoy app is pretty simple — create an app.js : Once you’ve saved that file in the project directory you can then run the app: export COUCH_HOST=https://myusername:mypassword@myhost.cloudant.com node app.js Note: Envoy assumes the Cloudant URL will be in a COUCH_HOST environment variable. Replace myusername , mypassword and myhost with your own Cloudant account details. We now have a web server serving out our own static content which also acts as a replication target for PouchDB/CouchDB/Cloudant/Cloudant-Sync clients. ADD TWITTER AUTHENTICATION We’ll need some extra modules to handle Twitter authentication: npm install --save passport npm install --save passport-twitter npm install --save uuid npm install --save express npm install --save crypto-js Then we need to add some custom endpoints into our app to handle the authentication process. * GET /_twitter — Hitting this endpoint in your browser will bounce the user to Twitter and ask them to authenticate. * GET /_twitter/callback — After logging into the Twitter website, it will bounce the browser to this URL to allow us to access the user’s profile. We implement a getOrCreateUser function, which checks if Envoy knows about this user already. If not, a new user is created. Envoy’s user model is very simple: add a document to its users database (default name envoyusers ) to allow someone to replicate. Envoy provides some helper functions for you: * envoy.auth.getUser(userid, callback) — to fetch a user by userid * envoy.auth.newUser(userid, password, metaobject, callback) — to create a new user We need to run the app, passing in our app’s CLIENT_ID and CLIENT_SECRET environment variables we got when we created the Twitter integration: export TWITTER_API_KEY=1234567 export TWITTER_API_SECRET=abc123456 export COUCH_HOST=https://myusername:mypassword@myhost.cloudant.com node app.js Here’s the source code: HOW DO WE COMMUNICATE THE USER CREDENTIALS TO THE CLIENT SIDE? We know the ID and password of our user — not the Twitter username and password—the Envoy username and password. But how can we send that data to the client side? A simple way is to bounce the browser to a URL with the credentials in the query string: http://mypretenddomain.com/bounce.html?username=999888777&password=9886f37a-725e-4096-be67-ff2aba2acb68 We could write some client-side JavaScript to parse the query string, extract the username and password and store it locally. A safer way would be to create a single-use token and pass that in the query string. http://mypretenddomain.com/bounce.html?token=696ad23c375b4aa4acce97734fa2ea4f In this case the client-side code needs to extract the token, make a call back to the server to exchange the token for the username and password and then store the credentials locally. This is more secure as the token can be made to expire on use and have a built-in time limit. Here’s some simple client-side code to extract and decode the token, ultimately saving the user details in a local PouchDB document. Local documents are never transmitted during replication; they only remain on the device they are created: MAKING YOUR APP Now the client side app has the Envoy credentials (in a PouchDB document whose ID is _local/user ), we can set about building an app that reads and writes data to its local PouchDB database and replicates its data to and/or from your Envoy service using the credentials provided. 
var db = new PouchDB('mydb');
db.get('_local/user').then(function(loggedinuser) {
  var url = window.location.origin.replace('//', '//' + loggedinuser.username + ':' + loggedinuser.meta.password + '@');
  url += '/envoy';
  // sync live with retry, animating the icon when there's a change
  var remote = new PouchDB(url);
  db.replicate.to(remote).on('change', function(c) {
    console.log('change', c);
  });
});
(Figure: https://apps.twitter.com/) I hope you enjoyed this 3-part series. Together, we've built a static application that writes data locally using PouchDB, and you've used Cloudant Envoy to synchronize user data to a remote Cloudant database. Using PassportJS to handle authentication, your users can now sign up to use your app using their own Facebook and Twitter credentials. We also reviewed how to make an Offline-First app with Cloudant Envoy using a Progressive Web Application. Until next time! Thanks to Maureen McElaney and Mike Broberg.",I'm going to show you how to deploy a live application to a Cloudant server and add the ability to use Twitter authentication. Future articles cover Offline First & Facebook auth.,"Authentication for Cloudant Envoy Apps, Part III – IBM Watson Data Lab",Live,222 639,"Greg Filla, Product manager & Data scientist — Data Science Experience and Watson Machine Learning Dec 14 USING BIGDL IN DATA SCIENCE EXPERIENCE FOR DEEP LEARNING ON SPARK Huge thanks for the contributions from Yulia Tell and Yuhao Yang from Intel and Roland Weber from IBM in making this integration possible! Deep Learning has become one of the most popular techniques used in the field of Machine Learning in recent years. The Data Science Experience (DSX) team has been excited about deep learning since before launching last year (we have a couple blogs on this topic: DL trends, Using DL in DSX). As a data science platform, we make it easy to scale your analysis by providing a Spark cluster for all users. Whether working in notebooks or RStudio in DSX you have access to connect to this cluster to distribute workloads. Until recently, Spark batch processing was not used for Deep Learning since it required a lot of effort to optimize Spark's compute engine for training deep neural networks. This is where Intel comes in, with their big data deep learning framework called BigDL. This blog will explain what BigDL is and how it can be used in Data Science Experience. WHAT IS BIGDL? BigDL is a distributed deep learning framework for Apache Spark that was developed by Intel and contributed to the open source community for the purposes of uniting big data processing and deep learning (check out https://github.com/intel-analytics/BigDL ). Built on the highly scalable Apache Spark platform, BigDL can be easily scaled out to hundreds or thousands of servers.
In addition, BigDL uses Intel® Math Kernel Library (Intel® MKL) and parallel computing techniques to achieve very high performance on Intel® Xeon® processor-based servers (comparable to mainstream GPU performance). BigDL helps make deep learning more accessible to the big data community by allowing developers to continue using familiar tools and infrastructure to build deep learning applications. BigDL provides support for various deep learning models (for example, object detection, classification, and so on); in addition, it also lets us reuse and migrate pre-trained models (in Caffe, Torch*, TensorFlow*, and so on), which were previously tied to specific frameworks and platforms, to the general purpose big data analytics platform through BigDL. As a result, the entire application pipeline can be fully optimized to deliver significantly accelerated performance. As the following diagram shows, BigDL is implemented as a library on top of Spark, so that users can write their deep learning applications as standard Spark programs. As a result, BigDL can be seamlessly integrated with other libraries on top of Spark — Spark SQL and DataFrames, Spark ML pipelines, Spark Streaming, Structured Streaming, etc. — and can run directly on top of existing Spark or Hadoop clusters. Highlights of the BigDL v0.3.0 release Since its initial open source release in December 2016, BigDL has been used to build applications for fraud detection, recommender systems, image recognition, and many other purposes. The recent BigDL v0.3.0 release addresses many user requests, improving usability and additional new features and functionality: • New layers support • RNN encoder-decoder (sequence-to-sequence) architecture • Variational auto-encoder • 3D de-convolution • 1D convolution and pooling • Model quantization support • Quantize existing (BigDL, Caffe, Torch or TensorFlow) model • Converting float points to integer for model inference (for model size reduction & inference speedup) • Sparse tensor and layers — Efficient support of sparse data -------------------------------------------------------------------------------- BIGDL ON DSX: A PERFECT FIT Since notebooks in DSX are already executed on a Spark cluster, it is very easy to get up and running with BigDL. The only tool you need to get started is a Data Science Experience notebook. Follow the steps below to install BigDL and confirm it is working. In future posts, we will show tutorials using BigDL on DSX. Installation Guide for BigDL within IBM DSX This section was written by Roland Weber in this StackOverflow post. You can follow along with this notebook to get up and running with BigDL in DSX. If your notebooks are backed by an Apache Spark as a Service instance in DSX, installing BigDL is simple. But you have to collect some version information first. 1. Which Spark version? Currently, 2.1 is the latest supported by DSX. With Python, you can only install BigDL for one Spark version per service. 2. Which BigDL version? Currently, 0.3.0 is the latest, and it supports Spark 2.1. If in doubt, check the download page . The Spark fixlevel does not matter. With this information, you can determine the URL of the required BigDL JAR file in the Maven repository. For the example versions, BigDL 0.3.0 with Spark 2.1, the download URL is https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar For other versions, replace 0.3.0 and 2.1 in that URL as required. 
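As a small illustration that is not part of the original guide, the download URL can be assembled from the two version strings in Python; the helper function name is hypothetical and the pattern simply follows the Maven URL quoted above.

# Illustrative helper: build the BigDL "jar-with-dependencies" URL on Maven Central
# from a Spark version and a BigDL version, following the pattern quoted above.
def bigdl_jar_url(spark_version="2.1", bigdl_version="0.3.0"):
    base = "https://repo1.maven.org/maven2/com/intel/analytics/bigdl"
    artifact = "bigdl-SPARK_{sv}".format(sv=spark_version)
    jar = "{a}-{bv}-jar-with-dependencies.jar".format(a=artifact, bv=bigdl_version)
    return "{base}/{a}/{bv}/{jar}".format(base=base, a=artifact, bv=bigdl_version, jar=jar)

print(bigdl_jar_url())                 # the Spark 2.1 / BigDL 0.3.0 URL shown above
print(bigdl_jar_url("2.1", "0.2.0"))   # same pattern for another release; check the
                                       # download page for which combinations exist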
Note that both versions appear twice, once in the path and once in the filename. Installing for Python You need the JAR, and the matching Python package. The Python package depends only on the version of BigDL, not on the Spark version. The installation steps can be executed from a Python notebook: 1. Install the JAR. !(export sv=2.1 bv=0.3.0 ; cd ~/data/libs/ && wget https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_${sv}/${bv}/bigdl-SPARK_${sv}-${bv}-jar-with-dependencies.jar) Here, the versions of Spark (sv) and BigDL (bv) are defined as environment variables, so you can easily adjust them without having to change the URL. 2. Install the Python module. !pip install bigdl==0.3.0 | cat If you want to switch your notebooks between Python versions, execute this step once with each Python version. After restarting the notebook kernel, BigDL is ready for use. (Not) Installing for Scala If you install the JAR as described above for Python, it is also available in Scala kernels. If you want to use BigDL exclusively with Scala, better not install the JAR at all. Instead, use the %AddJar magic at the beginning of the notebook. It’s best to do this in the very first code cell, to avoid class loading issues. %AddJar https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar By not installing the JAR, you gain the flexibility of using different versions of Spark and BigDL in different Scala notebooks sharing the same service. As soon as you install a JAR, you’re likely to run into conflicts between that one and the one you pull in with %AddJar. -------------------------------------------------------------------------------- Hopefully after following along with those instructions you are ready to start using BigDL to train deep nets on Spark in DSX! If you prefer a Python notebook with all these steps you can copy this notebook written by the DSX development team . You can copy this notebook directly into a DSX project using the copy icon in the top right; this will let you start running the code in your Spark cluster in Data Science Experience. This notebook also gives you some code to start using the BigDL framework. Stay tuned for a follow up post showing how to train models with BigDL. If you are interested to see examples of training models for fraud detection, sentiment analysis and others with BigDL, feel free to check out BigDL model zoo at https://github.com/intel-analytics/analytics-zoo . * Machine Learning * Bigdl * Data Science * Deep Learning * Dsx One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingGREG FILLA Product manager & Data scientist — Data Science Experience and Watson Machine Learning FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates",This blog will explain what BigDL is and how it can be used in Data Science Experience (DSX).,Using BigDL in DSX for Deep Learning on Spark,Live,223 641,"SHIFTING SANDS A man with a hammer PAGES * Home * About Me MONDAY, DECEMBER 17, 2012 USING APPLY, SAPPLY, LAPPLY IN R This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions. 
There is a part 2 coming that will look at density plots with ggplot , but first I thought I would go on a tangent to give some examples of the apply family, as they come up a lot working with R. I have been comparing three methods on a data set. A sample from the data set was generated, and three different methods were applied to that subset. I wanted to see how their results differed from one another. I would run my test harness which returned a matrix. The columns values were the metric used for evaluation of each method, and the rows were the results for a given subset. We have three columns, one for each method, and lets say 30 rows, representing 30 different subsets that the three methods were applied to. It looked a bit like this method1 method2 method3 [1,] 0.05517714 0.014054038 0.017260447 [2,] 0.08367678 0.003570883 0.004289079 [3,] 0.05274706 0.028629661 0.071323030 [4,] 0.06769936 0.048446559 0.057432519 [5,] 0.06875188 0.019782518 0.080564474 [6,] 0.04913779 0.100062929 0.102208706 We can simulate this data using rnorm , to create three sets of observations. The first has mean 0, second mean of 2, third of mean of 5, and with 30 rows. m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3) APPLY When do we use apply? When we have some structured blob of data that we wish to perform operations on. Here structured means in some form of matrix. The operations may be informational, or perhaps transforming, subsetting, whatever to the data. As a commenter pointed out, if you are using a data frame the data types must all be the same otherwise they will be subjected to type conversion. This may or may not be what you want, if the data frame has string/character data as well as numeric data, the numeric data will be converted to strings/characters and numerical operations will probably not give what you expected. Needless to say such circumstances arise quite frequently when working in R, so spending some time getting familiar with apply can be a great boon to our productivity. Which actual apply function and which specific incantion is required depends on your data, the function you wish to use, and what you want the end result to look like. Hopefully the right choice should be a bit clearer by the end of these examples. First I want to make sure I created that matrix correctly, three columns each with a mean 0, 2 and 5 respectively. We can use apply and the base mean function to check this. We tell apply to traverse row wise or column wise by the second argument. In this case we expect to get three numbers at the end, the mean value for each column, so tell apply to work along columns by passing 2 as the second argument. But let's do it wrong for the point of illustration: apply(m, 1, mean) # [1] 2.408150 2.709325 1.718529 0.822519 2.693614 2.259044 1.849530 2.544685 2.957950 2.219874 #[11] 2.582011 2.471938 2.015625 2.101832 2.189781 2.319142 2.504821 2.203066 2.280550 2.401297 #[21] 2.312254 1.833903 1.900122 2.427002 2.426869 1.890895 2.515842 2.363085 3.049760 2.027570 Passing a 1 in the second argument, we get 30 values back, giving the mean of each row. Not the three numbers we were expecting, try again. apply(m, 2, mean) #[1] -0.02664418 1.95812458 4.86857792 Great. We can see the mean of each column is roughly 0, 2, and 5 as we expected. OUR OWN FUNCTIONS Let's say I see that negative number and realise I wanted to only look at positive values. 
Let's see how many negative numbers each column has, using apply again: apply(m, 2, function(x) length(x[x<0])) #[1] 14 1 0 So 14 negative values in column one, 1 negative value in column two, and none in column three. More or less what we would expect for three normal distributions with the given means and sd of 1. Here we have used a simple function we defined in the call to apply , rather than some built in function. Note we did not specify a return value for our function. R will magically return the last evaluated value. The actual function is using subsetting to extract all the elements in x that are less than 0, and then counting how many are left are using length . The function takes one argument, which I have arbitrarily called x . In this case x will be a single column of the matrix. Is it a 1 column matrix or a just a vector? Let's have a look: apply(m, 2, function(x) is.matrix(x)) #[1] FALSE FALSE FALSE Not a matrix. Here the function definition is not required, we could instead just pass the is.matrix function, as it only takes one argument and has already been wrapped up in a function for us. Let's check they are vectors as we might expect. apply(m, 2, is.vector) #[1] TRUE TRUE TRUE Why then did we need to wrap up our length function? When we want to define our own handling function for apply, we must at a minimum give a name to the incoming data, so we can use it in our function. apply(m, 2, length(x[x<0])) #Error in match.fun(FUN) : object 'x' not found We are referring to some value x in the function, but R does not know where that is and so gives us an error. There are other forces at play here, but for simplicity just remember to wrap any code up in a function. For example, let's look at the mean value of only the positive values: apply(m, 2, function(x) mean(x[x>0])) #[1] 0.4466368 2.0415736 4.8685779 USING SAPPLY AND LAPPLY These two functions work in a similar way, traversing over a set of data like a list or vector, and calling the specified function for each item. Sometimes we require traversal of our data in a less than linear way. Say we wanted to compare the current observation with the value 5 periods before it. Use can probably use rollapply for this (via quantmod), but a quick and dirty way is to run sapply or lapply passing a set of index values. Here we will use sapply , which works on a list or vector of data. sapply(1:3, function(x) x^2) #[1] 1 4 9 lapply is very similar, however it will return a list rather than a vector: lapply(1:3, function(x) x^2) #[[1]] #[1] 1 # #[[2]] #[1] 4 # #[[3]] #[1] 9 Passing simplify=FALSE to sapply will also give you a list: sapply(1:3, function(x) x^2, simplify=F) #[[1]] #[1] 1 # #[[2]] #[1] 4 # #[[3]] #[1] 9 And you can use unlist with lapply to get a vector. unlist(lapply(1:3, function(x) x^2)) #[1] 1 4 9 However the behviour is not as clean when things have names, so best to use sapply or lapply as makes sense for your data and what you want to receive back. If you want a list returned, use lapply . If you want a vector, use sapply . DIRTY DEEDS Anyway, a cheap trick is to pass sapply a vector of indexes and write your function making some assumptions about the structure of the underlying data. Let's look at our mean example again: sapply(1:3, function(x) mean(m[,x])) [1] -0.02664418 1.95812458 4.86857792 We pass the column indexes (1,2,3) to our function, which assumes some variable m has our data. Fine for quickies but not very nice, and will likely turn into a maintainability bomb down the line. 
We can neaten things up a bit by passing our data in an argument to our function, and using the … special argument which all the apply functions have for passing extra arguments: sapply(1:3, function(x, y) mean(y[,x]), y=m) #[1] -0.02664418 1.95812458 4.86857792 This time, our function has 2 arguments, x and y . The x variable will be as it was before, whatever sapply is currently going through. The y variable we will pass using the optional arguments to sapply . In this case we have passed in m , explicitly naming the y argument in the sapply call. Not strictly necessary but it makes for easier to read & maintain code. The y value will be the same for each call sapply makes to our function. I don't really recommend passing the index arguments like this, it is error prone and can be quite confusing to others reading your code. I hope you found these examples helpful. Please check out part 2 where we create a density plot of the values in our matrix. If you are working with R, I have found this book very useful day-to-day R Cookbook (O'Reilly Cookbooks) Posted by Pete at 11:43 PM Email This BlogThis! Share to Twitter Share to Facebook Share to Pinterest Labels: apply , lapply , R , sapply7 COMMENTS: 1. Joshua Ulrich December 19, 2012 at 4:12 AMYou suggest using apply() on a matrix or data.frame, but it's very important to note that apply() always coerces its first argument to a matrix/array. This is important because a matrix/array can only contain a single atomic type, whereas a data.frame can contain columns of varying types/classes. When a data.frame is converted to a matrix, it will be converted to the highest atomic type of any of the columns of the data.frame (e.g. if the data.frame has 9 numeric columns and 1 character column, it will be converted to a 10 column character matrix). Reply Delete Replies 1. Pete December 22, 2012 at 7:54 PMHi Joshua, thank you I was not fully aware of that, and it has bitten me in the past as well. I have updated the post. Thanks for stopping by, nice to see you here! Delete 2. Reply 2. Selva Prabhakaran May 25, 2014 at 11:52 PMGreat post! Thank you so much for sharing.. For those who want to learn R Programming, here is a great new course on youtube for beginners and Data Science aspirants. The content is great and the videos are short and crisp. New ones are getting added, so I suggest to subscribe. https://www.youtube.com/watch?v=BGWVASxyow8&list=PLFAYD0dt5xCzTQHDhMPZwBoaAXWeVhZzg&index=19 Reply Delete 3. Adrian August 1, 2015 at 7:54 AMThis comment has been removed by the author. Reply Delete 4. Adrian August 1, 2015 at 7:56 AMThank you for this insightful and practical post. In the 3rd to last paragraph, you mentioned that you do not recommend passing the index argument in the way you just demonstrated. So, what method would you recommend? Reply Delete Replies 1. Pete January 7, 2016 at 1:08 AMIn general, instead of passing the indexes to use, I would try pass the data itself and let the internals of apply do the subsetting and make the function operate on that data, vs subsetting the data manually in the apply function we pass in. This isn't always possible though I know, and it is fine to pass indexes really, I am just a bit uptight about it I think. Thanks for you comment though and sorry for the delayed reply, I always have trouble posting comments on blogger! Delete 2. Reply 5. 
Daniel Maartens May 7, 2016 at 8:31 AM Hi Pete, In your second paragraph under ""using sapply and lapply"" you are trying to tell us why we might want to use sapply and lapply instead of apply because we might ""require traversal of our data in a less than linear way"" and that we also might want to ""compare the current observation with the value 5 periods before it."" However, in your subsequent answer to this problem you raised you only give us an alternative way of doing the exact same calculation you did using the apply() method (i.e. testing the means of the three rnorm-methods). Could you please provide an example highlighting how the use of sapply or lapply would enable me to traverse through data in a less than linear way and allow me to compare a current observation with a value 5 periods before it in a way that the apply() cannot? Please note that I am still a beginner in R. Thanks in advance :)","This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions.","Using apply, sapply, lapply in R",Live,224 644,"A VISUAL EXPLANATION OF THE BACK PROPAGATION ALGORITHM FOR NEURAL NETWORKS Tags: Algorithms, Backpropagation, Machine Learning, Neural Networks A concise explanation of backpropagation for neural networks is presented in elementary terms, along with explanatory visualization.
By Sebastian Raschka, Michigan State University. Let's assume we are really into mountain climbing and, to add a little extra challenge, we cover our eyes this time so that we cannot see where we are or whether we have already accomplished our "objective," that is, reaching the top of the mountain. Since we can't see the path up front, we let our intuition guide us: assuming that the mountain top is the "highest" point of the mountain, we think that the steepest path leads us to the top most efficiently. We approach this challenge by iteratively "feeling" around us and taking a step in the direction of the steepest ascent; let's call it "gradient ascent." But what do we do if we reach a point where we can't ascend any further, i.e., where each direction leads downwards? At this point, we may have already reached the mountain's top, but we could just have reached a smaller plateau ... we don't know. Essentially, this is just an analogy for gradient ascent optimization (basically the counterpart of minimizing a cost function via gradient descent). However, this is not specific to backpropagation; it is just one way to minimize a convex cost function (if there is only a global minimum) or a non-convex cost function (which has local minima, like the "plateaus" that let us think we have reached the mountain's top). Using a little visual aid, we could picture a non-convex cost function with only one parameter (where the blue ball is our current location) as follows: Now, backpropagation is just back-propagating the cost over multiple "levels" (or layers). For example, if we have a multi-layer perceptron, we can picture forward propagation (passing the input signal through the network while multiplying it by the respective weights to compute an output) as follows: And in backpropagation, we "simply" backpropagate the error (the "cost" that we compute by comparing the calculated output with the known, correct target output), which we then use to update the model parameters. Pre-calculus may be a while back, but it is essentially all based on the simple chain rule that we use for nested functions.
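The chain-rule equation itself appeared as an image in the original article; a standard statement of it (a reconstruction following the usual layer-wise notation, not the original figure) is:

```latex
\frac{\partial}{\partial x} f\big(g(x)\big) \;=\; \frac{\partial f}{\partial g}\cdot\frac{\partial g}{\partial x},
\qquad
\frac{\partial L}{\partial W^{(1)}} \;=\; \frac{\partial L}{\partial a^{(2)}}\cdot
\frac{\partial a^{(2)}}{\partial a^{(1)}}\cdot
\frac{\partial a^{(1)}}{\partial W^{(1)}},
```

where L is the cost, a^(l) are the layer activations, and W^(1) the first-layer weights.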
Instead of doing this "manually," we can use computational tools (so-called "automatic differentiation"), and backpropagation is basically the "reverse" mode of this auto-differentiation. Why reverse and not forward? Because it is computationally cheaper! If we did it forward-wise, we would successively multiply large matrices for each layer until we finally multiplied a large matrix by a vector in the output layer. If we start backwards instead, we begin by multiplying a matrix by a vector, obtain another vector, and so forth. So, I'd say the beauty of backpropagation is that we are doing more efficient matrix-vector multiplications instead of matrix-matrix multiplications.
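To make the matrix-vector picture concrete, here is a minimal numpy sketch of the forward and backward passes for a tiny two-layer network with a squared-error cost. This is an illustration added for this write-up, not code from the article; the layer sizes, sigmoid activation, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=3)        # one input example
t = np.array([0.0, 1.0])      # known, correct target output

# Forward propagation: multiply by the weights layer by layer
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)              # network output
cost = 0.5 * np.sum((a2 - t) ** 2)

# Backpropagation: push the error back through the layers via the chain rule.
# Note that each step is a matrix-vector product, never matrix-matrix.
delta2 = (a2 - t) * a2 * (1 - a2)         # dCost/dz2
grad_W2 = np.outer(delta2, a1)            # dCost/dW2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # dCost/dz1
grad_W1 = np.outer(delta1, x)             # dCost/dW1

# One gradient-descent update of the model parameters
lr = 0.1
W2 -= lr * grad_W2
b2 -= lr * delta2
W1 -= lr * grad_W1
b1 -= lr * delta1
```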
",A Visual Explanation of the Back Propagation Algorithm for Neural Networks,Live,225 646,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Perform Sentiment Analysis * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: It’s That Easy! 
* Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Perform Predictive Analytics and SQL Pushdown * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * REST API * Load delimited data using the REST API and cURL * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API MOVE DATA TO THE CLOUD WITH DASHDB’S MOVETOCLOUD SCRIPTJess Mantaro / July 17, 2015See an easy way to upload files larger than 5GB to a Softlayer Swift cloudobject store using IBM dashDB’s moveToCloud script.You can also read a transcript of this video .Read the tutorial (PDF)RELATED LINKS * Load data from the Cloud into dashDB * Load data from the desktop into dashDBPlease enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM",See an easy way to upload files larger than 5GB to a Softlayer Swift cloud object store using IBM dashDB’s moveToCloud script. ,Move data to the Cloud with dashDB's MoveToCloud script,Live,226 650,Bradley spends some time discussing the different types of NoSQL databases available and why you might choose one type over another.,Bradley spends some time discussing the different types of NoSQL databases available and why you might choose one type over another.,Bradley Holt on NoSQL (Channel 9),Live,227 663,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Perform Sentiment Analysis * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and 
the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: It’s That Easy! * Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Perform Predictive Analytics and SQL Pushdown * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * REST API * Load delimited data using the REST API and cURL * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API PUBLISH APPS THAT USE R ANALYSIS WITH SHINY AND DASHDBJess Mantaro / July 17, 2015Watch how a you can analyze dashDB data with R and publish insights with Shinyand dashDB.You can also read a transcript of this videoRELATED LINKS * Use dashDB with R * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Analyzing with RPlease enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM",Watch how a you can analyze dashDB data with R and publish insights with Shiny and dashDB.,Publish apps that use R analysis with Shiny and dashDB,Live,228 665,"Homepage IBM Watson Follow Sign in Get started * Home * Announcements * Editorials * Tutorials * Code Spotlight * * Build with Watson * Damian Cummins Blocked Unblock Follow Following Software Developer — IBM Watson Data API Apr 9 -------------------------------------------------------------------------------- SERVERLESS DATA FLOW SEQUENCING WITH WATSON DATA API AND IBM CLOUD FUNCTIONS The complete code for this tutorial and other Watson Data API Data Flow samples can be found here . In a previous tutorial , you saw how data flows could be run one after another by polling using a simple shell script. This tutorial demonstrates how to deploy the same functionality as a serverless action. IBM Cloud Functions enable you to deploy a simple, repeatable function and run it periodically by using the alarm package. 
Again, a data flow can read data from a large variety of sources, process that data in a runtime engine using pre-defined operations or custom code, and then write it to one or more targets. For example, if you have two data flows ( data_flow_1 and data_flow_2 ) and you always want to run data_flow_2 after data_flow_1 run completes, you can write an IBM Cloud Function to check the status of the latest data_flow_1 run. If the status is completed, then the function should start a run of data_flow_2 . CREATING A NODE.JS FUNCTION First, clone this repository and run npm install to install the dependencies. Once this completes, be sure to include your project ID and the IDs of the two data flows you want to monitor and run in index.js , for example: // Parameters const projectId = 'c2254fed-404d-4905-9b8c-5102f195cc0d' const dataFlowId1 = '37bd30f0-dd3f-4052-988d-69c8fb2bf40a' // Data Flow Ref to check status of latest run const dataFlowId2 = 'd31116c7-854f-404c-9e7a-de274a8bb2d6' // Data Flow Ref to trigger run for The project ID can be retrieved from the browser URI between /projects/ and /assets in Watson Studio or Watson Knowledge Catalog when viewing the project: Similarly, the data flow ID can be retrieved from the browser URI between /refinery/ and /details in Watson Studio or Watson Knowledge Catalog when viewing the data flow:The main function is the one that will be called each time the action is invoked. The function creates a new authentication token, retrieves the latest run for dataFlowId1 , and then either creates a new dataFlowId2 run or simply returns, depending on the state and completed_date . The function is configured to run every 20 seconds so we will only start a new run for dataFlowId2 if the latest run for dataFlowId1 completed in the last 20 seconds. This is to avoid starting dataFlowId2 every time we retrieve the latest finished run for dataFlowId1 . To deploy this node.js function with IBM Cloud using the IBM Cloud Functions CLI , package it as a .zip archive, including the node_modules , index.js and package.json files. GETTING STARTED WITH IBM CLOUD FUNCTIONS CLI First, follow the instructions here to install the IBM Cloud Functions CLI. In a terminal window, upload the .zip file containing the node.js action as a Cloud Function by using the following command: bx wsk action create packageAction --kind nodejs:default action.zip . You can test the action you have just created manually by using the following command: bx wsk action invoke --blocking --result packageAction . TRIGGER: EVERY-20-SECONDS You can include a trigger that uses the built-in alarm package feed to fire events every 20 seconds. This is specified through cron syntax in the cron parameter. [Optional] The maxTriggers parameter ensures that it only fires for five minutes (15 times), rather than indefinitely. Create the trigger with the following command: bx wsk trigger create every-20-seconds --feed /whisk.system/alarms/alarm --param cron ""*/20 * * * * *"" --param maxTriggers 15 . RULE: INVOKE-PERIODICALLY This rule shows how the every-20-seconds trigger can be declaratively mapped to the packageAction. Create the rule with the following command: bx wsk rule create invoke-periodically every-20-seconds packageAction Next, open a terminal window to start polling the activation log. The console.log statements in the action will be logged here. 
You can stream them with the following command: bx wsk activation poll MONITORING LOGS Before running your data flow, you should see entries similar to the following ones: The first entry shows the IAM authorization token being obtained, the data flow run being retrieved, and the function returning because the entity.summary.completed_date is earlier than the lookback date. At this point, run dataFlowId1 from either Watson Studio or Watson Knowledge Catalog. You can do this using the Refine action for the data flow on the project assets page. The next entry is very similar, but in this case the entity.state is running, so the function returns again. In the final entry, you can see that the run for the data flow with an ID of 37bd30f0-dd3f-4052-988d-69c8fb2bf40a finished, so the data flow with an ID of d31116c7-854f-404c-9e7a-de274a8bb2d6 starts. TO SUMMARIZE… We have created a serverless action that polls the status of a data flow's most recent run and, on completion, runs another data flow. This demonstrates the ability to chain or sequence the running of data flows using Watson Data APIs in the IBM Cloud. Damian Cummins is a Cloud Application Developer with the Data Refinery and IBM Watson teams at IBM. Thanks to Cecelia Shao.","In a previous tutorial, you saw how data flows could be run one after another by polling using a simple shell script. This tutorial demonstrates how to deploy the same functionality as a serverless…",Serverless Data Flow Sequencing with Watson Data API and IBM Cloud Functions,Live,229 667,"DataMiningApps: WEB PICKS (WEEK OF 23 JANUARY 2017). Posted on January 29, 2017. Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources. * Some things I've found help reduce my stress around science “I decided to make a list of things that I've learned through hard experience do not help me with my own imposter syndrome and do help me to feel less stressed out about my science.” * The New Gold Rush? Wall Street Wants your Data If you're one of the many startups sitting on a growing data asset and trying to figure out whether you can make money selling it to Wall Street, this post is for you. * Data Readiness Levels: Turning Data from Palid to Vivid “All these problems arise before modeling even starts.
Both questions and data are badly characterised. This is particularly true in the era of Big Data, where one gains the impression that the depth of data-discussion in many decision making forums is of the form “We have a Big Data problem, do you have a Big Data solution?”, “Yes, I have a Big Data solution.” Of course in practice it also turns out to be a solution that requires Big Money to pay for because in practice no one bothered to scope the nature of the problem, the data, or the solution.” * Data Could Be the Next Tech Hot Button for Regulators “Now data — gathered in those immense pools of information that are at the heart of everything from artificial intelligence to online shopping recommendations — is increasingly a focus of technology competition. And academics and some policy makers, especially in Europe, are considering whether big internet companies like Google and Facebook might use their data resources as a barrier to new entrants and innovation.” * Poker Is the Latest Game to Fold Against Artificial Intelligence Two research groups have developed poker-playing AI programs that show how computers can out-hustle the best humans. * Why go long on artificial intelligence? We are now at the right place and time for AI to be the set of technology advancements that can help us solve challenges where answers reside in data. * 4 trends in security data science for 2017 How bots, threat intelligence, adversarial machine learning, and deep learning are impacting the security landscape. * 8 data trends on our radar for 2017 From deep learning to decoupling, here are the data trends to watch in the year ahead. * 5 Big Predictions for Artificial Intelligence in 2017 Expect to see better language understanding and an AI boom in China, among other things. * High-Speed Traders Are Taking Over Bitcoin Cryptocurreny offers fragmented market, zero transaction fees, risks include hacking thefts, Chinese government crackdown. * king – but why? “word2vec is an algorithm that transforms words into vectors, so that words with similar meaning end up laying close to each other. Moreover, it allows us to use vector arithmetics to work with analogies, for example the famous king – man + woman = queen. I will try to explain how it works, with special emphasis on the meaning of vector differences, at the same time omitting as many technicalities as possible.” * Playing with 80 Million Amazon Product Review Ratings Using Apache Spark “Back then, I was only limited to 1.2M reviews because attempting to process more data caused out-of-memory issues and my R code took hours to run. Apache Spark, which makes processing gigantic amounts of data efficient and sensible, has become very popular in the past couple years. Although data scientists often use Spark to process data with distributed cloud computing via Amazon EC2 or Microsoft Azure, Spark works just fine even on a typical laptop, given enough memory.” * Concrete AI tasks for forecasting This page contains a list of relatively well specified AI tasks designed for forecasting. Currently all entries were used in the 2016 Expert Survey on Progress in AI. Still a lot of challenges up ahead. * The Humans Working Behind the AI Curtain “Just how artificial is Artificial Intelligence? Facebook created a PR firestorm last summer when reporters discovered a human “editorial team” – rather than just unbiased algorithms – selecting stories for its trending topics section. 
The revelation highlighted an elephant in the room of our tech world: companies selling the magical speed, omnipotence, and neutrality of artificial intelligence (AI) often can’t make good on their promises without keeping people in the loop, often working invisibly in the background. So who are the people behind the AI curtain?” * Microsoft touts Deep Learning in SQL Server “Can SQL Server do Deep Learning? The response to this is enthusiastic “yes!” With the public preview of the next release of SQL Server, we’ve added significant improvements into R Services inside SQL Server including a very powerful set of machine learning functions that are used by our own product teams across Microsoft. This brings new machine learning and deep neural network functionality with increased speed, performance and scale to database applications built on SQL Server.” * Rules of Machine Learning: Best Practices for ML Engineering (pdf) This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. Some great pieces of advice in here! * Simulation of empirical Bayesian methods (using baseball statistics) “We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial (success / total) data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?” * SOMBER: Self-Organizing Maps in Numpy somber (Somber Organizes Maps By Enabling Recurrence) is a collection of numpy/python implementations of various kinds of Self-Organizing Maps (SOMS), with a focus on SOMs for sequence data. * Calling Bullshit in the Age of Big Data A not-yet-official course on different aspects of bullshit in the current age. “We feel that the world has become oversaturated with bullshit and we’re sick of it. However modest, this course is our attempt to fight back.” Some great references and tidbits included on the syllabus, worth checking out! * Distributed Pandas on a Cluster with Dask Data “Dask Dataframe extends the popular Pandas library to operate on big data-sets on a distributed cluster. We show its capabilities by running through common dataframe operations on a common dataset.” * Null Hypothesis Significance Testing Never Worked “Much has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals. We seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.” * Artificial intelligence predicts when heart will fail It correctly predicted those who would still be alive after one year about 80% of the time. The figure for doctors is 60%. * Introducing Embedding.js, a Library for Data-Driven Environments “Data and its visual presentation have become central to our understanding of the world, and yet so many visualizations prioritize bling over communication. The fear, and it is justified, is that VR will merely exacerbate the problem, unleashing new and nauseating ways to deliver empty visual calories rather than a meaningful increase in articulative power.” * Game Theory reveals the Future of Deep Learning “A disadvantage of adversarial networks are they are difficult to train. 
Adversarial learning consists in finding a Nash equilibrium to a two-player non-cooperative game. Yann Lecun, in a recent lecture on unsupervised learning, calls adversarial networks the “the coolest idea in machine learning in the last twenty years”.” * From Natural Language Processing to Artificial Intelligence (presentation) Overview of natural language processing (NLP) from both symbolic and deep learning perspectives. Covers tf-idf, sentiment analysis, LDA, WordNet, FrameNet, word2vec, and recurrent neural networks (RNNs). * R and Spark (presentation) Better support is comming, great! * Large scale data processing pipelines at trivago: a use case (presentation) Kafka is used a lot at trivago, these days, together with Impala and R. * Text Mining, the Tidy Way (presentation) January 2017 talk at rstudio::conf by Julia Silge. * AI Alignment: Why It’s Hard, and Where to Start “In this talk, I’m going to try to answer the frequently asked question, “Just what is it that you do all day long?” We are concerned with the theory of artificial intelligences that are advanced beyond the present day, and that make sufficiently high-quality decisions in the service of whatever goals they may have been programmed with to be objects of concern.” * DeepTraffic: a gamified simulation of typical highway traffic. Your task is to build a neural agent Your neural network gets to control one of the cars (displayed in red) and has to learn how to navigate efficiently to go as fast as possible. The car already comes with a safety system, so you don’t have to worry about the basic task of driving – the net only has to tell the car if it should accelerate/slow down or change lanes, and it will do so if that is possible without crashing into other cars. * The state of d3 Voronoi “ given a set of sites in a space, it partitions that space in cells — one cell for each site. Here we explore what our favourite javascript library, d3.js, allows to do with this concept.” * RL2: Fast Reinforcement Learning via Slow Reinforcement Learning (paper) “ however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a “fast” reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL2, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose (“slow”) RL algorithm.” * OpenAI announces support for GTA V in Universe but takes down page afterwards Something strange is going on regarding OpenAI’s announcement of supporting GTA V. A cached version of the page can still be accessed here , and some people have forked the code repository which was also taken offline. The eye-catching demonstration video is still up, however. * WeChat’s App Revolution “Apple Inc. isn’t taking this development lightly. It even prohibited WeChat from using the term “app” as applied to mini programs. But the challenge to the App Store might be the least of Apple’s worries. For now, WeChat is changing smartphones in China. One day soon, its impact will be felt worldwide.” * Two Google Homes are arguing on Twitch and thousands of people can’t look away A Twitch stream called Seebotschat is really taking things to the next level with a live feed of two Google Homes engaged in an absolutely hilarious war of words. 
The future of natural language is here, folks!","Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. ",Web Picks (week of 23 January 2017),Live,230 668,"SPEED YOUR SQL QUERIES WITH SPARK SQL. Chetna Warade / August 19, 2015. It can be painful to query your enterprise Relational Database Management System (RDBMS) for useful information. You write lengthy Java code to create a database connection, send a SQL query, retrieve rows from the database tables, and convert data types. That's a lot of steps, and they all take time when users are waiting for answers. Plus, most relational databases are stuck on servers that live somewhere inside the walls of your organization, inaccessible to cloud-based apps and services. There's a more efficient and faster way to get the answers you need. You can use Apache® Spark™, the high-speed, in-memory analytics engine, to query your database instead. Not only does Spark's SQL API provide lightning-fast analytics, it also lets you access the database schema and data with only a few simple lines of code. How efficient and elegant. WHAT YOU'LL LEARN This tutorial shows you how to use Spark to query a relational database. First, we'll set up a PostgreSQL database to serve as our relational database (either on the cloud-based Compose PostgreSQL service or in a local instance). Next, you'll learn how to connect and run Spark SQL commands through the Spark shell and then through IPython Notebook. WHY SPARK? Apache® Spark™ is an open-source cluster-computing framework with in-memory processing, which enables analytic applications to run up to 100 times faster than other technologies on the market today. It helps developers be more productive and frees them to write less code. I'm so glad that IBM is committed to the Apache Spark project, investing in design and education programs to promote open source innovation. We're working hard to help developers leverage Spark to create smart, fast apps that use and deliver data wisely.
Learn more. SET UP YOUR POSTGRESQL DATABASE You can use a cloud-based Compose PostgreSQL instance (the faster, easier option), or install PostgreSQL locally and open external access to its port. OPTION 1: SET UP A CLOUD-BASED COMPOSE POSTGRESQL DATABASE This online option gives you the availability and flexibility of a cloud-based service and some neat browser-based tools. If you prefer to work locally, skip down to Option 2. 1. Download psql. With Compose, your PostgreSQL database will live in the cloud, but you need to install the psql command line tool. Go to http://www.postgresql.org/download/ and download PostgreSQL, accepting all default installation settings. 2. Sign up for a Compose PostgreSQL account. Go to https://app.compose.io/signup/ , select the PostgreSQL database option, and enter your account information. Compose asks for a credit card upon sign-up, but you get a free 30-day trial. 3. Click the Deployments button. 4. Click the deployment link to open it. 5. Click the Reveal your credentials link. You see your username and password, which you'll use in a minute. 6. Locate the Command Line, copy its contents, and keep this Compose browser window open. 7. Populate the database. 1. Open your terminal/command window and go to psql by typing the command: cd /Library/PostgreSQL/9.4/bin (if your directory/version is different, locate it first, then cd to the correct directory). 2. Connect with the following commands: type ./psql then, within quotation marks, paste in the command line you just copied, then press Enter. This will look something like: ./psql "sslmode=require host=haproxy429.aws-us-east-1-portal.3.dblayer.com port=10429 dbname=compose user=admin" 3. When prompted, enter your Compose PostgreSQL password and press Enter. You'll see: Password: psql (9.4.4) SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off) Type "help" for help. 4. Copy, paste, and run the following SQL commands: CREATE TABLE weather ( city varchar(80), temp_lo int, -- low temperature temp_hi int, -- high temperature prcp real, -- precipitation date date ); INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27'); INSERT INTO weather VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29'); INSERT INTO weather VALUES ('Hayward', 54, 37, 0.25, '1994-11-29'); Shell output from psql will look like: compose=> CREATE TABLE weather ( compose(> city varchar(80), compose(> temp_lo int, -- low temperature compose(> temp_hi int, -- high temperature compose(> prcp real, -- precipitation compose(> date date compose(> ); CREATE TABLE compose=> city | temp_lo | temp_hi | prcp | date ------+---------+---------+------+------ (0 rows) compose=> INSERT 0 1 compose=> INSERT 0 1 compose=> INSERT 0 1 compose=> city | temp_lo | temp_hi | prcp | date ---------------+---------+---------+------+------------ San Francisco | 46 | 50 | 0.25 | 1994-11-27 San Francisco | 43 | 57 | 0 | 1994-11-29 Hayward | 54 | 37 | 0.25 | 1994-11-29 (3 rows) Return to Compose and, from the menu on the left, choose Browser. Click the compose database. You should see your new weather table; click it to see the values you just added. Open your deployment again (see steps 3-4) and copy the Public hostname/port. You'll use it in a minute.
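If you'd rather verify the table from Python than from psql, a minimal check with the psycopg2 driver looks roughly like this. This is a sketch added here, not part of the original tutorial; the host, port, and password are placeholders for your own Compose credentials.

```python
import psycopg2

# Placeholder connection details: use the public hostname/port and password
# from your own Compose deployment.
conn = psycopg2.connect(
    host="haproxy429.aws-us-east-1-portal.3.dblayer.com",  # example hostname from the tutorial
    port=10429,
    dbname="compose",
    user="admin",
    password="YOUR_PASSWORD",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    # Read back the rows inserted above
    cur.execute("SELECT city, temp_lo, temp_hi, prcp, date FROM weather")
    for row in cur.fetchall():
        print(row)

conn.close()
```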
OPTION 2: SET UP A LOCAL POSTGRESQL DATABASE If you prefer to work with a locally-installed PostgreSQL database, follow the steps below. The local option requires some additional steps, like opening external access to the database port and restarting the database. (If you already followed the steps for Option 1 to set up a cloud-based Compose PostgreSQL database, skip ahead to the section on accessing data with Spark.) 1. Go to http://www.postgresql.org/download/ and download PostgreSQL, accepting all default installation settings. 2. Modify the pg_hba.conf file to allow external access. By default, your PostgreSQL database is accessible through port number 5432, and only localhost can access it. For this tutorial, we'll open access to external programs and machines. To do so: 1. Open your terminal or command window. 2. Sign in as the Postgres user by entering su postgres (if prompted, enter your PostgreSQL password). 3. Edit the pg_hba.conf file (located in /Library/PostgreSQL/9.4/data) using your favorite editing tool; we edited it from the command line with vi: vi pg_hba.conf 4. Add the following line to the file: host all all 0.0.0.0/0 md5 3. Restart the PostgreSQL database by running these two commands in Terminal: cd /Library/PostgreSQL/9.4/bin ./pg_ctl status -D ../data/ You see the following message: pg_ctl: server is running (PID: XXXX) /Library/PostgreSQL/9.4/bin/postgres "-D/Library/PostgreSQL/9.4/data" Take note of these additional commands: ./pg_ctl stop -D ../data/ to stop the database, and ./pg_ctl start -D ../data/ to start it again. 4. Populate the database. Launch the SQL Shell (psql) application on your machine (located at /Library/PostgreSQL/9.4/bin) to connect to the database and populate it with some data. We used the command line and entered: cd /Library/PostgreSQL/9.4/scripts/ and then: ./runpsql.sh and psql returns: Server [localhost]: Database [postgres]: Port [5432]: Username [postgres]: Password for user postgres: psql (9.4.4) Type "help" for help. Populate the table with the same CREATE TABLE and INSERT statements used in Option 1; querying the weather table then shows: city | temp_lo | temp_hi | prcp | date ---------------+---------+---------+------+------------ San Francisco | 46 | 50 | 0.25 | 1994-11-27 San Francisco | 43 | 57 | 0 | 1994-11-29 Hayward | 37 | 54 | | 1994-11-29 (3 rows) For more on working in PostgreSQL, see http://www.postgresql.org/docs/9.4/static/tutorial-table.html . ACCESS SQL DATA VIA SPARK SHELL There are two ways to work with Spark: * Access a virtual machine where Spark is installed. For this tutorial, we used a VM with Apache Spark v1.3.1 installed, hosted via VirtualBox on Mac OS X version 10.9.4. The VM image is wrapped by Vagrant, a virtual development environment configuration tool. * Or download Apache Spark and run it locally on your machine. Get it at: http://spark.apache.org/downloads.html Once you've installed Spark or know where it lives: 1. Download postgresql-9.4-1200.jdbc41.jar from https://jdbc.postgresql.org/download.html and save it to a location accessible to the Spark shell. Note its location; you'll need its path in a few minutes. 2. In your terminal or command window, open the Spark shell. * If Spark is installed locally, cd to the directory that contains the Spark shell, then run spark-shell. * If using a VM, ssh into the VM/machine where Spark is installed. Create a new Spark DataFrame object using SQLContext.load. In a command/terminal window, type: vagrant@sparkvm:~$ spark-shell --jars ./drivers/postgresql-9.4-1200.jdbc41.jar At the scala command prompt, enter the following command:
* If you're using a cloud-based Compose PostgreSQL database, retrieve the public hostname/port you copied earlier and insert it within the URL value after jdbc:postgresql:// . It should look something like this: scala> val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://haproxy425.aws-us-east-1-portal.3.dblayer.com:10425/compose?user=admin&password=XXXXXXXXXXXXXXXX", "dbtable" -> "weather")) * If you're connecting to a locally deployed PostgreSQL database, enter the following command: scala> val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://192.168.1.15:5432/postgres?user=postgres&password=postgres", "dbtable" -> "weather")) Type these commands: scala> jdbcDF.show() scala> jdbcDF.printSchema() scala> jdbcDF.filter(jdbcDF("temp_hi") > 40).show() You see Scala output that looks like this: That's it! You've accessed your PostgreSQL data via Spark SQL. ACCESS SQL DATA VIA IPYTHON NOTEBOOK In this part of the tutorial we walk through how to modify Spark's classpath and run Spark SQL commands through IPython Notebook. Note: this section assumes familiarity with Spark server installation and IPython Notebook. 1. Retrieve the complete path and name of the JDBC driver as a string value (you noted this info in the last section). 2. Locate the compute-classpath.sh file under /usr/local/bin/spark-1.3.1-bin-hadoop2.6/bin 3. Add the following line to the end of the file: appendToClasspath "/home/vagrant/drivers/postgresql-9.4-1200.jdbc41.jar" 4. Restart the VM that runs Spark. Now the IPython Notebook is ready to connect and query the sample database. 5. Launch the IPython Notebook. 6. Insert a new cell. 7. Create a new Spark DataFrame object using SQLContext.load. Tip: here you can use the same Spark commands you used at the Scala command prompt in the previous section. 8. You see Spark commands in gray boxes and, beneath each call, IPython shows the data returned.
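Since the notebook steps above reuse the spark-shell commands, here is roughly what the same session looks like in a Python (PySpark) notebook cell. This is a hedged sketch against the Spark 1.3-era DataFrame API used in this tutorial, not output from that exact environment; the Compose hostname, port, and password are placeholders.

```python
# Assumes the PostgreSQL JDBC driver is already on Spark's classpath
# (see the compute-classpath.sh step above) and that `sqlContext` is the
# SQLContext provided by the notebook environment.
url = ("jdbc:postgresql://haproxy425.aws-us-east-1-portal.3.dblayer.com:10425/compose"
       "?user=admin&password=XXXXXXXX")  # placeholder credentials

# Load the weather table into a Spark DataFrame over JDBC
jdbcDF = sqlContext.load(source="jdbc", url=url, dbtable="weather")

jdbcDF.show()          # print the rows
jdbcDF.printSchema()   # inspect the schema inferred from PostgreSQL
jdbcDF.filter(jdbcDF.temp_hi > 40).show()  # same filter as the Scala example
```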
SUMMARY Now you know how to connect Spark to a relational database and use Spark's API to perform SQL queries. Spark can also run as a cloud service, potentially unlocking your on-premises SQL data, which we'll explore more in future posts. To try these calls with another type of database, you'd follow these same steps, but download the JDBC driver supported by your RDBMS. Tagged: PostgreSQL / Spark / SQL",Get faster queries and write less code too. Learn how to use Spark SQL to query your relational database. Follow this tutorial and see how to query a cloud-based Compose PostgreSQL instance or a local PostgreSQL database.,Speed your SQL Queries with Spark SQL,Live,231 672,"SELF-SERVICE DATA PREPARATION WITH IBM DATA REFINERY. Carmen Ruppach, Offering Manager for Data Refinery on Watson Data Platform at IBM. Nov 14. If you are like most data scientists, you are probably spending a lot of time cleansing, shaping, and preparing your data before you can actually start with the more enjoyable part of building and training machine learning models. As a data analyst, you might face similar struggles obtaining data in the format you need to build your reports. In many companies, data scientists and analysts need to wait for their IT teams before they can get access to cleaned data in a consumable format. IBM Data Refinery addresses this issue. It provides an intuitive self-service data preparation environment where you can quickly analyze, cleanse, and prepare data sets. It is a fully managed cloud service, available in open beta now. Analyze and prepare your data With IBM Data Refinery, you can interactively explore your data and use a wide range of transformations to cleanse and transform data into the format you need for analysis. You can use a simple point-and-click interface for selecting and combining a wide range of built-in operations, such as filtering, replacing, and deriving values. It is also possible to quickly remove duplicates, split and concatenate values, and choose from a comprehensive list of text and math operations. Interactive data exploration and preparation. If you prefer to code, in IBM Data Refinery you can directly enter R commands via R libraries such as dplyr. We provide code templates and in-context documentation to help you become productive with the R syntax more quickly. Code templates to help users with R syntax. If you're not satisfied with the shaping results, you can easily undo and change operations in the Steps sidebar. The interactive user interface works on a subset of the data to give you a faster preview of the operations and results. Once you're happy with the sample output, you can apply the transformations to the entire data set and save all transformation steps in a data flow. You can repeat the data flow later and track the changes that were applied to your data. To accelerate job execution, Apache Spark is used as the execution engine. Data profiling and visualization Data shaping is an iterative and time-consuming process. In a traditional data science workflow, you might use one tool to apply various transformations to your data set, and then load the data into another tool to visualize and evaluate the results. Over many cycles, this continual tool hopping can become frustrating. IBM Data Refinery soothes the pain by integrating both data transformations and visualizations in a single interface, so you can move between views with a simple click. You can use the Profile tab to view descriptive statistics of your data columns in order to better understand the distribution of values. You can continue to apply transformations, and the corresponding profile information adjusts automatically. On the Visualization tab you can select a combination of columns to build charts using Brunel (an open source visualization library).
IBM Data Refinery automatically suggests appropriate plots, and you can choose between 12 pre-defined chart types. You can adjust the appearance of the charts using Brunel syntax. Connecting to data wherever it resides IBM Data Refinery comes with a comprehensive set of 30 prebuilt data connectors, so you can set up connections to a wide range of commonly used on-premises and cloud data stores. You can connect to IBM as well as non-IBM services. If your data service is hosted on IBM Cloud (formerly IBM Bluemix), you can directly access the data service instance from IBM Data Refinery. Once you specify a connection and connect the data object to your data, you can start to analyze and refine your data wherever it resides. Try out IBM Data Refinery! Sign up for free at: https://www.ibm.com/cloud/data-refinery Tags: Data Science, Data Visualization, Data Analysis, Data Refinery","If you are like most data scientists, you are probably spending a lot of time to cleanse, shape and prepare your data before you can actually start with the more enjoyable part of building and…",Self-service data preparation with IBM Data Refinery,Live,232 676,"Stats and Bots: BAYESIAN NONPARAMETRICS. AN INTRODUCTION TO THE DIRICHLET PROCESS AND ITS APPLICATIONS. Vadim Smolyakov, passionate about data science and machine learning, https://github.com/vsmolyakov . Oct 12. Bayesian Nonparametrics is a class of models with a potentially infinite number of parameters. The high flexibility and expressive power of this approach enables better data modelling compared to parametric methods. Bayesian Nonparametrics is used in problems where a dimension of interest grows with data, for example, in problems where the number of features is not fixed but allowed to vary as we observe more data. Another example is clustering, where the number of clusters is automatically inferred from data. The Statsbot team asked a data scientist, Vadim Smolyakov, to introduce us to Bayesian Nonparametric models. In this article, he describes the Dirichlet process along with associated models and links to their implementations. INTRODUCTION: DIRICHLET PROCESS K-MEANS Bayesian Nonparametrics are a class of models for which the number of parameters grows with data. A simple example is non-parametric K-means clustering [1]. Instead of fixing the number of clusters K, we let the data determine the best number of clusters. By letting the number of model parameters (cluster means and covariances) grow with data, we are better able to describe the data as well as generate new data given our model. Of course, to avoid over-fitting, we penalize the number of clusters K via a regularization parameter which controls the rate at which new clusters are created.
Thus, our new K-means objective becomes: In the figure above, we can see the non-parametric clustering, aka Dirichlet-Process (DP) K-Means applied to the Iris dataset. The strength of regularization parameter lambda (right), controls the number of clusters created. Algorithmically, we create a new cluster, every time we discover that a point (x_i) is sufficiently far away from all the existing cluster means: The resulting update is an extension of the K-means assignment step: we reassign a point to the cluster corresponding to the closest mean or we start a new cluster if the squared Euclidean distance is greater than lambda. By creating new clusters for data points that are sufficiently far away from the existing clusters, we eliminate the need to specify the number of clusters K ahead of time. Dirichlet process K-means eliminates the need for expensive cross-validation in which we sweep a range of values for K in order to find the optimum point in the objective function. For an implementation of the Dirichlet process K-means algorithm see the following github repo . DIRICHLET PROCESS The Dirichlet process (DP) is a stochastic process used in Bayesian nonparametric models [2]. Each draw from a Dirichlet process is a discrete distribution. For a random distribution G to be distributed according to a DP, its finite dimensional marginal distributions have to be Dirichlet distributed. Let H be a distribution over theta and alpha be a positive real number. We say that G is a Dirichlet process with base distribution H and concentration parameter alpha if for every finite measurable partition A1,…, Ar of theta we have: Where Dir is a Dirichlet distribution defined as: The Dirichlet distribution can be visualized over a probability simplex as in the figure below. The arguments to the Dirichlet distribution (x1, x2, x3) can be interpreted as pseudo-counts. For example, in the case of (x1, x2, x3) = (2, 2, 2) the Dirichlet distribution (left) has high probability near the middle, in comparison to the (2, 2, 10) case where it concentrates around one of the corners. In the case of (10, 10, 10) we have more observations, and the Dirichlet distribution concentrates more in the middle (since equal number of counts are observed in this case). The base distribution H is the mean of the DP: E[G(A)] = H(A), whereas the concentration parameter is the inverse variance: VAR[G(A)] = H(A)[1-H(A)] / (1+alpha). Thus, the larger the alpha, the smaller the variance and the DP will concentrate more of its mass around the mean as shown in the figure below [3]. STICK-BREAKING CONSTRUCTION We have seen the utility of Bayesian Nonparametric models is in having a potentially infinite number of parameters. We also had a brief encounter with the Dirichlet process that exhibits a clustering property that makes it useful in mixture modeling where the number of components grows with data. But how do we generate a mixture model with an infinite number of components?The answer is a stick-breaking construction [4] that represents draws G from DP(alpha, H) as a weighted sum of atoms (or point masses). It is defined as follows: The mixture model G consists of an infinite number of weights (pi_k) and mixture parameters (theta_k). The weights are generated by first sampling beta_k from Beta(1, alpha) distribution, where alpha is the concentration parameter and then computing pi_k as in the expression above, while mixture parameters theta_k are sampled from the base distribution H. 
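The formulas in this article were rendered as images in the original post and did not survive extraction. The standard forms they refer to (the DP-means objective, the finite-dimensional DP marginals, the Dirichlet density, and the stick-breaking construction) are reproduced below as a reconstruction, following the usual definitions in references [1]-[4]:

```latex
% DP-means objective: the K-means cost plus a penalty on the number of clusters
\min_{\{\ell_k\},\{\mu_k\},K}\;\sum_{k=1}^{K}\sum_{x_i\in\ell_k}\lVert x_i-\mu_k\rVert^2 \;+\; \lambda K,
\qquad\text{start a new cluster if }\min_k\lVert x_i-\mu_k\rVert^2>\lambda .

% Dirichlet process: finite-dimensional marginals are Dirichlet distributed
\big(G(A_1),\dots,G(A_r)\big) \sim \mathrm{Dir}\big(\alpha H(A_1),\dots,\alpha H(A_r)\big),
\qquad
\mathrm{Dir}(\pi\mid\alpha_1,\dots,\alpha_r)=\frac{\Gamma\!\big(\sum_j\alpha_j\big)}{\prod_j\Gamma(\alpha_j)}\prod_{j=1}^{r}\pi_j^{\alpha_j-1}.

% Stick-breaking construction of G ~ DP(alpha, H)
\beta_k\sim\mathrm{Beta}(1,\alpha),\qquad
\pi_k=\beta_k\prod_{l=1}^{k-1}(1-\beta_l),\qquad
\theta_k\sim H,\qquad
G=\sum_{k=1}^{\infty}\pi_k\,\delta_{\theta_k}.
```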
We can visualize the stick-breaking construction as in the figure below: Notice that we start with a stick of unit length (left) and in each iteration we break off a piece of length pi_k. The length of the piece that we break off is determined by the concentration parameter alpha. For alpha=5 (middle) the stick lengths are longer and as a result there are fewer significant mixture weights. For alpha=10 (right) the stick lengths are shorter and therefore we have more significant components. Thus, alpha determines the rate of cluster growth in a non-parametric model. In fact, the number of clusters created is proportional to alpha x log(N) where N is the number of data points. DIRICHLET PROCESS MIXTURE MODEL (DPMM) A Dirichlet process mixture model (DPMM) belongs to a class of infinite mixture models in which we do not impose any prior knowledge on the number of clusters K. DPMM models learn the number of clusters from the data using a nonparametric prior based on the Dirichlet process (DP). Automatic model selection leads to computational savings of cross validating the model for multiple values of K. Two equivalent graphical models for a DPMM are shown below: Here, x_i are observed data points and with each x_i we associate a label z_i that assigns x_i to one of the K clusters. In the left model, the cluster parameters are represented by pi (mixture proportions) and theta (cluster means and covariances) with associated uninformative priors (alpha and lambda). For ease of computation, conjugate priors are used such as a Dirichlet prior for mixture weights and Normal-Inverse-Wishart prior for a Gaussian component. In the right model, we have a DP representation of DPMM where the mixture distribution G is sampled from a DP (alpha, H) with concentration parameter alpha and base distribution H. There are many algorithms for learning the Dirichlet process mixture models based on sampling or variational inference. For a Gibbs sampler implementation of DPMMs with Gaussian and Discrete base distribution, have a look at the following code . The figure above shows DPMM clustering results for a Gaussian distribution (left) and Categorical distribution (right). On the left, we can see the ellipses (samples from posterior mixture distribution) of the DPMM after 100 Gibbs sampling iterations. The DPMM model initialized with 2 clusters and a concentration parameter alpha of 1, learned the true number of clusters K=5 and concentrated around cluster centers. On the right, we can see the results of clusters of Categorical data, in this case a DPMM model was applied to a collection of NIPS articles. It was initialized with 2 clusters and a concentration parameter alpha of 10. After several Gibbs sampling iterations, it discovered over 20 clusters, with the first 4 shown in the figure. We can see that the word clusters have similar semantic meaning within each cluster and the cluster topics are different across clusters. HIERARCHICAL DIRICHLET PROCESS (HDP) The hierarchical Dirichlet process (HDP) is an extension of DP that models problems involving groups of data especially when there are shared features among the groups. The power of hierarchical models comes from an assumption that the features among groups are drawn from a shared distribution rather than being completely independent. Thus, with hierarchical models we can learn features that are common to all groups in addition to the individual group parameters. 
In HDP, each observation within a group is a draw from a mixture model and mixture components are shared between groups. In each group, the number of components is learned from data using a DP prior. The HDP graphical model is summarized in the figure below [5]: Focusing on HDP formulation in the figure on the right, we can see that we have J groups where each group is sampled from a DP: Gj ~ DP(alpha, G0) and G0 represents shared parameters across all groups which in itself is modeled as a DP: G0 ~ DP(gamma, H). Thus, we have a hierarchical structure for describing our data. There exists many ways for inferring the parameters of hierarchical Dirichlet processes. One popular approach that works well in practice and is widely used in the topic modelling community is an online variational inference algorithm [6] implemented in gensim . The figure above shows the first four topics (as a word cloud) for an online variational HDP algorithm used to fit a topic model on the 20newsgroups dataset . The dataset consists of 11,314 documents and over 100K unique tokens. Standard text pre-processing was used, including tokenization, stop-word removal, and stemming. A compressed dictionary of 4K words was constructed by filtering out tokens that appear in less than 5 documents and more than 50% of the corpus. The top-level truncation was set to T=20 topics and the second level truncation was set to K=8 topics. The concentration parameters were chosen as gamma=1.0 at the top-level and alpha=0.1 at the group level to yield a broad range of shared topics that are concentrated at the group level. We can find topics about autos, politics, and for sale items that correspond to the target labels of the 20newsgroups dataset. HDP HIDDEN MARKOV MODELS The hierarchical Dirichlet process (HDP) can be used to define a prior distribution on transition matrices over countably infinite state spaces. The HDP-HMM is known as an infinite hidden Markov model where the number of states is inferred automatically. The graphical model for HDP-HMM is shown below: In a nonparametric extension of HMM, we consider a set of DPs, one for each value of the current state. In addition, the DPs must be linked because we want the same set of next states to be reachable from each of the current states. This relates directly to HDP, where the atoms associated with state-conditional DPs are shared. The HDP-HMM parameters can be described as follows: Where the GEM notation is used to represent stick-breaking. One popular algorithm for computing the posterior distribution for infinite HMMs is called beam sampling and is described in [7]. DEPENDENT DIRICHLET PROCESS (DDP) In many applications, we are interested in modelling distributions that evolve over time as seen in temporal and spatial processes. The Dirichlet process assumes that observations are exchangeable and therefore the data points have no inherent ordering that influences their labelling. This assumption is invalid for modelling temporal and spatial processes in which the order of data points plays a critical role in creating meaningful clusters. The dependent Dirichlet process (DDP), originally formulated by MacEachern, provides a nonparametric prior over evolving mixture models. A construction of the DDP built on the Poisson process [8] led to the development of the DDP mixture model as shown below: In the graphical model above we see a temporal extension of the DP process in which a DP at time t depends on the DP at time t-1. 
This time-varying DP prior is capable of describing and generating dynamic clusters with means and covariances changing over time. CONCLUSION In Bayesian Nonparametric models the number of parameters grows with data. This flexibility enables better modeling and generation of data. We focused on the Dirichlet process (DP) and key applications such as DP K-means (DP-means), Dirichlet process mixture models (DPMMs), hierarchical Dirichlet processes (HDPs) applied to topic models and HMMs, and dependent Dirichlet processes (DDPs) applied to time-varying mixtures. We looked at how to construct nonparametric models using stick-breaking and examined some of the experimental results. To better understand Bayesian Nonparametric models, I encourage you to read the literature mentioned in the references and experiment with the code linked throughout the article on challenging datasets! REFERENCES [1] B. Kulis and M. Jordan, “Revisiting k-means: New Algorithms via Bayesian Nonparametrics”, ICML, 2012. [2] E. Sudderth, “Graphical Models for Visual Object Recognition and Tracking”, PhD thesis (Ch. 2.5), 2006. [3] A. Rochford, “Dirichlet Process Mixture Model in PyMC3”. [4] J. Sethuraman, “A Constructive Definition of Dirichlet Priors”, Statistica Sinica, 1994. [5] Y. Teh, M. Jordan, M. Beal and D. Blei, “Hierarchical Dirichlet Processes”, JASA, 2006. [6] C. Wang, J. Paisley and D. Blei, “Online Variational Inference for the Hierarchical Dirichlet Process”, JMLR, 2011. [7] J. Van Gael, Y. Saatci, Y. Teh and Z. Ghahramani, “Beam Sampling for the Infinite Hidden Markov Model”, ICML, 2008. [8] D. Lin, W. Grimson and J. W. Fisher III, “Construction of Dependent Dirichlet Processes Based on Compound Poisson Processes”, NIPS, 2010.",An introduction to Bayesian Nonparametrics: the Dirichlet process along with associated models and links to their implementations.,Bayesian Nonparametric Models – Stats and Bots,Live,233 680,"John Thomas, IBM Distinguished Engineer. Dec 20 3 SCENARIOS FOR MACHINE LEARNING ON MULTICLOUD Wikimedia Commons photo. More and more cloud-computing experts are talking about “multicloud”.
The term refers to an architecture that spans multiple cloud environments in order to take advantage of different services, different levels of performance, security, or redundancy, or even different cloud vendors. But what sometimes gets lost in these discussions is that multicloud is not always public cloud. In fact, it’s often a combination of private and public clouds. As machine learning (ML) continues to pervade enterprise environments, we need to understand how to make ML practical on multicloud — including those architectures that span the firewall. Let’s look at three possible scenarios. SCENARIO 1: TRAIN WITH ON-PREM DATA, DEPLOY ON CLOUD It often happens that the data science team needs to build and train an ML model on sensitive customer data even though the model itself will be deployed on a public cloud. Data gravity and security issues mean that the model needs to be trained behind the firewall, where the data lives. However, the model may need to be invoked by cloud-native applications. Concerns about the latency for scoring calls mean that the model should be deployed close to the consuming app — near the edge of the network, outside the firewall. SCENARIO 2: TRAIN ON SPECIALIZED HARDWARE, DEPLOY ON SYSTEMS OF RECORD Deep Learning models as well as some types of classic ML models can benefit from significant acceleration using specialized hardware. For example, a data science team might decide to build and train the model on specialized hardware like a PowerAI machine, which consists of Power processors coupled to GPUs through high-speed NVLink connections. The PowerAI machine is designed to significantly speed up the training process, but the model itself may need to be consumed in a system of record like an on-premises z System. SCENARIO 3: TRAIN ON CLOUD WITH PUBLIC DATA, DEPLOY ON-PREM The third scenario is becoming increasingly common with the increased availability — and increased quality — of public data. Imagine a financial firm doing arbitrage on agricultural commodities. The data science team gathers a variety of publicly available data including weather and climate data, crop yield data, currency data, and more. Because the data is high-volume and non-proprietary, they aggregate it on a public cloud where they also train their ML model. They pull down the latest version of the model and integrate it within a proprietary application that the firm has developed to predict the prices of the commodities they trade. IBM’S APPROACH Each of these scenarios calls for a fit-for-purpose, multicloud architecture for flexibly training, deploying, and consuming the machine learning models. IBM takes an enterprise approach by making our Data Science Experience (DSX) platform available both on-prem and in the cloud — with intuitive interfaces designed to let users easily move from one to the other. With the same REST APIs, you can save, publish, and consume models across environments — on the mainframe, on a private cloud, or on the public cloud, including on non-IBM public clouds , like AWS and Azure. These two videos demonstrate how easy this is: AWS / Azure . A Kubernetes-based implementation of the DSX platform gives you the flexibility to run DSX Local within a variety of infrastructure options. For example, you can stand up a multi-node cluster with two separate infrastructure vendors, and then build and train models wherever it’s most convenient, and move your models from one vendor infrastructure to the other. 
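Concretely, the consumption pattern in all three scenarios comes down to an application making an HTTP scoring call to wherever the model happens to be deployed. The sketch below is a rough, hypothetical illustration in Python using the requests library; the endpoint URL, the token handling, and the payload shape are placeholders and not the actual Watson Machine Learning API.

import requests

# Hypothetical scoring endpoint of a deployed model; in practice the URL
# and the authentication scheme come from the deployment environment.
scoring_url = 'https://example.com/v1/deployments/churn-model/score'
api_token = 'replace-with-a-real-token'

payload = {'fields': ['tenure_months', 'plan'],
           'values': [[14, 'basic']]}

response = requests.post(scoring_url,
                         json=payload,
                         headers={'Authorization': 'Bearer ' + api_token},
                         timeout=10)
response.raise_for_status()
print(response.json())  # e.g. predicted label and probabilities

Whether the endpoint lives on a public cloud or behind the firewall only changes the URL and the credentials, which is what makes this pattern a natural fit for multicloud deployments.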
In DSX, each deployed model gets an external and internal end point. To invoke the model, simply use a REST API call for the end point. You can build and train the model on-prem and deploy the model to the cloud, where an external application like a chatbot can consume the model by making a REST API call to the particular end point. When multicloud flexibility lets you pick and choose the cloud environments that best fit your needs, you can align with the principle of data gravity and let your consumption channels dictate where you deploy the machine learning models that will transform your organization. Visit us to learn more about the Data Science Experience. * Cloud Computing * Machine Learning * Scenario * Scenarioplanning One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. 12 Blocked Unblock Follow FollowingJOHN THOMAS IBM Distinguished Engineer. #Analytics, #Cognitive, #Cloud, #MachineLearning, #DataScience. Chess, Food, Travel (60+ countries). Tweets are personal opinions. FollowINSIDE MACHINE LEARNING Deep-dive articles about machine learning and data. Curated by IBM Analytics. * 12 * * * Never miss a story from Inside Machine learning , when you sign up for Medium. Learn more Never miss a story from Inside Machine learning Get updates Get updates",More and more cloud-computing experts are talking about “multicloud”. The term refers to an architecture that spans multiple cloud environments in order to take advantage of different services…,3 Scenarios for Machine Learning on Multicloud,Live,234 683,"Compose The Compose logo Articles Sign in Free 30-day trialHOW TO ENABLE A REDIS CACHE FOR POSTGRESQL WITH ENTITY FRAMEWORK 6 Published Jul 17, 2017 redis postgresql c# How to enable a Redis cache for PostgreSQL with Entity Framework 6Caching a database can be a chore but in this Write Stuff article, Mariusz Bojkowski shows how easy it can be to add a Redis cache to your PostgreSQL database if you are using Entity Framework 6 . Database caching is a commonly used technique to improve scalability. By offloading database work to other, faster stores it can also help improve the availability of the data too. Often, though, that caching comes at the cost of hardwired code in the application to check the cache first before the database. But what if we could do it cheaply and transparently to the application? Let's try to leverage C# and the features of Entity Framework 6 to do all the heavy lifting. I’ll show how to use PostgreSQL database with the framework and how to add transparent caching using Redis database. In this tutorial, I’ll create simple Books table and a console application that will get the data from the table. Next, I’ll upgrade the application to use caching. I’ll be using Visual Studio 2017. The full application source is available on GitHub . PREPARING THE POSTGRESQL DATABASE First, you need to create PostgreSQL database using the tools or provider of your choice. Next, let’s create a sample database table of books. Connect to the PostgreSQL database and execute the following create statement. CREATE TABLE ""Books"" ( ""Id"" SERIAL NOT NULL, ""Title"" VARCHAR(50) NOT NULL, ""Author"" VARCHAR(50) NOT NULL, PRIMARY KEY (""Id"") ); Please remember that all identifiers (table names, column names) are folded to lower case in a PostgreSQL database. To change it, make sure you use double quotation marks in the table name and column names. 
This is required as it will simplify Book entity mapping to properties of the C# model class. CREATE ENTITY FRAMEWORK APPLICATION Once the database table is ready, create a new console application. Open Visual Studio and click File menu, then New – Project. From the dialog box, choose Installed – Templates – Visual C# – Windows Classic Desktop . Chose Console App (.NET Framework) , then provide a name (I typed RedisCacheForPostgre ) and location. Next, let’s add PostgreSQL Entity Framework provider – add the latest version of Npgsql.EntityFramework NuGet package. It will also install Entity Framework 6 NuGet package as it’s one of the dependencies. Please note that at the moment of writing this article the latest version of the Npgsql provider (2.2.7) references Entity Framework version 6.0.0 (not the latest) and the version 6.0.0 will be installed. We will upgrade few paragraphs below. CONFIGURE ENTITY FRAMEWORK ADD POSTGRESQL CONNECTION STRING Open App.config file and add connectionStrings section as in the example below. Please keep configSections as the first child element of configuration node – it’s a strict .NET requirement. Otherwise, the application will crash at runtime. (...) password=secret"" providerName=""Npgsql"" /> (...) Please note that there is providerName attribute in the connection string definition pointing to the PostgreSQL provider (Npgsql). DEFINE BOOKS ENTITY Add a new folder to the project and give it ‘Entities’ name. Next, add Book class to the folder. It will reflect books entities from the database. using System.ComponentModel.DataAnnotations; using System.ComponentModel.DataAnnotations.Schema; namespace RedisCacheForPostgre.Entities { [Table(""Books"", Schema = ""public"")] public class Book { [Key] public int Id { get; set; } public string Title { get; set; } public string Author { get; set; } } } There is a Table attribute added to the class that defines the database table name. Note the schema parameter – by default Entity Framework uses dbo schema. PostgreSQL uses public schema on the other hand. Also, the Id property is decorated with Key attribute to instruct Entity Framework that its primary key column. DEFINE THE DATABASE CONTEXT Add PostgreContext class to the Entities folder. The class should inherit from System.Data.Entity.DbContext . It will be the main interface for accessing the database. using System.Data.Entity; namespace RedisCacheForPostgre.Entities { public class PostgreContext : DbContext { public PostgreContext() : base(nameOrConnectionString: ""PostgreSQL"") { } public DbSet ). This way Entity Framework will mark them as new rows. Finally, the SaveChanges method adds the new rows to the database. You can query the database to confirm that the rows have been added. QUERY POSTGRESQL DATABASE We have the sample data in the database, so let’s query it in the application. Add a PrintBooks method to Program class. using RedisCacheForPostgre.Entities; using System; using System.Linq; namespace RedisCacheForPostgre { public class Program { public static void Main(string[] args) { //InsertSampleData(); PrintBooks(); } private static void PrintBooks() { using (var context = new PostgreContext()) { var books = context.Book.ToList(); foreach(var book in books) { Console.WriteLine($"" '{book.Title}' by {book.Author}""); } } } } } Again, we create an instance of PostgreContext. Then, we get a list of all books by calling the Book.ToList method. Finally, the list is printed to the console. 
ADD REDIS CACHING ADD REDIS CONNECTION STRING Edit App.config and insert new connection string to the Redis database. (...) password=secret"" providerName=""Npgsql"" /> (...) In order to easily access the connection string later we have to add a reference to System.Configuration assembly – right click the project and choose Add - Reference from the context menu. Next, select Assemblies - Framework , find System.Configuration and check the checkbox next to it. ADD CACHE SUPPORT Add EFCache.Redis NuGet package that extends Entity Framework Cache by adding Redis support. It will update the Entity Framework to 6.1.3 version due to dependencies. DEFINE CACHING POLICY A cache needs to know how to forget data and that's done through a caching policy. Let's set one for our Redis cache by first adding RedisCachingPolicy to the Entities folder. The class has to inherit from EFCache.CachingPolicy . using System; using System.Collections.ObjectModel; using System.Data.Entity.Core.Metadata.Edm; using EFCache; namespace RedisCacheForPostgre.Entities { public class RedisCachingPolicy : CachingPolicy { protected override void GetExpirationTimeout(ReadOnlyCollection affectedEntitySets, out TimeSpan slidingExpiration, out DateTimeOffset absoluteExpiration) { slidingExpiration = TimeSpan.FromMinutes(5); absoluteExpiration = DateTimeOffset.Now.AddMinutes(30); } } } There is GetExpirationTimeout method overridden – it configures: * absoluteExpiration = 30 minutes , means that every cache entry will expire after 30 minutes. * slidingExpiration = 5 minutes , means that a cache entry might be expired if it hasn’t been accessed in 5 minutes (sooner than the above). Of course, it’s useless at this point as the class is used nowhere. ENABLE ENTITY FRAMEWORK CACHE Let’s add the last class to the project a file called Configuration.cs . It should inherit from System.Data.Entity.DbConfiguration . using EFCache; using EFCache.Redis; using System.Configuration; using System.Data.Entity; using System.Data.Entity.Core.Common; namespace RedisCacheForPostgre.Entities { public class Configuration : DbConfiguration { public Configuration() { var redisConnection = ConfigurationManager.ConnectionStrings[""Redis""].ToString(); var cache = new RedisCache(redisConnection); var transactionHandler = new CacheTransactionHandler(cache); AddInterceptor(transactionHandler); Loaded += (sender, args) => { args.ReplaceService( (s, _) = } } } Entity Framework will search for a class that inherits DbConfiguration at runtime. This way the class becomes a code-based configuration for Entity Framework . There are a few things happening here. * Redis connection string is read from an application configuration * RedisCache object is created – it’s responsible for reading from and writing to the Redis database * CacheTransactionHandler is created and registered – it monitors database transactions * On Loaded event replaces the default provider with CachingProviderServices – it tries to get items from the Redis cache first and falls back to the standard provider. Note that we pass a new instance of RedisCachingPolicy – the class was defined in the previous point and is responsible for caching rules (e.g. when data should be forgotten). Finally, let’s try to run the application. It will print the same set of books. But having a look at Redis database, you can see that new entries appeared there. REDIS CACHE MODE Please also remember to set the Redis to cache mode, otherwise, it’ll keep expanding and scaling up. 
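The article does not spell out what “cache mode” involves. One common way to run Redis as a cache, assuming you control the Redis configuration (redis.conf or your provider’s settings panel), is to cap memory and enable key eviction, for example:

# redis.conf (illustrative values): bound the memory used by the cache
# and evict the least-recently-used keys once the limit is reached.
maxmemory 256mb
maxmemory-policy allkeys-lru

With an eviction policy like this in place, stale cache entries are discarded automatically instead of accumulating indefinitely.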
SUMMARY With the small sample dataset used here you won’t see a significant performance improvement: only a single, simple query is issued against a handful of rows, and PostgreSQL alone can serve that quickly. The point of this article was to show how easy it is to add caching to Entity Framework and how transparent it is. Once the cache was added to the framework we didn’t have to change anything in the PrintBooks method and it still worked. The same would apply to all Entity Framework queries (if we had more). -------------------------------------------------------------------------------- Do you want to shed light on a favorite feature in your preferred database? Why not write about it for Write Stuff? Image attribution: Patrick Tomasso. This article is licensed with CC-BY-NC-SA 4.0 by Compose.","Caching a database can be a chore, but Mariusz Bojkowski shows how easy it can be to add a Redis cache to your PostgreSQL database if you are using Entity Framework 6.",How to enable a Redis cache for PostgreSQL with Entity Framework 6,Live,235 684,"Varun Agrawal. Oct 19 IMPROVING REAL-TIME OBJECT DETECTION WITH YOLO A NEW PERSPECTIVE FOR REAL-TIME OBJECT DETECTION In recent years, the field of object detection has seen tremendous progress, aided by the advent of deep learning. Object detection is the task of identifying objects in an image and drawing bounding boxes around them, i.e. localizing them. It’s a very important problem in computer vision due to its numerous applications, from self-driving cars to security and tracking. Prior approaches to object detection have generally proposed pipelines made of separate stages run in sequence. This causes a disconnect between what each stage accomplishes and the final objective, which is drawing a tight bounding box around the objects in an image. An end-to-end framework that optimizes the detection error in a joint fashion would be a better solution, not just to train the model for better accuracy but also to improve detection speed. This is where the You Only Look Once (or YOLO) approach comes into play. Varun Agrawal told the Statsbot team why YOLO is the better option compared to other approaches in object detection. Illustration source. Deep learning has proven to be a powerful tool for image classification, achieving human level capability on this task.
Earlier detection approaches leveraged this power to transform the problem of object detection to one of classification, which is recognizing what category of objects the image belonged to. The way this was done was via a 2-stage process: 1. The first stage involved generating tens of thousands of proposals. They are nothing but specific rectangular areas on the image also known as bounding boxes, of what the system believed to be object-like things in the image. The bounding box proposal could either be around an actual object in an image or not, and filtering this out was the objective of the second stage. 2. In the second stage, an image classifier would classify the sub-image inside the bounding box proposal, and the classifier would say if it was of a particular object type or simply a non-object or background. While immensely accurate, this 2-step process suffered from certain flaws such as efficiency, due to the immense number of proposals being generated, and a lack of joint optimization over both proposal generation and classification. This leads to each stage not truly understanding the bigger picture, instead being siloed to their own mini-problem and thus limiting their performance. WHAT YOLO IS ALL ABOUT This is where YOLO comes in. YOLO, which stands for You Only Look Once, is a deep learning based object detection algorithm developed by Joseph Redmon and Ali Farhadi at the University of Washington in 2016. The rationale behind calling the system YOLO is that rather than pass in multiple subimages of potential objects, you only passed in the whole image to the deep learning system once. Then, you would get all the bounding boxes as well as the object category classifications in one go. This is the fundamental design decision of YOLO and is what makes it a refreshing new perspective on the task of object detection. The way YOLO works is that it subdivides the image into an NxN grid, or more specifically in the original paper a 7x7 grid. Each grid cell, also known as an anchor, represents a classifier which is responsible for generating K bounding boxes around potential objects whose ground truth center falls within that grid cell (K is 2 in the paper) and classifying it as the correct object. Note that the bounding box is not restricted to be within the grid cell, it can expand within the boundaries of the image to accommodate the object it believes it is responsible to detect. This means that in the current version of YOLO, the system generates 98 bounding boxes of varying sizes to accommodate the various objects in the scene.PERFORMANCE AND RESULTS For more dense object detection, a user could set K or N to a higher number based on their needs. However, with the current configuration, we have a system that is able to output a large number of bounding boxes around objects as well as classify them into one of various object categories, based on the spatial layout of the image. This is done in a single pass through the image at inference time. Thus, the joint detection and classification leads to better optimization of the learning objective (the loss function) as well as real-time performance. Indeed, the results of YOLO are very promising. On the challenging Pascal VOC detection challenge dataset , YOLO manages to achieve a mean average precision, or mAP, of 63.4 (out of 100) while running at 45 frames per second. In comparison, the state of the art model, Faster R-CNN VGG 16 achieves an mAP of 73.2, but only runs at a maximum 7 frames per second, a 6x decrease in efficiency. 
You can see comparisons of YOLO to other detection frameworks in the table below. If one lets YOLO sacrifice some more accuracy, it can run at 155 frames per second, though only at an mAP of 52.7. Thus, the main selling point for YOLO is its promise of good performance in object detection at real-time speeds. That allows its use in systems such as robots, self-driving cars, and drones, where being time critical is of the utmost importance. YOLOV2 FRAMEWORK Recently, the same group of researchers released the new YOLOv2 framework, which leverages recent results in deep learning network design to build a more efficient network, and uses the anchor-box idea from Faster-RCNN to ease the learning problem for the network. Illustration source. The result is a detection system which is even better, achieving state-of-the-art performance at 78.6 mAP on the Pascal VOC detection dataset, while other systems, such as the improved version of Faster-RCNN (Faster-RCNN ResNet) and SSD500, only achieve 76.4 mAP and 76.8 mAP on the same test dataset. The key differentiator, though, is speed. The best performing YOLOv2 model runs at 40 FPS compared to 5 FPS for Faster-RCNN ResNet. Although SSD500 runs at 45 FPS, a lower resolution version of YOLOv2 with mAP 76.8 (the same as SSD500) runs at 67 FPS, showing the high performance capabilities of YOLOv2 as a result of its design choices. FINAL THOUGHTS In conclusion, YOLO has demonstrated significant performance gains while running at real-time speeds, an important middle ground in the era of resource-hungry deep learning algorithms. As we march towards a more automation-ready future, systems like YOLO and SSD500 are poised to usher in large strides of progress and enable the big AI dream. IMPORTANT READING THROUGH THE ARTICLE * You Only Look Once: Unified, Real-Time Object Detection * The PASCAL Visual Objects Challenge: A Retrospective * SSD: Single Shot Multibox Detector","Why YOLO is the better option compared to other approaches in real-time object detection.",Improving Real-Time Object Detection with YOLO,Live,236 686,"Armand Ruiz, Lead Product Manager Data Science Experience. Jan 22 DEEP LEARNING WITH DATA SCIENCE EXPERIENCE Deep learning is a branch of Machine Learning that uses lots of data to teach computers how to do things only humans were capable of before.
A good example of Deep Learning is perception, recognizing what’s in an image, what people are saying when they are talking, helping robots explore the world and interact with it. Deep learning is emerging as a central tool to solve perception problems in recent years. It’s the state of the art having to do with computer vision and speech recognition. Increasingly people are finding that deep learning is a much better tool to solve problems. Many companies today have made deep learning a central part of their machine learning toolkit. For example Facebook, Google and Uber are all using deep learning in their products. We at IBM are collaborating with the leaders in the market to push the research forward and lead in that space. Deep learning shines wherever there is lots of data and complex problems to solve and many companies today are facing lots of complicated problems. Deep learning can be applied to many different fields. As deep neural networks become increasingly important to everything from self-driving cars to voice recognition, new libraries are making it much easier to use deep learning to solve real problems. Building a training a multi-layer convolutional neural network would have taken hundreds of lines of code just a few years ago. In this post we are going to have an overview of the most popular Open Source projects that are available in the IBM Data Science Experience. WHY DEEP LEARNING NOW? One of the fascinating things about neural networks is how long they have taken to be an over night success. The history goes back all the way to the 1950s. Deep learning has really only taken off in the last five years.The reason is the increased availability of label data along with the greatly increased computational throughput of modern processors. For a long time, we didn’t have the huge label data sets that we needed to make deep learning work. Those data sets only became widely available with the rise of the Internet, which made collecting and labeling huge datasets feasible. But even when we had big datasets, we often didn’t have enough computational power to make us of them and it is only been in the last five years that processors have gotten big enough and fast enough to train large scale neural networks. HOW TO GET STARTED WITH DEEP LEARNING IN PYTHON There is a fast growing community of researchers, engineers, and data scientists who share a common, very powerful set of tools and most of them are Open Source. One of the nice things about deep learning is that it’s really a family of techniques that adapts to all sorts of data and all sorts of problems, all using a common infrastructure and a common language to describe things. The best is start with very simple models and move later to very large ones. It is simple to get started with your own personal computer to do very elaborate tasks. In the IBM Data Science Experience you have everything you need for free to start experimenting with Deep Learning technologies. Find here a summary of the most popular Deep Learning Python libraries and tutorials: * Theano : It is a low-level library that specializes in efficient computation. You’ll only use this directly if you need fine-grain customization and flexibility. → Tutorial * Tensorflow : It is another low-level library that is less mature than Theano. However, it’s supported by Google and offers out-of-the-box distributed computing. → Tutorial * Keras : It is a heavyweight wrapper for both Theano and Tensorflow. 
It’s minimalistic, modular, and awesome for rapid experimentation. This is our favorite Python library for deep learning and the best place to start for beginners. → Tutorial * Lasagne : It is a lightweight wrapper for Theano. Use this if need the flexibility of Theano but don’t want to always write neural network layers from scratch. → Tutorial * MXNet - It is another high-level library similar to Keras. It offers bindings for multiple languages and support for distributed computing. → Tutorial Resources * Getting Started with MXNet * Python deep learning * Machine Learning Blocked Unblock Follow FollowingARMAND RUIZ Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Deep learning is a branch of Machine Learning that uses lots of data to teach computers how to do things only humans were capable of before. A good example of Deep Learning is perception, recognizing…",Deep Learning with Data Science Experience,Live,237 689,"Cloudant is a database service that provides high-availability JSON data access. Kiwi Wearables's platform enables motion recognition for physical devices and software applications. Learn how Kiwi uses Cloudant on its back-end to persist motion events and process JSON between Node.js, Twilio, and other Web services.",Andy Ellicott and John David Chibuk talk about an Internet of Things application to record data captured from wearable technology and recorded in Cloudant ,"Building IoT Apps on Cloudant, with Kiwi Wearables",Live,238 691,"GETTING STARTED WITH GRAPHFRAMES IN APACHE SPARK David Taieb / July 15, 2016INTRODUCTION TO SPARK AND GRAPHS GraphX is one of the 4 foundational components of Spark — along with SparkSQL, Spark Streaming and MLlib — that provides general purpose Graph APIs including graph-parallel computation: GraphX APIs are great but present a few limitations. First they only work with Scala, so if you want to use GraphX with Python in a Jupyter Notebook, then you are out of luck. The second limitation is that they only work at the RDD ( Resilient Distributed Dataset ) level, which means that they can’t benefit from the performance improvement provided by DataFrames and the Catalyst query optimizer. GraphFrames is an open source Spark Package that was created with goal of addressing these two issues: * Provides a set of Python APIs * Works with DataFrames In this post, we’ll show how to get started with GraphFrames from a Python Notebook. We’ll start by creating a graph composed of airports as the vertices and flight routes as the edges, using the data from the flight predict application . I’ll then show interesting ways of visualizing the data and apply various graph algorithms to extract insights from the data. INSTALLING GRAPHFRAMES As previously mentioned, GraphFrames will be part of the Spark 2.0 distribution, but it’s currently available as a preview Spark package compatible with Spark 1.6 and higher. 
There are multiple ways to install the package depending on how you are running Spark: * Spark-submit or Spark-shell: simply add --packages graphframes:graphframes:0.1.0-spark1.6 as a command-line argument * Local Jupyter Notebook: assuming that you have access to the configuration files, all you need is to add --packages graphframes:graphframes:0.1.0-spark1.6 to the kernel.json located in ~/.ipython/kernels//kernel.json . { ""display_name"": ""pySpark (Spark 1.6.0) with graphFrames"", ""language"": ""python"", ""argv"": [ ""/Users/dtaieb/anaconda/envs/py27/bin/python"", ""-m"", ""ipykernel"", ""-f"", ""{connection_file}"" ], ""env"": { ""SPARK_HOME"": ""/Users/dtaieb/cdsdev/spark-1.6.0"", ""PYTHONPATH"": ""/Users/dtaieb/cdsdev/spark-1.6.0/python/:/Users/dtaieb/cdsdev/spark-1.6.0/python/lib/py4j-0.9-src.zip"", ""PYTHONSTARTUP"": ""/Users/dtaieb/cdsdev/spark-1.6.0/python/pyspark/shell.py"", ""PYSPARK_SUBMIT_ARGS"": ""--packages graphframes:graphframes:0.1.0-spark1.6 --master local[10] pyspark-shell"", ""SPARK_DRIVER_MEMORY"":""10G"", ""SPARK_LOCAL_IP"":""127.0.0.1"" } } * IPython Notebook (hosted on IBM Bluemix Apache Spark™ service): When the notebook is hosted and you don’t have access to the configuration files, I wished there were a magic command that would add a Spark Package to the session. Unfortunately there is no such thing today, so I made one :boom:. I created a helper Python library called pixiedust that implements a workaround. Note: The following steps currently only work on an python Notebook hosted on IBM Bluemix Open your python Notebook and run the following code: 1. Cell1: install the pixiedust library. !pip install --user pixiedust Or if you want to upgrade the version already installed: !pip install --user --upgrade --no-deps pixiedust 2. Cell2: import the pixiedust packageManager module and install graphframes. from pixiedust.packageManager import PackageManager pkg=PackageManager() pkg.installPackage(""graphframes:graphframes:0"") pkg.printAllPackages() sqlContext=SQLContext(sc) If all goes well, you should see a message printed in red in the output asking you to restart the kernel. You can do so using the menu: Kernel/Restart . 3. Once the kernel has restarted, run Cell2 again. Even though the Graphframes jar file is now part of the classpath, you still need to run the command to add the GraphFrames python APIs to the SparkContext. 4. Cell3: verify that GraphFrames is correctly installed. #import the display module from pixiedust.display import * #import the Graphs example from graphframes.examples import Graphs #create the friends example graph g=Graphs(sqlContext).friends() #use the pixiedust display display(g) Results of the code above should look like this: Note: I’ll be using the pixiedust display() API call in this post without diving into the details of how it’s built, which I’ll cover in a future post. CREATE A GRAPH WITH AIRPORTS AS NODES AND FLIGHT ROUTES AS EDGES At a high level, GraphFrames is to GraphX what DataFrames is to RDDs. It is built on top of Spark SQL and provides a set of APIs that elegantly combine Graph Analytics and Graph Queries: Diving into technical details, you need two DataFrames to build a Graph: one DataFrame for vertices and a second DataFrame for edges. With graphFrames successfully installed, we are now ready to load the data from the flight predict application . 
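Before loading the real flight data, the following tiny, self-contained sketch (toy airports and routes, not the flight predict data) shows the shape GraphFrames expects: a vertices DataFrame with an id column, and an edges DataFrame whose src and dst columns reference those ids.

from graphframes import GraphFrame

# Toy vertices: every vertex must have an 'id' column.
v = sqlContext.createDataFrame(
    [('BOS', 'Boston'), ('ORD', 'Chicago'), ('SFO', 'San Francisco')],
    ['id', 'city'])

# Toy edges: every edge must have 'src' and 'dst' columns referencing vertex ids.
e = sqlContext.createDataFrame(
    [('BOS', 'ORD', 'UA'), ('ORD', 'SFO', 'UA'), ('BOS', 'SFO', 'B6')],
    ['src', 'dst', 'carrierFsCode'])

toy_graph = GraphFrame(v, e)
toy_graph.degrees.show()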
As a reminder, the data lives in two Cloudant databases: * flight-metadata : contains the airports info * flightpredict_training_set : contains the flight routes augmented with weather info The first step is to configure the Cloudant-spark connector and load the 2 datasets: #Configure connector sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/training.py"") sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/run.py"") import training import run sqlContext=SQLContext(sc) training.sqlContext = sqlContext training.cloudantHost='dtaieb.cloudant.com' training.cloudantUserName='weenesserliffircedinvers' training.cloudantPassword='72a5c4f939a9e2578698029d2bb041d775d088b5' #load the 2 datasets airports = training.loadDataSet(""flight-metadata"", ""airports"") print(""airports count: "" + str(airports.count())) flights = training.loadDataSet(""pycon_flightpredict_training_set"",""training"") print(""flights count: "" + str(flights.count())) Results: Successfully cached dataframe Successfully registered SQL table airports airports count: 17535 Successfully cached dataframe Successfully registered SQL table training flights count: 33336 In this step, we build the vertices and edges DataFrames for our graph. The vertices (airports) must all have at least one edge (flights). They also must have a column named “id” that uniquely identifies the vertex. To meet these two requirements, the cell below performs a join between airports and flights, and renames the column “fs” (airport code) to “id”. from pyspark.sql import functions as f from pyspark.sql.types import * rdd = flights.flatMap(lambda s: [s.arrivalAirportFsCode, s.departureAirportFsCode]).distinct()\ .map(lambda row:[row]) vertices = airports.join( sqlContext.createDataFrame(rdd, StructType([StructField(""fs"",StringType())])), ""fs"" ).dropDuplicates([""fs""]).withColumnRenamed(""fs"",""id"") print(vertices.count()) The edges dataframe is almost ready, but we need to make sure that it has the columns “src” and “dst” that respectively reference the “id” of the source and destination airport. We also drop a few unneeded columns: edges=flights.withColumnRenamed(""arrivalAirportFsCode"",""dst"")\ .withColumnRenamed(""departureAirportFsCode"",""src"")\ .drop(""departureWeather"").drop(""arrivalWeather"").drop(""pt_type"").drop(""_id"").drop(""_rev"") We can now build the graph and display it: from graphframes import GraphFrame g = GraphFrame(vertices, edges) display(g) When you initially run this cell, you’ll see a table. But because pixiedust introspects the dataset, it knows it contains latitude and longitude coordinates that can be displayed on a map. Click the map pin icon to see the graph of airports and flights overlaid on a map of the United States: Note: The visualization above is coming from a sample pixiedust plugin that visualizes all the flights for selected airports. It also provides menus to display the vertices and edges as tables. LET’S DO SOME GRAPH COMPUTING! COMPUTE THE DEGREE FOR EACH VERTEX IN THE GRAPH The degree of a vertex is the number of edges incident to the vertex. In a directed graph, in-degree is the number of edges where vertex is the destination and out-degree is the number of edges where the vertex is the source. GraphFrames has properties for degrees , outDegrees and inDegrees . They return a DataFrame containing the id of the vertex and the number of edges. 
We then sort them in descending order: from pyspark.sql.functions import * degrees = g.degrees.sort(desc(""degree"")) display( degrees ) Results: COMPUTE A LIST OF SHORTEST PATHS FOR EACH VERTEX TO A SPECIFIED LIST OF LANDMARKS For this example we use the shortestPaths api that returns a DataFrame containing the properties for each vertex plus an extra column called distances that contains the number of hops to each landmark. In the following code, we use BOS and LAX as the landmarks: r = g.shortestPaths(landmarks=[""BOS"", ""LAX""]).select(""id"", ""distances"") display(r) Results: COMPUTE THE PAGERANK FOR EACH VERTEX IN THE GRAPH PageRank is a famous algorithm used by Google Search to rank vertices in a graph by order of importance. To compute pageRank, we’ll use the pageRank() API call that returns a new graph in which the vertices have a new pagerank column representing the pagerank score for the vertex, and the edges have a new weight column representing the edge weight that contributed to the pageRank score. We’ll then display the vertex ids and associated pageranks sorted in descending order: from pyspark.sql.functions import * ranks = g.pageRank(resetProbability=0.20, maxIter=5) display(ranks.vertices.select(""id"",""pagerank"").orderBy(desc(""pagerank""))) Results: SEARCH ROUTES BETWEEN TWO AIRPORTS WITH SPECIFIC CRITERIA In this section, we want to find all the routes between Boston and San Francisco operated by United Airlines with at most two hops. To perform this search, we use the bfs() ( breadth-first search ) API call that returns a DataFrame containing the shortest path between matching vertices. For clarity, we will only keep the edge when displaying the results: paths = g.bfs(fromExpr=""id='BOS'"",toExpr=""id = 'SFO'"",edgeFilter=""carrierFsCode='UA'"", maxPathLength = 2).drop(""from"").drop(""to"") display(paths) Results: FIND ALL AIRPORTS THAT DO NOT HAVE DIRECT FLIGHTS BETWEEN EACH OTHER In this section, we’ll use a very powerful graphFrames search feature that uses a pattern called motif to find nodes. We’ll use it to apply the pattern ""(a)-[]-(b)-[]-!(a)-[]->(c)"" , which searches for all nodes a, b and c that have a path to (a,b) and a path to (b,c) but not a path to (a,c). Also, because the search is computationally expensive, we reduce the number of edges by grouping the flights that have the same src and dst. from pyspark.sql import functions as F h = GraphFrame(g.vertices, g.edges.select(""src"",""dst"").groupBy(""src"",""dst"").agg(F.count(""src"").alias(""count""))) query = h.find(""(a)-[]-(b)-[]-!(a)-[]-(c)"").drop(""b"") display(query) Results: COMPUTE THE STRONGLY CONNECTED COMPONENTS FOR THIS GRAPH Strongly Connected Components are components for which each vertex is reachable from every other vertex. To compute them, we’ll use the stronglyConnectedComponents() API call that returns a DataFrame containing all the vertices, with the addition of a component column that contains the id value of each connected vertex. We then group all the rows by components and aggregate the sum of all the member vertices. This gives us a good idea of the components distribution in the graph. from pyspark.sql.functions import * components = g.stronglyConnectedComponents(maxIter=10).select(""id"",""component"")\ .groupBy(""component"").agg(F.count(""id"").alias(""count"")).orderBy(desc(""count"")) display(components) Results: DETECT COMMUNITIES IN THE GRAPH USING LABEL PROPAGATION ALGORITHM Label propagation is a popular algorithm for finding communities within a graph. 
It has the advantage of being computationally inexpensive and thus works well with large graphs. To compute the communities, we’ll use the labelPropagation() API call that returns a DataFrame containing all the vertices, with the addition of a label column that contains the id value of each connected vertex. Similar to the strongly connected components computation, we’ll then group all the rows by label and aggregate the sum of all the member vertices. from pyspark.sql.functions import * communities = g.labelPropagation(maxIter=5).select(""id"", ""label"")\ .groupBy(""label"").agg(F.count(""id"").alias(""count"")).orderBy(desc(""count"")) display(communities) Results: CONCLUSION In this post, we have learned several things: * How to use GraphFrames (and any other Spark packages) within an IPython notebook, including for the IBM Analytics for Apache Spark service on Bluemix. * We’ve introduced the pixiedust module that, among other things, provides a simple API to create compelling in-context interactive visualizations. * We’ve shown how to create a graph from data stored in the Cloudant JSON database service. * Finally, we’ve explored a few of the graph computation APIs provided by GraphFrames. Of course there is much more to explore, but hopefully this post gave you ideas you can reuse. All the exercises and code are conveniently available in a completed Jupyter Notebook . Feel free to import it into your own Spark environment or on the IBM Apache Spark service — and use it as a starting point in your own project. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: Apache Spark / GraphFrames / GraphX / IPython / Jupyter / Notebooks Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","We show how to build a graph of airports and flight paths using GraphFrames. Then, visualize the data and apply various graph algorithms to analyze it.",Getting started with GraphFrames in Apache Spark™,Live,239 695,"RStudio Blog * Home * Subscribe to feed SPARK 1.4 FOR RSTUDIO July 14, 2015 in RStudio IDE | Tags: Spark , SparkR Today’s guest post is written by Vincent Warmerdam of GoDataDriven and is reposted with Vincent’s permission from blog.godatadriven.com . You can learn more about how to use SparkR with RStudio at the 2015 EARL Conference in Boston November 2-4, where Vincent will be speaking live. This document contains a tutorial on how to provision a spark cluster with RStudio. You will need a machine that can run bash scripts and a functioning account on AWS. Note that this tutorial is meant for Spark 1.4.0. 
Future versions will most likely be provisioned in another way but this should be good enough to help you get started. At the end of this tutorial you will have a fully provisioned spark cluster that allows you to handle simple dataframe operations on gigabytes of data within RStudio. AWS PREP Make sure you have an AWS account with billing. Next make sure that you have downloaded your .pem files and that you have your keys ready. SPARK STARTUP Next go and get spark locally on your machine from the spark homepage . It’s a pretty big blob. Unzip it once it is downloaded go to the ec2 folder in the spark folder. Run the following command from the command line. ./spark-ec2 \ --key-pair=spark-df \ --identity-file=/Users/code/Downloads/spark-df.pem \ --region=eu-west-1 \ -s 1 \ --instance-type c3.2xlarge \ launch mysparkr This script will use your keys to connect to amazon and setup a spark standalone cluster for you. You can specify what type of machines you want to use as well as how many and where on amazon. You will only need to wait until everything is installed, which can take up to 10 minutes. More info can be found here . When the command signals that it is done, you can ssh into your machine via the command line. ./spark-ec2 -k spark-df -i /Users/code/Downloads/spark-df.pem --region=eu-west-1 login mysparkr Once you are in your amazon machine you can immediately run SparkR from the terminal. chmod u+w /root/spark/ ./spark/bin/sparkR As just a toy example, you should be able to confirm that the following code already works. ddf <- createDataFrame(sqlContext, faithful) head(ddf) printSchema(ddf) This ddf dataframe is no ordinary dataframe object. It is a distributed dataframe, one that can be distributed across a network of workers such that we could query it for parallelized commands through spark. SPARK UI This R command you have just run launches a spark job. Spark has a webui so you can keep track of the cluster. To visit the web-ui, first confirm on what IP-address the master node is via this command: curl icanhazip.com You can now visit the webui via your browser. :4040 From here you can view anything you may want to know about your spark clusters (like executor status, job process and even a DAG visualisation). This is a good moment to stand still and realize that this on it’s own right is already very cool. We can start up a spark cluster in 15 minutes and use R to control it. We can specify how many servers we need by only changing a number on the command line and without any real developer effort we gain access to all this parallelizing power. Still, working from a terminal might not be too productive. We’d prefer to work with a GUI and we would like some basic plotting functionality when working with data. So let’s install RStudio and get some tools connected. RSTUDIO SETUP Get out of the SparkR shell by entering q() . Next, download and install Rstudio. wget http://download2.rstudio.org/rstudio-server-rhel-0.99.446-x86_64.rpm sudo yum install --nogpgcheck -y rstudio-server-rhel-0.99.446-x86_64.rpm rstudio-server restart While this is installing. Make sure the TCP connection on the 8787 port is open in the AWS security group setting for the master node. A recommended setting is to only allow access from your ip. Then, add a user that can access RStudio. We make sure that this user can also access all the RStudio files. adduser analyst passwd analyst You also need to do this (the details of why are a bit involved). 
These edits need to be made because the analyst user doesn’t have root permissions. chmod a+w /mnt/spark chmod a+w /mnt2/spark sed -e 's/^ulimit/#ulimit/g' /root/spark/conf/spark-env.sh > /root/spark/conf/spark-env2.sh mv /root/spark/conf/spark-env2.sh /root/spark/conf/spark-env.sh ulimit -n 1000000 When this is known, point the browser to :8787 . Then login in as analyst. RSTUDIO – SPARK LINK Awesome. RStudio is set up. First start up the master submit. /root/spark/sbin/stop-all.sh /root/spark/sbin/start-all.sh This will reboot Spark (both the master and slave nodes). You can confirm that spark works after this command by pointing the browser to :8080 . Next, let’s go and start Spark from RStudio. Start a new R script, and run the following code: print('Now connecting to Spark for you.') spark_link <- system('cat /root/spark-ec2/cluster-url', intern=TRUE) .libPaths(c(.libPaths(), '/root/spark/R/lib')) Sys.setenv(SPARK_HOME = '/root/spark') Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/root/spark/bin', sep=':')) library(SparkR) sc <- sparkR.init(spark_link) sqlContext <- sparkRSQL.init(sc) print('Spark Context available as \""sc\"". \\n') print('Spark SQL Context available as \""sqlContext\"". \\n') LOADING DATA FROM S3 Let’s confirm that we can now play with the RStudio stack by downloading some libraries and having it run against a data that lives on S3. small_file = ""s3n://:@/data.json"" dist_df <- read.df(sqlContext, small_file, ""json"") %>% cache This dist_df is now a distributed dataframe, which has a different api than the normal R dataframe but is similar to dplyr . head(summarize(groupBy(dist_df, df$type), count = n(df$auc))) Also, we can install magrittr to make our code look a lot nicer. local_df <- dist_df %>% groupBy(df$type) %>% summarize(count = n(df$id)) %>% collect The collect method pulls the distributed dataframe back into a normal dataframe on a single machine so you can use plotting methods on it again and use R as you would normally. A common use case would be to use spark to sample or aggregate a large dataset which can then be further explored in R. Again, if you want to view the spark ui for these jobs you can just go to: :4040 A MORE COMPLETE STACK Unfortunately this stack has an old version of R (we need version 3.2 to get the newest version of ggplot2/dplyr). Also, as of right now there isn’t support for the machine learning libraries yet. These are known issues at the moment and version 1.5 should show some fixes. Version 1.5 will also feature RStudio installation as part of the ec2 stack. Another issue is that the namespace of dplyr currently conflicts with sparkr , time will tell how this gets resolved. Same would go for other data features like windowing function and more elaborate data types. KILLING THE CLUSTER When you are done with the cluster, you only need to exit the ssh connection and run the following command: ./spark-ec2 -k spark-df -i /Users/code/Downloads/spark-df.pem --region=eu-west-1 destroy mysparkr CONCLUSION The economics of spark are very interesting. We only pay amazon for the time that we are using Spark as a compute engine. All other times we’d only pay for S3. This means that if we analyse for 8 hours, we’d only pay for 8 hours. Spark is also very flexible in that it allows us to continue coding in R (or python or scala) without having to learn multiple domain specific languages or frameworks like in hadoop. Spark makes big data really simple again. 
This document is meant to help you get started with Spark and RStudio, but in a production environment there are a few things you still need to account for:

* security: the web connection is not served over https. Even though we tell Amazon to accept connections only from our IP, we may still be at risk if there is a man in the middle listening.
* multiple users: this setup works fine for a single user, but if several people share the cluster you will need to rethink some steps with regards to user groups, file access and resource management.
* privacy: this setup works well on EC2, but if you have sensitive, private user data you may need to do this on premise because the data cannot leave your own datacenter. Most install steps would be the same, but the initial installation of Spark would require the most work. See the docs for more information.

Spark is an amazing tool; expect more features in the future.

POSSIBLE GOTCHA

Hanging: it can happen that the ec2 script hangs at the Waiting for cluster to enter 'ssh-ready' state step. This can happen if you use Amazon a lot. To prevent it, you may want to remove some lines in ~/.ssh/known_hosts . More info here . Another option is to add the following lines to your ~/.ssh/config file.

# AWS EC2 public hostnames (changing IPs)
Host *.compute.amazonaws.com
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

3 COMMENTS

July 20, 2015 at 8:43 pm, ypouliot: Thanks Garrett, very helpful. I did run into one problem: the SparkR shell is not present. I.e., ./spark/bin/sparkR returns “no such file”. I couldn’t find it anywhere. Could you advise, please?

July 21, 2015 at 8:43 am, Vincent D. Warmerdam (@fishnets88): This file should be present on the Amazon server, not in the GitHub project. Just to double check: you were able to log in to the master node, and that server didn’t have ./spark/bin/sparkR ?

July 21, 2015 at 1:09 pm, ypouliot: That’s right, it wasn’t present on the master node.
Join 19,578 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:",Today’s guest post is written by Vincent Warmerdam of GoDataDriven and is reposted with Vincent’s permission from blog.godatadriven.com. You can learn more about how to use SparkR with …,Spark 1.4 for RStudio,Live,240 698,"Homepage IBM Watson Data Lab Follow Sign in / Sign up * Home * Cognitive Computing * Data Science * Web Dev * Brad Noble Blocked Unblock Follow Following Developer Advocacy at IBM. Formerly, product design at Cloudant (@ibmcloudant), founder at PostPost (RIP), and lunk at various agencies. 2 days ago -------------------------------------------------------------------------------- I AM NOT A DATA SCIENTIST BUT I PLAY ONE IN THIS BLOG POST, THANKS TO PIXIEDUST At a recent All Hands, I shared some thoughts about platforms and notebooks. If you weren’t there, you didn’t miss much. The only takeaway — and takeaway is probably generous — was this Venn diagram: Readers may notice that there’s an idea lurking in the footnote at the bottom of this diagram. The idea is that notebooks, considered by most to be the domain of the data scientist, have a real shot at helping teams of all types who are working on data problems. I’m happy with the colors, but to bring this idea to life, we’ll need more than a Venn diagram, amirite? Enter, PixieDust. NOTEBOOKS FOR EVERYONE PixieDust is a helper library for Python notebooks. It makes working with data simpler. With PixieDust, I can do this in a notebook… # load a CSV with pixiedust.sampledata() df = pixiedust.sampleData(""https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv"") # display the data with pixiedust display(df) Instead of doing all this… from pyspark.sql.types import DecimalType import matplotlib.pyplot as plt from matplotlib import cm import math #Load the csv, this assumes that the file is already downloaded on a local file system path=""/path/to/my/csv"" df3 = sqlContext.read.format('com.databricks.spark.csv')\ .options(header='true', mode=""DROPMALFORMED"", inferschema='true').load(path) maxRows = 100 def toPandas(workingDF): decimals = [] for f in workingDF.schema.fields: if f.dataType.__class__ == DecimalType: decimals.append(f.name) pdf = workingDF.toPandas() for y in pdf.columns: if pdf[y].dtype.name == ""object"" and y in decimals: #spark converts Decimal type to object during toPandas, cast it as float pdf[y] = pdf[y].astype(float) return pdf xFields = [""horsepower""] yFields = [""mpg""] workingDF = df3.select(xFields + yFields) workingDF = workingDF.dropna() count = workingDF.count() if count > maxRows: workingDF = workingDF.sample(False, (float(maxRows) / float(count))) pdf = toPandas(workingDF) #sort by xFields pdf.sort_values(xFields, inplace=True) fig, ax = plt.subplots(figsize=( int(1000/ 96), int(750 / 96) )) for i,keyField in enumerate(xFields): pdf.plot(kind='scatter', x=keyField, y=yFields[0], label=keyField, ax=ax, color=cm.jet(1.*i/len(xFields))) #Conf the legend if ax.get_legend() is not None and ax.title is None or not ax.title.get_visible() or ax.title.get_text() == '': numLabels = len(ax.get_legend_handles_labels()[1]) nCol = int(min(max(math.sqrt( numLabels ), 3), 6)) nRows = int(numLabels/nCol) bboxPos = max(1.15, 1.0 + ((float(nRows)/2)/10.0)) ax.legend(loc='upper center', bbox_to_anchor=(0.5, bboxPos),ncol=nCol, 
fancybox=True, shadow=True) #conf the xticks labels = [s.get_text() for s in ax.get_xticklabels()] totalWidth = sum(len(s) for s in labels) * 5 if totalWidth > 1000: #filter down the list to max 20 xl = [(i,a) for i,a in enumerate(labels) if i % int(len(labels)/20) == 0] ax.set_xticks([x[0] for x in xl]) ax.set_xticklabels([x[1] for x in xl]) plt.xticks(rotation=30) plt.show() To get this… A scatterplot! No code! With options and controls I can use! That’s data I can explore! STEPPING THROUGH THE BENEFITS With PixieDust, I can[1] 1. Visualize my data , without having to RTFM and trial-and-error Matplotlib (or other renderers) 2. Explore my data in an embedded interface, and switch between renderers (e.g., Matplotlib, Bokeh, Seaborn) 3. Use Spark , without having to RTFM Spark 4. Do those things , all of which I hadn’t done before — not even once — and then share those things with people, which I’m doing now! With PixieDust, data scientists and data engineers can * Use Python and Scala in the same notebook * Share variables between Scala and Python * Access Spark libraries written in Scala from Python notebooks * Access Python visualizations from Scala notebooks * Use any other tools they like, e.g., hard-coded Matplotlib, Bokeh, etc. Now, people with varied skills and skill levels — even people like me — can use and share notebooks, and collaborate. But don’t just take my word for it. Ben Hudson , an offering manager on the dashDB team, said this about PixieDust: I wanted an easy way to map out some geographical data I added to the dataset, but all the Python tutorials I had come across were too complex for my needs, so PixieDust was perfect for me. Instead of having to import a ton of packages and try to reverse-engineer code from an online tutorial, I only had to do a few clicks to generate a really nice map using PixieDust. PixieDust also made general graphing tasks a lot easier (no need for matplotlib) and it was really straightforward to use in general.(Ben’s even started logging PixieDust issues on Github . Thanks, Ben!) USE PIXIEDUST You have a couple options. IBM Data Science Experience (DSX): Check out the PixieDust intro notebook on DSX to see PixieDust in action. To play with this notebook in DSX, follow these steps to bring the notebook into your account: 1. Click Add Notebooks 2. Click From URL 3. Enter Notebook Name 4. Enter Notebook URL : https://github.com/ibm-cds-labs/pixiedust/raw/master/notebook/DSX/Welcome%20to%20PixieDust.ipynb 5. Select the Spark Service 6. Click Create Notebook Jupyter Notebooks : If you’re comfortable on the command line, you can run PixieDust inside Jupyter Notebooks on your laptop, too. The PixieDust installation guide has you covered for an easy install, and takes care of configuration and all the dependencies at once (e.g., installs Spark, Scala, the Cloudant-Spark connector, and a few sample notebooks). FURTHER READING * PixieDust on Github * PixieDust documentation * Announcing PixieDust 1.0 , by David Taieb FOOTNOTES [1] I am not a data scientist. Not even close. * Data Science * Python * Data Engineering * Scala * Apache Spark 17 Blocked Unblock Follow FollowingBRAD NOBLE Developer Advocacy at IBM. Formerly, product design at Cloudant ( @ibmcloudant ), founder at PostPost (RIP), and lunk at various agencies. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * 17 * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. 
Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates",PixieDust is a helper library for notebooks that makes it easier for teams of all types to work with data.,I Am Not a Data Scientist – IBM Watson Data Lab,Live,241 703,"Compose Databases * MongoDB * Elasticsearch * RethinkDB * Redis * PostgreSQL * etcd * RabbitMQ * ScyllaDB * MySQL Enterprise Pricing Articles Sign in Free 30-Day TrialOMNI LABS – MAKING THE MOST OF COMPOSE Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 8, 2016Learn how startup Omni Labs uses Compose-hosted MongoDB and a combination of Node.js, React, and Spark Python to help bootstrap their startup. We had the pleasure of meeting Vikram Tiwari, a full-stack developer at Omni Labs, at DataLayer 2016 in September. Tiwari presented on the topic of working with Compose to bootstrap your startup , based on his experience at Omni Labs, a bootstrapped startup in San Francisco that seeks to make it easier for marketers to work with data. We spoke with Tiwari and Alex Modon, CEO and co-founder, to learn more about their experience with MongoDB hosted on Compose. Omni is an ""automated visualization platform for marketers to see all of their data in one place, without having to manually do anything,"" explained Modon. While there are many BI tools available for companies to use, they still require quite technical specialization and time to manage. ""With our platform, there's no pixel placement or database integrations. Users just sign-in to see all of their up-to-date marketing KPI's in a custom dashboard."" The company's name comes from the pursuit of omnichannel marketing - gathering and analyzing data across multiple platforms to construct the most effective cross-media campaigns. Omni enables their customers to stream raw reports from their current media partners via API integrations while transforming that data into constantly updated KPI's. Omni is built on a Node.js backend with a React front-end and using Spark Python for data processing, with MongoDB and other databases underneath. ""Our Mongo-powered app serves as a center point for our customers,"" explained Vikram. ""Customers can do multiple queries on the data set, see past performance reports, or even set up alerts that get posted into a Slack channel."" All the server stacks are built around Node.js, and all the data that is collected goes through ETL pipelines built on Python and Google Cloud and processed by Spark. From there, the data is stored in Google's data warehouse, Big Query. ""The data is processed back to the client and we push some part of that data into MongoDB and some of it into Redis, based on how real-time the needs are."" While Omni is still in the early startup phase, much of their focus is on building predictive analytics for customers. They use Tensorflow for much of the machine learning process. ""Machines are really good at making decisions, as long as you feed them the right amount of data and tell them what success is. We're working really hard on rolling out products that are more predictive and help analyze opportunities to optimize campaigns, generate new media plans, and basically take care of vendor management."" Because the platform was built on various JavaScript tools (Node, JQuery, and React) MongoDB was an easy choice for Omni. The open-source MongoDB community, a plethora of answers available on StackOverflow, and a mature set of libraries also nudged them towards MongoDB. 
As for why Compose, Modon added, ""With any startup, the most valued resource is time. Compose removes the 'white knuckle' approach to database management. There's only so many hours in the day, so it's great knowing that our database is being taken care of by a company with a high level of quality and dedication.""

Jon Silvers works in marketing at Compose.",Customer use-case.,Making the Most of Compose – Customer: Omni Labs,Live,242 704,"MONGO METRICS: CALCULATING THE MODE

Published Apr 12, 2017

In this third entry in our Mongo Metrics series, we'll round out the ""top 3"" classical analytics methods by taking a look at the mode. Check out our previous articles in this series to learn more about computing and using the mean and median in MongoDB. We've seen how the mean and median each provide a different perspective on what a ""typical"" order looks like, and why it's important to view your data from several angles to understand it fully. Now, let's look at one more angle: the mode.

WHAT'S IN A MODE?

The mode is one of the simplest of the classic methods to understand: it is the most common item, the one occurring most frequently, in a set of data. Unlike the mean or median, the mode does not always yield a useful result. For example, if every value in our dataset occurs exactly once, there is no meaningful mode. Let's take a look at a data set where the mode has real value: determining which products or price points are popular. Mode is great for this because stores will often price many items at the same price points. By analyzing how well products sell at various price points, stores can determine more efficient pricing and improve their overall sales.
For this example, we'll borrow the pet store product catalog from our Metrics Maven's article on mode in PostgreSQL : order_id | date | item_count | order_value ------------------------------------------------ 50000 | 2016-09-02 | 3 | 35.97 50001 | 2016-09-02 | 2 | 7.98 50002 | 2016-09-02 | 1 | 5.99 50003 | 2016-09-02 | 1 | 4.99 50004 | 2016-09-02 | 7 | 78.93 50005 | 2016-09-02 | 0 | (NULL) 50006 | 2016-09-02 | 1 | 5.99 50007 | 2016-09-02 | 2 | 19.98 50008 | 2016-09-02 | 1 | 5.99 50009 | 2016-09-02 | 2 | 12.98 50010 | 2016-09-02 | 1 | 20.99 Which, stored in JSON format in MongoDB, looks like the following: { ""_id"" : ObjectId(""58db58313b9bbe23a46e91af""), ""order_id"" : 50005, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 0 } { ""_id"" : ObjectId(""58db58873b9bbe21cb6e91b1""), ""order_id"" : 50002, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 5.99 } { ""_id"" : ObjectId(""58db58d33b9bbe1f886e91b0""), ""order_id"" : 50003, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 4.99 } { ""_id"" : ObjectId(""58db58fc3b9bbe21cb6e91b2""), ""order_id"" : 50010, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 20.99 } { ""_id"" : ObjectId(""58db591d3b9bbe21cb6e91b3""), ""order_id"" : 50006, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 5.99 } { ""_id"" : ObjectId(""58db59403b9bbe21cb6e91b4""), ""order_id"" : 50008, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 5.99 } { ""_id"" : ObjectId(""58db596a3b9bbe21cb6e91b5""), ""order_id"" : 50009, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 2, ""order_value"" : 12.98 } { ""_id"" : ObjectId(""58db598c3b9bbe240a6e91ae""), ""order_id"" : 50007, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 2, ""order_value"" : 19.98 } { ""_id"" : ObjectId(""58db59ac3b9bbec65f6e91c0""), ""order_id"" : 50000, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 3, ""order_value"" : 35.97 } { ""_id"" : ObjectId(""58db59d43b9bbe21cb6e91b6""), ""order_id"" : 50004, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 7, ""order_value"" : 78.93 } Unlike PostgreSQL, MongoDB doesn't have a MODE keyword so we'll have to compute it ourselves. Luckily, the MongoDB aggregations pipeline comes to the rescue yet again. Let's take a look at how we can use it to compute the mode of our data set. GETTING IN THE MODE Before we get started, make sure you have a foundational understanding of the $match and $group operators in the MongoDB aggregation pipeline. If you need some background, you can check out our previous article on MongoDB aggregations by example . For our first step, we need to figure out which fields we want to calculate the mode on. Let's start by getting the mode of the order_value field so we can get a better picture of what a typical order value might be. Mode is calculated by grouping the data in the data set together based on order_value , counting the number of items in each group, and finding the group with the highest count. We can do that using the $group and $sum aggregation operators and filtering out any NULL or invalid fields by first running it through a $match operation. Then, we'll sort the results in descending order using the $sort aggregation operator. Finally, we'll return only the first document in the sort by using the $limit operator. 
Our aggregation starts with the $match operator to filter out our NULL values: { $match: { order_value: { $exists: true } } } Next, let's run our $group query to group all of our order values into distinct groups. We'll also need to count the number of times an order_value occurs so we can sort it later. We can do that all in one shot with the following query: { $group: { _id: ""$order_value"", count: { $sum: 1 } } } Once this stage of the pipeline is reached, our data should now look like the following: { ""_id"" : 78.93, ""count"" : 1 } { ""_id"" : 35.97, ""count"" : 1 } { ""_id"" : 19.98, ""count"" : 1 } { ""_id"" : 20.99, ""count"" : 1 } { ""_id"" : 12.98, ""count"" : 1 } { ""_id"" : 4.99, ""count"" : 1 } { ""_id"" : 5.99, ""count"" : 3 } The last step now is to find the order_value s with the maximum count. There are a few ways we can do this, but one of the simplest is to sort the data by the count field and then just return the top result. First, let's sort the data using the $sort aggregation. We'll sort on the count field, and sort in descending order: { $sort: { ""count"": -1 } } This should give us the following result: { ""_id"" : 5.99, ""count"" : 3 } { ""_id"" : 78.93, ""count"" : 1 } { ""_id"" : 35.97, ""count"" : 1 } { ""_id"" : 19.98, ""count"" : 1 } { ""_id"" : 20.99, ""count"" : 1 } { ""_id"" : 12.98, ""count"" : 1 } { ""_id"" : 4.99, ""count"" : 1 } Finally, we'll use the $limit aggregation to simply limit the return values to only the first one: { $limit: 1 } This should return only the first document that we matched: { ""_id"" : 5.99, ""count"" : 3 } And there's our mode - our most common order is one with a value of $5.99, and it was encountered 3 times. Our completed query looks like the following: > db.transactions.aggregate([ { $match: { order_value: { $exists: true } } }, { $group: { _id: ""$order_value"", count: { $sum: 1 } } }, { $sort: { ""count"": -1} } , { $limit: 1 } ]) We can also calculate the mode for the number of items purchase in a transaction by performing the same calculation on the item_count field: > db.transactions.aggregate([ { $match: { item_count: { $exists: true } } }, { $group: { _id: ""$item_count"", count: { $sum: 1 } } }, { $sort: { ""count"": -1} } , { $limit: 1 } ]) Which gives us the following: { ""_id"" : 1, ""count"" : 5 } This means that the most common number of items in a transaction is 1, and it occured in 5 transactions. WHY SHOULD I CARE? That's a great question - with all of those wonderful metrics out there, why should you care about the mode ? Like always, it comes down to giving you a different perspective on your data. You can read an excellent writeup about the differences in the Metrics Maven article on Mode , and the following table from that article is perhaps the most insightful: Mean item count = 2.10 Median item count = 1.5 Mode item count = 1 Mean order value = $19.98 Median order value = $10.48 Mode order value = $5.99 When you have data that's likely to repeat itself (ie: repeated transactions), the mode can show you details that mean and median don't. If we expected the mean or even the median to help us determine what to expect from a typical order, we might be very surprised when our projections were substantially off. Our median order value of $10.48 is almost double what our most frequent order price actually is. Mean here is almost completely useless as it is heavily skewed by a few outliers. 
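One caveat with the $limit: 1 approach above: when two or more values tie for the highest count, it silently returns just one of them. The sketch below is an addition to the article's example rather than part of it; it reuses the same collection and field names and shows one way to return every tied value, using only aggregation operators available in MongoDB 3.2 and later. Treat it as a starting point rather than the canonical solution.

// Hypothetical extension of the pipeline above: return *all* modes when
// several order values share the top count.
db.transactions.aggregate([
  { $match: { order_value: { $exists: true } } },
  // Count how often each order value occurs
  { $group: { _id: ""$order_value"", count: { $sum: 1 } } },
  // Most frequent values first
  { $sort: { count: -1 } },
  // Collect every value, remembering the highest count seen
  { $group: {
      _id: null,
      maxCount: { $first: ""$count"" },
      values: { $push: { value: ""$_id"", count: ""$count"" } }
  } },
  // Keep only the values whose count equals the maximum
  { $project: {
      _id: 0,
      modes: {
        $filter: {
          input: ""$values"",
          as: ""v"",
          cond: { $eq: [""$$v.count"", ""$maxCount""] }
        }
      }
  } }
])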
WRAPPING IT UP

While these are certainly not the only ways to compute metrics across our MongoDB data sets, the three ""classic"" statistical methods are a great starting point for analyzing your data. We'll continue this series at a later date by exploring more ways to analyze and gain insight from our MongoDB data.","In this third entry in our Mongo Metrics series, we'll round out the ""top 3"" classical analytics methods by taking a look at mode.",Mongo Metrics: Calculating the Mode,Live,243 705,"","Data exploration and analysis is a repetitive, iterative process, but in order to meet business demands, data scientists do not always have the luxury of long development cycles. What if data scientists could answer bigger and tougher questions faster? What if they could more easily and rapidly experiment, test hypotheses and work more collaboratively on interactive analytics?",Notebooks: A power tool for data scientists,Live,244 706,"SEIUM CONFERENCE AND HEARTBITS HACKATHON

Mike Elsmore / March 21, 2016

HOW WAS SEIUM?

SEIUM is a week-long polyglot conference held at the University of Minho in Braga, Portugal. I was invited to give a talk titled No Service, which centered on the technology ecosystem for offline Web development. My talk followed a morning workshop on building Android applications, which covered using local storage engines within apps.
I had a natural audience for my presentation: people already considering how to make applications work independent of servers.

SEIUM's excellent site design at seium.org

MY TALK

I covered three main subjects for offline-first HTML5 development:

1. Application Cache and its “gotchas” that make it useful but annoying
2. Service Workers and how they are fantastic but still not fully compliant across browsers
3. In-browser storage and how, with localStorage, it's easy to implement as a key-value store

From this point, I got on to the subject of Local Databases.

Local Databases were a big focus of my talk. During this section I described the current ecosystem of browsers and their support for IndexedDB and the deprecated-but-still-widely-used Web SQL. I then looked at how libraries like LocalForage, Dexie and PouchDB abstract the pain of dealing with Local Databases. Going further, I explained the utility of PouchDB and its amazing reproduction of the Apache CouchDB interface, which allows it to work seamlessly with IBM Cloudant and other tools that implement the CouchDB replication protocol. I also encouraged audience participation by using http://elsmore.me/seium-demo/ on stage, a basic chat app that uses PouchDB to demonstrate data sync functionality and offline capabilities.

I received lots of good questions, including the all-important one on “what not to store in PouchDB”. It was a pleasure to be invited to participate in the conference, and I hope I get the opportunity to do so again in the future.

OFFLINE-FIRST IN HEARTBITS HACK

HeartBits was a hackathon organized by the Medical and Informatics faculties of the University to explore how technology could be applied to improve general health. With so many developers who had never been to a hackathon before, they produced a wide range of ideas.

I spoke with many of the attendees and received a fantastic overview of the student doctors' goals — and an even better view of how their engineering teammates designed apps to achieve them. The collaboration between the two disciplines was astonishing.

During the event one team used some of the tools covered in my talk. They built a prototype of an offline Web app called GestaMed to help women manage their health and track medication schedules during pregnancy. The team consisted of four great people:

* Diogo Barroso, Faculty of Engineering of University of Porto (GitHub, LinkedIn)
* João Maia, Faculty of Engineering of University of Porto (GitHub, LinkedIn)
* Sofia Sousa Teles, Faculty of Medicine at the University of Lisbon
* Miguel Mendes, Faculty of Engineering of University of Porto

Medicine info in GestaMed (1 of 2)
Medicine info in GestaMed (2 of 2)

They researched and built a database of medications from textbooks on drug interactions and applied this data to ensure safe consumption during different stages of pregnancy. This database was then imported into IBM Cloudant so that it could be replicated to all the apps. The 24-hour deadline didn't leave much time to devote to native UI development, so they built an Apache Cordova app to allow for cross-platform use. Finally, they also used PouchDB as the local storage to seamlessly sync data with IBM Cloudant.
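The GestaMed code itself isn't shown in this post, but the PouchDB-to-Cloudant replication pattern described above generally comes down to a handful of lines. Here is a hypothetical sketch; the database name, the sample document, and the Cloudant URL are placeholders, not the team's actual setup:

// A rough sketch of the offline-first pattern described above: a local
// PouchDB database that the app reads and writes, kept in sync with a
// remote IBM Cloudant database whenever a connection is available.
// All names and credentials below are placeholders.
var PouchDB = require('pouchdb');

var localDB = new PouchDB('medications');
var remoteDB = new PouchDB('https://USERNAME:PASSWORD@ACCOUNT.cloudant.com/medications');

// Writes go to the local database, so the app keeps working offline
localDB.put({ _id: 'med:paracetamol', trimesterGuidance: ['safe', 'safe', 'safe'] })
  .catch(function (err) { console.error(err); });

// Live, retrying, two-way replication keeps the local and remote copies in step
PouchDB.sync(localDB, remoteDB, { live: true, retry: true })
  .on('change', function (info) { console.log('synced a change', info); })
  .on('error', function (err) { console.error('sync error', err); });

Because PouchDB runs in any modern webview, the same code works unchanged inside an Apache Cordova app, which helps explain why the combination suits a 24-hour hackathon.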
They didn't win, but they did an amazing job. It was brilliant watching them learn new technologies. All in all it was an amazing weekend. Here's to next time. Cheers!","Recap of the SEIUM conference in Braga, Portugal, and its companion HeartBits hackathon. Offline-first Web apps were a big topic.",SEIUM Conference and HeartBits Hackathon,Live,245 708,"OFFLINE VERSE

Bradley Holt / May 26, 2016

A screenshot of IBM Verse, our web-based email and calendaring software.

One of my areas of focus as a Developer Advocate here at IBM Cloud Data Services is Offline First, an approach to building web and mobile apps in which the app is designed to work in the most resource-constrained environment first, with progressive enhancement then applied to take advantage of network connectivity when available. I spoke with Yingle Jia (Senior Software Engineer, IBM Verse and IBM Notes) about the offline capabilities recently added to IBM Verse, our web-based business email and calendaring software.

Bradley: Verse is IBM's new web-based email client. For those who aren't familiar with Verse, can you give us a brief overview of Verse and why it was created?

Yingle: Yes, sure. IBM Verse is a cloud-based business email and calendaring offering. It is email reimagined for a new way to work, not just another email client. From the beginning, IBM Verse was created to employ innovative user-centric design, advanced search and social analytics to help users quickly find and focus on the things that are important to them.

Bradley: My manager was recently using Verse and he noticed the new ""offline settings"" section. What sorts of offline capabilities does Verse have? For example, can I read and respond to email while offline?

Yingle: Thanks for trying! We designed Verse offline to be a complement to the Verse online experience. For the initial offline GA, we support synchronization of 7 days of mail in all folders, 7 days of preceding calendar events, and 30 days of future events. Common email operations like reading, composing, saving, sending email, and moving to folders are supported while offline. Security is important for business email, so we encrypt the offline storage. Moreover, we are committed to continuously improving the offline capabilities and user experience over time.

Bradley: What was the motivation for building offline capabilities into Verse?
Yingle: The web-based approach allows us to quickly roll out new features and bug fixes, however, our customers, including IBM itself, made it clear that they need to be able to access Verse while offline, for example when on an air plane or at a customer site where network access is not available or limited. Also, caching data locally can greatly improve user experience, even when the user is connected. Caching is important for cloud-based offerings! Bradley: What browser features were required to make Verse work offline? Have you encountered any browser compatibility issues? Yingle: Verse offline is built upon standard web technologies and we support all major browsers which are supported for online. Technologies being used include IndexedDB, WebCrypto, Web Workers, etc. We did encounter a couple of browser compatibility issues, and reported defects to the corresponding browser vendors. Of course, we tried hard to avoid browser specific code, and the majority ( 99%) of our code is optimized to run well in all major browsers. Bradley: Were the offline capabilities in Verse added from the beginning? If not, were there any challenges with adding offline capabilities after the initial development of Verse? Yingle: We officially started offline support work after the initial Verse GA. MVC design pattern is heavily used in Verse from the beginning, which makes it easier to add offline support without major architectural changes. Of course, it is still a big challenge to add offline support, since we cannot stop the agile development of Verse to add offline support, and we have a dozen teams working on Verse development! Bradley: I recently spoke with the development team at The Weather Company (a recent IBM acquisition). They have put significant efforts into developing a Progressive Web App. Have you considered taking a Progressive Web App approach to Verse? Yingle: Yes, definitely, we are deeply interested in leveraging new web technologies and programming patterns to improve Verse! A big thanks to Yingle Jia for taking the time to talk with me about the offline capabilities in IBM Verse! If you’re interested in getting more involved in the Offline First movement then please consider joining us for Offline Camp , a three day retreat (June 24-27 th ) in the Catskill Mountains . Offline Camp will be a small gathering of about 30 developers, designers, and others interested in furthering the Offline First movement. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: IBM Verse / Offline First / Progressive Web Apps Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.","I spoke with Yingle Jia (Senior Software Engineer, IBM Verse and IBM Notes) about the offline capabilities recently added to IBM Verse, our web-based business email and calendaring software.",Offline Verse,Live,246 711,"Compose The Compose logo Articles Sign in Free 30-day trialDOCUMENT VALIDATION IN MONGODB BY EXAMPLE Published Feb 16, 2017 mongodb developing Document Validation in MongoDB By ExampleIn this article, we'll explore MongoDB document validation by example using an invoice application for a fictitious cookie company. We'll look at some of the different types of validation available in MongoDB, and provide a practical working example of validations in action. Document validation was introduced in MongoDB 3.2 and defines a new way for developers to control the type of data being inserted into their MongoDB instances. We like to show rather than tell so we'll use a practical example to demonstrate basic validations and the commands used to add them to MongoDB. DOCUMENT VALIDATION IN A NUTSHELL Document databases are a flexible alternative to the pre-defined schemas of relational databases. Each document in a collection can have a unique set of fields, and those fields can be added or removed from documents once they are inserted which makes document databases, and MongoDB in particular, an excellent way to prototype applications. However, this flexibility is not without cost and the most underestimated cost is that of predictability. Since the data fields stored in a document can be changed for each document in a collection, developers lose the ability to make assumptions about the data stored in a collection. This can have major implications in your applications - if a transaction in a finance application was inserted with the wrong fields it could throw off calculations and reports that are vital to business. Developers accustomed to relational databases recognize the importance of predictability in data formats, and that's one of the reasons that validation was introduced in MongoDB 3.2. Let's see how document validation works by making an application that uses it. CREATING THE DATA MODELS As a demonstration by example, we're going to create a fictitious cookie company. We'll use this as an example since the entities in this kind of business can be generalized to apply to other businesses. In this case, we'll simply to 3 main data entities: 1. A Customer , which represents a person making a purchase 2. A Product , which represents an item being sold 3. A Transaction , which represents the purchase of a number of products by a customer. Since this is a trivial example, let's build these out in minimal form. In practice, you can make your data entities as complex as you need. CUSTOMER The Customer entity represents someone making a purchase, so we'll include some data typically found in a Customer entity. A typical customer entity might look like the following: { ""id"": ""1"", ""firstName"": ""Jane"", ""lastName"": ""Doe"", ""phoneNumber"": ""555-555-1212"", ""email"": ""Jane.Doe@compose.io"" } Once we know what properties we'll want to include, we'll need to determine what types of validations we'd like to do on this entity. The first step in adding validation is to figure out exactly what we'd like to validate. 
We can validate any of the fields in a collection and can validate based on the existence of a field, data type and format in that field, values in a field, and correlations between two fields in a document. In the case of the Customer entity, we'd like to validate the following: * firstName , lastName , phoneNumber and email are all required to exist * phoneNumber is inserted in a specific format (123-456-7890) * email exists (we won't validate email format for now) We can represent these validations in an intermediate format (before putting them into the database) using the JSONSchema spec. While JSONSchema isn't a necessary step to do validations in MongoDB, it's helpful to codifying our rules in a standard format and JSONSchema is quickly gaining traction for doing server-side validations. { ""$schema"": ""http://json-schema.org/draft-04/schema#"", ""type"": ""object"", ""properties"": { ""id"": { ""type"": ""string"" }, ""firstName"": { ""type"": ""string"" }, ""lastName"": { ""type"": ""string"" }, ""phoneNumber"": { ""type"": ""string"", ""pattern"": ""^([0-9]{3}-[0-9]{3}-[0-9]{4}$"" }, ""email"": { ""type"": ""string"" } }, ""required"": [ ""id"", ""firstName"", ""lastName"", ""phoneNumber"", ""email"" ] } Using JSONSchema also allows us to re-use validations on the application side as well, such as RESTHeart's JSONSchema validation . PRODUCT Just as we did above with the Customer entity, let's take a look at what an example Product entity might contain: { ""id"": ""1"", ""name"": ""Chocolate Chip Cookie"", ""listPrice"": 2.99, ""sku"": 555555555, ""productId"": ""123abc"" } We'll also codify our validations in JSONSchema format as well: { ""$schema"": ""http://json-schema.org/draft-04/schema#"", ""type"": ""object"", ""properties"": { ""id"": { ""type"": ""string"" }, ""name"": { ""type"": ""string"" }, ""listPrice"": { ""type"": ""number"" }, ""sku"": { ""type"": ""integer"" }, ""productId"": { ""type"": ""string"" } }, ""required"": [ ""id"", ""name"", ""listPrice"", ""sku"", ""productId"" ] } TRANSACTION The last entity we'll use in our fictitious cookie shop is a transaction. A transaction represents a single purchase of one or more products by one customer (many-to-one relationship). An inserted transaction record might look like the following: { ""id"": ""1"", ""productId"": ""1"", ""customerId"": ""1"", ""amount"": 20.00 } Lastly, we'll codify the validations we want in JSONSchema format: { ""$schema"": ""http://json-schema.org/draft-04/schema#"", ""type"": ""object"", ""properties"": { ""id"": { ""type"": ""string"" }, ""productId"": { ""type"": ""string"" }, ""customerId"": { ""type"": ""string"" }, ""amount"": { ""type"": ""number"" } }, ""required"": [ ""id"", ""productId"", ""customerId"", ""amount"" ] } Now that we have the structure and validation rules for our application, let's add these validation rules to our Mongo database. ADDING VALIDATION RULES Now that we have an idea of how we want to validate our data, let's add those validation rules to a MongoDB collection. First, let's spin up a new MongoDB on Compose deployment and create a new database for your cookie shop. Be sure to add a database user so we can connect to the database after this step. We'll create a new collection using the mongo command line application, which you can install for your platform . Once you've installed the mongo command line application, created a new database, and added a database user, it's time to create your collection through the mongo command line tool. 
Open a terminal and type the following: mongo mongodb://dbuser:secret@aws-us-east-1-portal.8.dblayer.com:15234/cookieshop This will load up the interactive mongo shell. Now, let's create our collections in the database with the validations we determined earlier. We'll start with the Customer collection: > db.createCollection(""customers"", { validator: { $and: [ { ""firstName"": {$type: ""string"", $exists: true} }, { ""lastName"": { $type: ""string"", $exists: true} }, { ""phoneNumber"": { $type: ""string"", $exists: true, $regex: /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/ } }, { ""email"": { $type: ""string"", $exists: true } } ] } }) We'll leave email validation alone for now since it can be a bit complicated for a trivial example. Next, let's add our products collection and validations: > db.createCollection(""products"", { validator: { $and: [ { ""name"": {$type: ""string"", $exists: true} }, { ""listPrice"": { $type: ""double"", $exists: true} }, { ""sku"": { $type: ""int"", $exists: true} } ] } }) Finally, we'll add our transactions collection which contains a reference to documents in the products and customers collections: db.createCollection(""transactions"", { validator: { $and: [ { ""productId"": {$type: ""objectId"", $exists: true} }, { ""customerId"": { $type: ""objectId"", $exists: true} }, { ""amount"": { $type: ""double"", $exists: true} } ] } }) The objectId type is a special type that allows us to reference documents from other collections. In our case, we'll use it to associate a specific product and user in a transaction. TESTING VALIDATIONS Now, it's time to test our validations to make sure they worked out. We'll start by adding a new customer: db.customers.insertOne({ firstName: ""John"", lastName: ""O'Connor"", phoneNumber: ""555-555-1212"" }); Notice that we've omitted the email field from our user, which was marked as required when we set up our validations. If we set up the validations correctly, we'd expect the insertion to fail which it does: 2017-02-09T12:45:36.714-0800 E QUERY [thread1] uncaught exception: WriteError({ ""index"" : 0, ""code"" : 121, ""errmsg"" : ""Document failed validation"", ""op"" : { ""_id"" : ObjectId(""589cd4f06ca2fef0f7737fb9""), ""firstName"" : ""John"", ""lastName"" : ""O'Connor"", ""phoneNumber"" : ""555-555-1212"" } }) : undefined Once we add the email field to the customer, the validation passes and the new customer is inserted: { ""acknowledged"" : true, ""insertedId"" : ObjectId(""589cd56b6ca2fef0f7737fbc"") } The acknowledge message lets us know that the customer was inserted correctly. Save the insertedId for later as we're going to use it when we make a new transaction. Now, let's add a product and a transaction: db.products.insertOne({ name: ""Chocolate Chip"", listPrice: 2.99, sku: 1 }); Again, make sure to keep track of the insertedId so we can use it while making a transaction. Finally, let's add a transaction in which our new customer purchases our new product: db.transactions.insertOne({ productId: ObjectId(""589cd9216ca2fef0f7737fc4""), customerId: ObjectId(""589cd56b6ca2fef0f7737fbc""), amount: 2.99 }); WRAPPING UP While document validations aren't necessarily desirable in all scenarios, they provide developers with a more robust set of options when deciding where they want to place the responsibility for data integrity within their applications. In this article, we demonstrated how to create collections that have validations in MongoDB to ensure our data has a predictable format and set of data. 
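One follow-on worth knowing, although it isn't covered in the walkthrough above: validation rules aren't fixed at collection-creation time. The sketch below is an illustrative assumption rather than part of the original example; it uses MongoDB's collMod command to replace the customers validator after the fact, here with a deliberately looser rule set.

// Hypothetical example: adjusting an existing validator with collMod.
// validationLevel ""moderate"" applies the rules to inserts and to updates of
// documents that already pass validation; ""strict"" (the default) applies
// them to every insert and update.
db.runCommand({
  collMod: ""customers"",
  validator: { $and: [
    { ""firstName"": { $type: ""string"", $exists: true } },
    { ""lastName"": { $type: ""string"", $exists: true } },
    { ""email"": { $type: ""string"", $exists: true } }
  ] },
  validationLevel: ""moderate""
})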
In the next article, we'll use that predictability with MongoDB aggregations to gain insights into our fictitious business by mining the data in our database.","In this article, we'll explore MongoDB document validation by example using an invoice application for a fictitious cookie company. We'll look at some of the different types of validation available in MongoDB, and provide a practical working example of validations in action.",Document Validation in MongoDB By Example,Live,247 713,Cloudant Query provides you with a declarative way to define and query indexes. This video introduces you to Cloudant Query concepts. Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center,Cloudant Query provides you with a declarative way to define and query indexes. This video introduces you to Cloudant Query concepts.,Introducing the new Cloudant query,Live,248 721,"Glynn Bird, Developer Advocate @ IBM Watson Data Platform. Views are my own etc. Jul 18

QUERYING YOUR CLOUDANT DATABASE WITH SQL

UPDATING THE SILVERLINING NODE.JS LIBRARY TO SUPPORT THE BASICS OF SQL

Cloudant and its Apache CouchDB stable-mate are “NoSQL” databases — that is, they are schemaless JSON document stores. Unlike a traditional relational database, you don't need to define your schema before writing data to the database. Just post your JSON to the database and change your mind as often as you like! One of the appealing things about relational databases is the query language. Structured Query Language, or SQL, was developed by IBM in the 1970s and has been widely adopted across a host of databases ever since. In its simplest form, SQL reads like a sentence:
In its simplest form, SQL reads like a sentence: SELECT name, colour, price FROM animalsdb WHERE type='cat' OR (price > 500 AND price < 1000) LIMIT 50 This statement translates to: “Fetch me the name, colour and price from the animals database, but only the rows that are cats, or ones which are more expensive than 500 but cheaper than 1000. And I only want a maximum of 50 rows returned.”It is a convenient way of expressing the fields you want to fetch, the filter you wish to apply to the data, and the maximum number of rows you want in reply. Many databases can store BLOB types , but this isn’t one of those kinds of blobs. Image credit: mark du toit .Unfortunately, NoSQL databases don’t generally support the SQL language. Cloudant and Apache CouchDB™ have their own form of query language where the query is expressed as a JSON object: “ Cloudant Query ” (CQ) and “ Mango ,” in their respective contexts. The CQ or Mango equivalent of the above SQL statement is: It’s a world of curly brackets! If you’re happier expressing your query in SQL, then there is a way. SILVERLINING + SQL The latest version of the silverlining Node.js library can now accept SQL queries. It will convert the SQL into a Cloudant Query and deliver the results. Simply install the Silverlining library: npm install -s silverlining And add it to your Node.js app by passing your Cloudant URL to the library: var db = require('silverlining')('https://USER:PASS@HOST.cloudant.com/animalsdb' We can then start querying our database with an SQL statement: db.query('SELECT name FROM animalsdb').then(function(data) { // data! }); Here are some other sample queries: Silverlining achieves this by converting your SQL query into the equivalent Cloudant Query object. If you’d like to see that data yourself, then call the explain function instead of query to be returned by the query that would have been used: LIMITATIONS Before we get carried away, this feature doesn’t suddenly make Cloudant support joins, unions, transactions, stored procedures etc. It’s just a translation from SQL to Cloudant Query . It doesn’t support aggregations or grouping either, but you can use Silverlining’s count , sum , and stats functions to generate performant grouped aggregation without any fuss. This feature simply makes it easier to explore data sets if you already have SQL language experience. If you enjoyed this article, please ♡ it to recommend it to other Medium readers. * Web Development * JavaScript * Couchdb * Cloudant * Database Blocked Unblock Follow FollowingGLYNN BIRD Developer Advocate @ IBM Watson Data Platform. Views are my own etc. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","Cloudant and its Apache CouchDB stable-mate are “NoSQL” databases — that is, they are schemaless JSON document stores. 
Unlike a traditional relational database, you don’t need to define your schema…",Querying your Cloudant database with SQL – IBM Watson Data Lab – Medium,Live,249 722,"Homepage IBM Watson Data Lab Follow Sign in Get started * Home * Web Dev * Serverless * Data Science * Object Storage * Containers * Mark Watson Blocked Unblock Follow Following Developer Advocate, IBM Watson Data Platform Oct 18, 2017 -------------------------------------------------------------------------------- BUILDING YOUR FIRST MACHINE LEARNING SYSTEM TRAIN YOUR MODEL AND DEPLOY IT, WATSON ML FOR DEVELOPERS (PART 2) In Part 1 I gave you an overview of machine learning, discussed some of the tools you can use to build end-to-end ML systems, and the path I like to follow when building them. In this post we are going to follow this path to train a machine learning model, deploy it to Watson ML, and run predictions against it in real time. Look Ahead: In Part 3 we’ll create a small web application and backend to demonstrate how you can integrate Watson ML and make machine learning predictions in an end-user application. The Model Cafe in the Allston neighborhood of Boston. Image: Toby McGuire .We are going to use our small data set from Part 1 because the point of this post is to get something up and running quickly — not to actually build an accurate system for making predictions. Here’s the data set: Square Feet # Bedrooms Color Price ----------- ---------- ----- ----- 2,100 3 White $100,000 2,300 4 White $125,000 2,500 4 Brown $150,000 In Part 1 I talked about the tools I use to build machine learning systems. Before we start building our ML system, let’s setup our tools. TOOL SETUP Bluemix/DSX: You’ll need a Bluemix and Data Science Experience account. If you don’t have one, go to https://datascience.ibm.com to sign up. This will create a single account where you can access Bluemix and DSX. Watson Machine Learning: You’ll need an instance of Watson Machine Learning. You can provision a new instance here . Apache Spark™: You’ll need a Spark instance, but if you don’t have one now you can create one later. Now that you’re all set up, let’s follow the process I outlined in Part 1. STEP 1: IDENTIFY WHAT YOU WANT TO PREDICT AND THE SOURCE OF YOUR DATA We’ve identified that we want to predict house prices, and the data set we want to use to drive those predictions. I have made the data set available on GitHub: https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv This URL is important because we’ll need to pull this data into our Jupyter Notebook in the next step. STEP 2: CREATE A JUPYTER NOTEBOOK — IMPORT, CLEAN, AND ANALYZE THE DATA CREATE A JUPYTER NOTEBOOK We’re going to analyze our data in a Jupyter Notebook in the IBM Data Science Experience. Jupyter Notebooks are documents that run in a web browser and are composed of cells. Cells can contain markup or executable code. We’ll be coding in Python. I’ll show you how we can import and analyze our data with just three lines of code. -------------------------------------------------------------------------------- Download the following notebook to your computer: https://dataplatform.ibm.com/analytics/notebooks/3e83ffa1-f52a-4b76-bbb5-498b6b7f9505/view?access_token=a7dfdd01dbc24c53a5ac9688fbdd32da1b59156117d721fe10d12660f18dd591 Open DSX and create a new project called “Watson ML for Developers”. From here, create a new Spark instance for it. In the project navigate to Analytic assets and click New notebook . Choose From file . 
Specify a name, like “House Prices”, and choose the notebook you downloaded above. Finally, click Create Notebook . You should be taken directly to edit the notebook. -------------------------------------------------------------------------------- If this is your first time using Jupyter notebooks here are a few tips that you may find helpful (if you are already familiar with Jupyter notebooks, feel free to skip ahead): 1. Always make sure your kernel is running. You should see the status of your kernel in the top right. 2. If your kernel is not running, you can restart it from the Kernel menu. From here you can also interrupt your kernel, or change your kernel (if you want to use a different version of Python or Apache Spark). 3. A notebook is made of up markup and code cells. You can walk through the notebook and execute the code cells by clicking the run button in the toolbar or from the Cell menu. -------------------------------------------------------------------------------- IMPORT, CLEAN, AND ANALYZE THE DATA Let’s look at the first three code cells in the notebook where we will load and analyze our data. Here’s the first code cell: import pixiedust This cell just imports a Python library called PixieDust . PixieDust is an open source helper library that works as an add-on to Jupyter Notebooks that makes it easy to import and visualize data. In the second cell we load our sample data: df = pixiedust.sampleData(""https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv"") This will generate a Spark DataFrame called “df”. A DataFrame is a data set organized into named columns. You can think of it as a spreadsheet, or a relational database table. The Spark ML API uses DataFrames to train and test ML models. Finally, we’ll call the display function in PixieDust to display our data: display(df) It should look something like this: In this case we are displaying a simple table, but PixieDust also provides graphs and charts for helping you understand and analyze your data without writing any code. In just three lines of code we have imported and analyzed our data set. Now it’s time to do some machine learning! STEP 3: USE APACHE SPARK ML TO BUILD AND TEST A MACHINE LEARNING MODEL BUILD A MACHINE LEARNING MODEL We’re going to build our first ML model in just a handful of cells. To start we need to import the Spark ML libraries that we’ll be using: from pyspark.ml import Pipeline from pyspark.ml.regression import LinearRegression from pyspark.ml.feature import VectorAssembler This is a regression problem (we’re trying to predict a real number), so we are going to use the LinearRegression algorithm in pyspark.ml.regression . There are other regression algorithms, but those are outside of the scope of this post. We are going to build our ML model in just four lines of code. These four lines are in a single cell in our notebook, like so: assembler = VectorAssembler( inputCols=['SquareFeet','Bedrooms'], outputCol=""features"" ) lr = LinearRegression(labelCol='Price', featuresCol='features') pipeline = Pipeline(stages=[assembler, lr]) model = pipeline.fit(df) Let’s break this down, line by line. First of all we need to specify our features . In the previous post we decided that we would use Square Feet and # Bedrooms as our features. 
Our ML algorithm expects a single vector of feature columns, so here we use a VectorAssembler to tell our ML pipeline (we’ll talk about pipelines in a minute) that we want SquareFeet and Bedrooms as our features: assembler = VectorAssembler( inputCols=['SquareFeet','Bedrooms'], outputCol=""features"" ) Next, we create an instance of LinearRegression , the ML algorithm we are going to use. At a minimum, you must specify the features and the labels. There are other parameters you can provide to tweak the algorithm, but they’re not going to do us much good when working with three data points :) lr = LinearRegression(labelCol='Price', featuresCol='features') Next, we create our pipeline . A Pipeline allows us to specify the steps that should be performed when training an ML model. In this case, we first want to assemble our two feature columns into a single vector — that’s the assembler. Then we want to run it through our LinearRegression algorithm. In upcoming posts I’ll discuss other operations that you’ll run through the pipeline — like converting non-numeric data to numeric data. pipeline = Pipeline(stages=[assembler, lr]) Finally, we pass our DataFrame to the fit method on the pipeline to create our ML model. model = pipeline.fit(df) Congratulations, you now have a machine learning model that you can use to predict house prices! TEST THE MODEL It’s time to test our model. In our example we are going to run a single prediction. In future posts I’ll discuss how you can analyze the accuracy of your model by running a large number of predictions based on your original data set. Here we create a Python function to get our prediction: def get_prediction(square_feet, num_bedrooms): request_df = spark.createDataFrame( [(square_feet, num_bedrooms)], ['SquareFeet','Bedrooms'] ) response_df = model.transform(request_df) return response_df Let’s break this cell down. First of all, in order to generate a prediction against an ML model generated using Spark ML, we need to pass it a DataFrame with the data we want to use in our prediction (i.e., the square footage and # bedrooms for the house price we want to predict). This line of code creates the DataFrame we’ll pass to our model: request_df = spark.createDataFrame( [(square_feet, num_bedrooms)], ['SquareFeet','Bedrooms'] ) Then we’ll call transform on the model, passing in the request DataFrame. This returns another DataFrame: response_df = model.transform(request_df) Let’s run a prediction for a house that is 2,400 square feet and has 4 bedrooms: response = get_prediction(2400, 4) response.show() The result is a DataFrame that looks like this: +----------+--------+------------+------------------+ |SquareFeet|Bedrooms| features| prediction| +----------+--------+------------+------------------+ | 2400| 4|[2400.0,4.0]|137499.99999999968| +----------+--------+------------+------------------+ Tip: You can use PixieDust to visualize any DataFrame, including this one. If you’ve imported PixieDust and you have a DataFrame, display() is your friend :)Our ML model returned back our features along with a prediction. In this case, it predicted that a house that is 2,400 square feet and has 4 bedrooms should have a price of about $137,500, which is directly in between our 2,300 square foot house and our 2,500 square foot house. STEP 4: DEPLOY AND TEST THE MODEL WITH WATSON ML DEPLOY THE MODEL We’ve trained and tested our machine learning model, but if we want to predict house prices from a web or mobile app it’s not going to do us much good in this notebook. 
That’s where Watson ML comes in. In the same notebook, we’re going to deploy this model to Watson ML and create a “scoring endpoint”, or a REST API for making predictions. The first thing you’ll need to do is specify your Watson ML credentials. You can find your credentials by going to the Watson ML in Bluemix and clicking Service Credentials on the left ( head to the catalog to deploy it ): Fill in the following cell with your credentials: service_path = 'https://ibm-watson-ml.mybluemix.net' username = 'YOUR_WML_USER_NAME' password = 'YOUR_WML_PASSWORD' instance_id = 'YOUR_WML_INSTANCE_ID' model_name = 'House Prices Model' deployment_name = 'House Prices Deployment' The next cell initializes some libraries for connecting to Watson ML. These libraries are built into DSX: from repository.mlrepositoryclient import MLRepositoryClient from repository.mlrepositoryartifact import MLRepositoryArtifact ml_repository_client = MLRepositoryClient(service_path) ml_repository_client.authorize(username, password) Next, we’ll use the same libraries to save our model to Watson ML. We pass the trained model, our data set, and a name for the model — in this case we’re calling it “House Prices Model”: model_artifact = MLRepositoryArtifact( model, training_data=df, name=model_name ) saved_model = ml_repository_client.models.save(model_artifact) model_id = saved_model.uid The call to save the model returns an object that we store in our saved_model variable from which we extract the unique ID for the model. This is important as it will be used later to create a deployment for the model. We now have a trained machine learning model that we have deployed to Watson ML, but we still don’t have a way to access it. The next few cells will do just that. We are going to create a Deployment for our ML model. To do this, we are going to use the Watson ML Rest API . The Watson ML Rest API uses token-based authentication, so our first step is to generate a token using our Watson ML credentials: headers = urllib3.util.make_headers( basic_auth='{}:{}'.format(username, password) ) url = '{}/v3/identity/token'.format(service_path) response = requests.get(url, headers=headers) ml_token = 'Bearer ' + json.loads(response.text).get('token') Now we can create our deployment. Here we make an HTTP POST to the published_models/deployments endpoint — passing in our Watson ML instance_id and the model_id of our newly saved model. deployment_url = service_path + ""/v3/wml_instances/"" + instance_id + ""/published_models/"" + model_id + ""/deployments/"" deployment_header = { 'Content-Type': 'application/json', 'Authorization': ml_token } deployment_payload = { ""type"": ""online"", ""name"": deployment_name } deployment_response = requests.post( deployment_url, json=deployment_payload, headers=deployment_header ) scoring_url = json.loads(deployment_response.text) .get('entity') .get('scoring_url') print scoring_url The last line above prints the scoring_url parsed from the response received from Watson ML. This is an HTTP endpoint that we can use to make predictions. You now have a deployed machine learning model that you can use to predict house prices from anywhere! You can call it from a front-end application, your middleware, or from a notebook — we’ll do just that next :) TEST THE MODEL For now, we’re going to test our Watson ML deployment from our notebook, but the real value of deploying your ML models to Watson ML is that you can run predictions from anywhere. 
In the notebook I created a new function called get_prediction_from_watson_ml . Just like the last function, this one takes the square footage and the number of bedrooms for the house price you would like to predict. Rather than calling the Spark ML APIs, you can see that this function performs an HTTP POST to the scoring_url we received earlier. def get_prediction_from_watson_ml(square_feet, num_bedrooms): scoring_header = { 'Content-Type': 'application/json', 'Authorization': ml_token } scoring_payload = { 'fields': ['SquareFeet','Bedrooms'], 'values': [[square_feet, num_bedrooms]] } scoring_response = requests.post( scoring_url, json=scoring_payload, headers=scoring_header ) return scoring_response.text Let’s run the same prediction we ran earlier — a house that is 2,400 square feet and has 4 bedrooms: response = get_prediction_from_watson_ml(2400, 4) print response The call to our Watson ML REST API returned our features along with the same prediction we received when we ran our test using Spark ML and the local ML model that we generated. { ""fields"": [""SquareFeet"", ""Bedrooms"", ""features"", ""prediction""], ""values"": [[2400, 4, [2400.0, 4.0], 137499.99999999968]] } NEXT STEPS In this post we built an end-to-end machine learning system using the IBM Data Science Experience, Spark ML, and Watson ML. In just a few lines of code, we imported and visualized a data set, built an ML pipeline and trained an ML model, and made that model available to make predictions from software running anywhere. Although we barely scratched the surface of machine learning, I hope this article gave you a basic understanding of how to build an ML system. In the next post, I will show you how to consume the Watson ML scoring endpoint from an end-user application. In future posts, I will slowly venture deeper into machine learning with working examples for common ML problems: supervised and unsupervised, binary and multiclass classification, clustering, and more. * Machine Learning * Pixiedust * Data Science * Jupyter Notebook * Cognitive Computing One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. 57 Blocked Unblock Follow FollowingMARK WATSON Developer Advocate, IBM Watson Data Platform FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Cloud. * 57 * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","In Part 1, I gave you an overview of machine learning, discussed some of the tools you can use to build end-to-end ML systems, and the path I like to follow when building them. 
In this post we are going to follow this path to train a machine learning model, deploy it to Watson ML, and run predictions against it in real time.",Building Your First Machine Learning System ,Live,250 723,"PODCASTS DATA SCIENCE EXPERT INTERVIEW: DEZ BLANCHFIELD, CRAIG BROWN, DAVID MATHISON, JENNIFER SHIN AND MIKE TAMIR PART 2 November 16, 2016 | 21:02 OVERVIEW The IBM Insight at World of Watson 2016 conference brought together many leading experts in data science, cognitive computing and big data analytics. In this, the second part of a two-part podcast recorded at the conference, IBM data science evangelist James Kobielus interviews five industry thought leaders to gain their insights into the trends facing data professionals: * Dez Blanchfield (The Bloor Group) * Craig Brown (Untapped Potential) * David Mathison (CDO Club) * Jennifer Shin (8 Path Solutions) * Mike Tamir (Intertrust Technologies Corporation) Explore the power that a productivity platform can bring to team data science by learning more about the IBM Watson Data Platform. Listen to part 1 Topics: Analytics , Big Data Technology , Big Data Use Cases , Data Scientists , Hadoop Tags: big data , business analyst , chief data officers , cognitive computing , data analytics , data science , open analytics , predictive analytics
","Take a peek at the future of data science in this discussion with five thought leaders in the data analytics industry, the second installment of a two-part interview recorded at the IBM Insight at World of Watson 2016 conference.","Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike Tamir part 2",Live,251 724,"* Home * Research * Partnerships and Chairs * Staff * Books * Articles * Videos * Presentations * Contact Information * Subscribe to our Newsletter * 中文 * Marketing Analytics * Credit Risk Analytics * Fraud
Analytics * Process Analytics * Human Resource Analytics * Prof. dr. Bart Baesens * Prof. dr. Seppe vanden Broucke * Aimée Backiel * Sandra Mitrović * Klaas Nelissen * María Óskarsdóttir * Michael Reusens * Eugen Stripling * Tine Van Calster * Basic Java Programming * Principles of Database Management * Business Information Systems * Mini Lecture Series * Other Videos WEB PICKS (WEEK OF 4 SEPTEMBER 2017) Posted on September 9, 2017Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings , the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources . * Silicon Valley siphons our data like oil. But the deep drilling has just begun Personal data is to the tech world what oil is to the fossil fuel industry. That’s why companies like Amazon and Facebook plan to dig deeper than we ever imagined. * A Survey of 3,000 Executives Reveals How Businesses Succeed with AI “The next digital frontier is here, and it’s AI.” * Scraping data from the public web may be legal When is it okay to grab data from someone else’s website, without their explicit permission? A new ruling by a federal judge in California might have dramatic implications on this question, and on the open nature of the web in general. * Data Alone Isn’t Ground Truth You should always carry a healthy dose of skepticism in your back pocket. * To Survive in Tough Times, Restaurants Turn to Data-Mining “According to the tech wizards who are determined to jolt the restaurant industry out of its current slump, information culled and crunched from a wide array of sources can identify customers who like to linger, based on data about their dining histories.” * How the GDPR will disrupt Google and Facebook Google and Facebook will be disrupted by the new European data protection rules that are due to apply in May 2018. This note explains how. * Machine Learning for Humans Simple, plain-English explanations accompanied by math, code, and real-world examples. * Why We Need Accountable Algorithms AI and machine learning algorithms are marketed as unbiased, objective tools. They are not * Support Hypothesis In September, Stripe is supporting the development of Hypothesis, an open-source testing library for Python created by David MacIver. Hypothesis is the only project we’ve found that provides effective tooling for testing code for machine learning, a domain in which testing and correctness are notoriously difficult. * Cornea AI aims to predict the popularity of your next photo The Cornea score uses Artificial Intelligence to predict the popularity of your photo. * Logo Rank is an AI system that understands logo design It’s trained on a million+ logo images to give you tips and ideas. It can also be used to see if your designer took inspiration from stock icons. * ggpage Creates Page Layout Visualizations in R * Can CNNs transliterate Pinyin into Chinese characters correctly? This project examines how well neural networks can convert Pinyin, the official romanization system for Chinese, into Chinese characters. * Simulate colorblindness in R figures This new R package provides a variety of functions that are helpful to simulate the effects of colorblindness in R figures. * PyTorch or TensorFlow? 
“This is a guide to the main differences I’ve found between PyTorch and TensorFlow.” * Deep Learning is not the AI future “While Deep Learning had many impressive successes, it is only a small part of Machine Learning, which is a small part of AI. We argue that future AI should explore other ways beyond DL.” ‹ Web Picks (week of 21 August 2017) —Ad—We display ads on this section of the site. -------------------------------------------------------------------------------- Recent Posts * Web Picks (week of 4 September 2017) * Web Picks (week of 21 August 2017) * What discount factor is commonly used in calculating Customer Lifetime Value (CLV)? * Simple Linear Regression? Do It The Bayesian Way * Web Picks (week of 7 August 2017) Archives * September 2017 * August 2017 * July 2017 * June 2017 * May 2017 * April 2017 * March 2017 * February 2017 * January 2017 * December 2016 * November 2016 * October 2016 * September 2016 * August 2016 * July 2016 * June 2016 * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * November 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * * * © DataMiningApps - Data Mining, Data Science and Analytics Research @ LIRIS, KU Leuven KU Leuven, Department of Decision Sciences and Information Management Naamsestraat 69, 3000 Leuven, Belgium DataMiningApps on Twitter , Facebook , YouTube info@dataminingapps.com",Interesting data science links from around the web.,Web Picks (week of 4 September 2017),Live,252 730,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * Armand Ruiz Blocked Unblock Follow Following Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own Oct 3 -------------------------------------------------------------------------------- LIFELONG (MACHINE) LEARNING: HOW AUTOMATION CAN HELP YOUR MODELS GET SMARTER OVER TIME MACHINE LEARNING SHOULD HAPPEN CONSTANTLY Imagine you’re interviewing a new job applicant who graduated top of their class and has a stellar résumé. They know everything there is to know about the job, and has the skills that your business needs. There’s just one catch: from the moment they join your team, they’ve vowed never to learn anything new again. You probably wouldn’t make that hire, because you know that life long learning is vital if someone is going to add long-term value to your team. Yet when we turn to the field of machine learning, we see companies making a similar mistake all the time. Data scientists work hard to develop, train and test new machine learning models and neural networks. However, once the models get deployed, they don’t learn anything new. After a few weeks or months, become static and stale, and their usefulness as a predictive tool deteriorates. WHY MODELS STOP LEARNING Data scientists are well aware of this problem, and would love to find a way to enable their models to participate in the equivalent of lifelong learning. However, moving a model into production is typically a tough task, and deployment requires help from busy IT specialists. When a single deployment can take weeks, it’s no wonder that most data scientists prefer to hand over their latest model and move onto the next project, rather than persist with the drudgery of continually retraining and redeploying their existing models. Deployment isn’t just painful for data scientists — it can be a headache for IT teams too. 
Data scientists might have used any one of a wide variety of languages, frameworks and tools to build their models, and there is no guarantee that those choices will make the model easy to integrate into production systems. In a worst-case scenario, the model may need to be substantially refactored or even rebuilt from scratch before it can be deployed. As a result, if data scientists ask for their models to be redeployed too frequently, they may be met with significant resistance from the IT department. STREAMLINING DEPLOYMENT TO KEEP MODELS IN TRAINING The good news is that model deployment isn’t inherently labor-intensive. Just as in other forms of software development, the principles of DevOps apply here. With the right platform, it is possible to create seamless continuous deployment pipelines that automate many aspects of the process, transforming deployment from weeks of manual effort to a matter of a few mouse-clicks. For example, with IBM® Watson® Machine Learning integrated in IBM Data Science Experience, data scientists can develop models using a wide range of languages (including Python, R and Scala) and frameworks (such as SparkML, Scikit-Learn, xgboost and SPSS). The solution will abstract the models into a standardized API that can be integrated easily with production systems. This gives data scientists the flexibility they need to choose best-of-breed tools and techniques during development, without increasing the complexity of deployment for the IT team. Watson Machine Learning aims to combine other elements of IBM Watson Data Platform to provide a continuous feedback loop. When your model is ready to move into production, you can specify how frequently you would like to retrain it, and automate the redeployment process. You can also monitor and validate the results of the retrained model to ensure that the new version is an improvement — and with integrated version control, you can easily roll back to the previous release if necessary. GIVING DATA SCIENTISTS MORE POWER These capabilities help to reduce the need for IT teams to act as intermediaries in the deployment process, eliminating the biggest bottleneck for continuous improvement of machine learning models. They also place more power in the hands of data scientists, empowering them to focus on building and maintaining the most accurate models possible, instead of being forced to sacrifice quality for practicality. Most importantly, solutions like Watson Machine Learning give your models the chance to do what they were always meant to do: learn. By continuously retraining your models against the latest data, you can ensure that they continue to reflect today’s business realities, giving your organization the insight it needs to make smarter decisions and seize competitive advantage. Get started with Data Science Experience for free -------------------------------------------------------------------------------- Originally published at www.ibm.com on October 3, 2017.
Blocked Unblock Follow FollowingARMAND RUIZ Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Imagine you’re interviewing a new job applicant who graduated top of their class and has a stellar résumé. They know everything there is to know about the job, and has the skills that your business…",Lifelong (machine) learning: how automation can help your models get smarter over time,Live,253 739,"APPLE, IBM ADD MACHINE LEARNING TO PARTNERSHIP WITH WATSON-CORE ML COUPLING Ron Miller 15 hoursApple and IBM may seem like an odd couple , but the two companies have been working closely together for several years now. That has involved IBM sharing its enterprise expertise with Apple and Apple sharing its design sense with IBM. The companies have actually built hundreds of enterprise apps running on iOS devices. Today, they took that friendship a step further when they announced they were providing a way to combine IBM Watson machine learning with Apple Core ML to make the business apps running on Apple devices all the more intelligent. The way it works is that a customer builds a machine learning model using Watson, taking advantage of data in an enterprise repository to train the model. For instance, a company may want to help field service techs point their iPhone camera at a machine and identify the make and model to order the correct parts. You could potentially train a model to recognize all the different machines using Watson’s image recognition capability. The next step is to convert that model into Core ML and include it in your custom app. Apple introduced Core ML at the Worldwide Developers Conference last June as a way to make it easy for developers to move machine learning models from popular model building tools like TensorFlow, Caffe or IBM Watson to apps running on iOS devices. After creating the model, you run it through the Core ML converter tools and insert it in your Apple app. The agreement with IBM makes it easier to do this using IBM Watson as the model building part of the equation. This allows the two partners to make the apps created under the partnership even smarter with machine learning. “Apple developers need a way to quickly and easily build these apps and leverage the cloud where it’s delivered. [The partnership] lets developers take advantage of the Core ML integration,” Mahmoud Naghshineh, general manager for IBM Partnerships and Alliances explained. To make it even easier, IBM also announced a cloud console to simplify the connection between the Watson model building process and inserting that model in the application running on the Apple device. Over time, the app can share data back with Watson and improve the machine learning algorithm running on the edge device in a classic device-cloud partnership. “That’s the beauty of this combination. As you run the application, it’s real time and you don’t need to be connected to Watson, but as you classify different parts [on the device], that data gets collected and when you’re connected to Watson on a lower [bandwidth] interaction basis, you can feed it back to train your machine learning model and make it even better,” Naghshineh said. 
The point of the partnership has always been to use data and analytics to build new business processes, by taking existing approaches and reengineering them for a touch screen. “This adds a level of machine learning to that original goal moving it forward to take advantage of the latest tech. “We are taking this to the next level through machine learning. We are very much on that path and bringing improved accelerated capabilities and providing better insight to [give users] a much greater experience,” Naghshineh said.",Apple and IBM announce they were providing a way to combine IBM Watson machine learning with Apple Core ML to make the business apps running on Apple devices all the more intelligent.,"Apple, IBM add machine learning to partnership with Watson-Core ML coupling",Live,254 742,"REDIS PUBSUB, NODE, AND SOCKET.IO Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Nov 10, 2016Sockets are the high power pipeline of the realtime web and in this article we'll show how a minimal amount of code can bring database data to life in a web browser. With the rise of bots and the chat based tools such as Slack and Messenger, users today have come to expect much more immediate interactions from their applications. One of the tools that most front end developers should have in their toolbox today is socket based communication. With a socket based solution it is easy to deliver realtime updating like leaderboards, stock quotes, tweets or any other streaming style of data to both mobile and web applications. Here we will look at using just such a set of tools with NodeJS and Socket.io on both the server and in the browser. And we will complement them with a Redis PubSub implementation to model interacting with backend services and Smoothie.js to finish off the front end with a visualization. We'll use tweets as an example but it is easy to substitute any kind of realtime data you may have available. NODEJS AND SOCKET.IO We need three things on the server side. First something to serve a web page since that in essence is our front end application. ExpressJS works just fine: var express = require('express'); var app = express(); var http = require('http').Server(app); app.use('/', express.static('www')); http.listen(8000, function(){ console.log('listening on *:8000'); }); Above we setup Node as an HTTP server to deliver our web page (application) which is just some static assets in the www directory. Second we create our server's socket infrastructure with Socket.io: var io = require('socket.io')(http); Seriously, that's it for setting it up. We haven't sent any messages yet, nor received any, but the infrastructure is now in place. And what is most interesting is that this will work over most current infrastructure because it starts with long-polling and then upgrades the connection to an actual websocket. So, you can use the socket model even without sockets currently. See engine-io for details. Third, we'll include a Redis subscription and wire up broadcasting an actual Socket.io message: var redis = require('redis'); var url = config.get('redis.url'); var client1 = redis.createClient(url); var client2 = redis.createClient(url); client1.on('message', function(chan, msg) { client2.hgetall(msg, function(err, res) { res.key = msg; io.sockets.emit(res); }); }); client1.subscribe('yourChannelName'); We use two Redis connections. 
client1 handles the PubSub subscription while client2 actually gets the hash for the key that came through the subscription (it is be possible to remove the second connection and push all of the data through the PubSub channel too). Then with io.sockets.emit(res); we broadcast all of the data to any connected clients. We've left out the Redis publish side of above but it really isn't any more complicated than reversing what we've shown: client.publish(""yourChannelName"", msg); As you can see the simplicity of this highlights how effective Node is as an event based networking tool. Next we'll move on to the client side which listens for the broadcast. WEB BROWSER AND SOCKET.IO As you might have guessed the browser side of Socket.io is pretty easy too. So, with the assumption that an html page has been delivered to your browser via Express and your Node server then the following sock.on() will be called every time a broadcast message is emitted from your server. The beauty here is that the Socket.io library defaults to contacting the same server which delivered the page. That little bit of script is perfect for handing a continuous stream of events off to a realtime charting tool. While there are lots of JavaScript charting libraries, one of the easiest for this style of data is Smoothie.js . To use it set up a tag in the body of an html page and then you can attach the chart and stream the data to it. The JavaScript to wire all of the charting up, attach it to the canvas, and stream follows: function createGraphOnPageLoad() { var sock = io(); var smoothie = new SmoothieChart(); smoothie.streamTo(document.getElementById('twits')); var redLine = new TimeSeries(); var blueLine = new TimeSeries(); smoothie.addTimeSeries(redLine,{ strokeStyle:'rgb(255, 0, 0)', lineWidth:3 } ); smoothie.addTimeSeries(blueLine, { strokeStyle:'rgb(0, 0, 255)', lineWidth:3 }); sock.on('twits', function(msg) { var at = new Date().getTime(); var reach = msg.reach * 1; if(msg.category == ""Red"") { redLine.append(at, reach); } else { blueLine.append(at, reach); } }); } The above function should be called after the page loads which ensures that the canvas element is already created. It creates the socket, creates the chart and wires it to the canvas. Then it creates two timeSeries. The messages actually represent tweets and the reach is the number of people who receive the tweet. The Redis PubSub actually transports both red and blue categorized tweets. The timelines represent how many followers could see the tweet on the y-axis and time on the x-axis. Red and blue are categories of twitter searches for comparison. It adds them to the chart at which point it waits for the events which are actually tweets and then it appends them. On each append the chart is updated with the reach metric and inserted at the current time. The web page output looks like this: While it is a simple charting solution it does a good job of showing the value of the full chain of soft realtime data via Node, Redis, and Socket.io. To view the code example on github go here . -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by William Iven Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Hays Hutton writes code and then writes about it. Love this article? Head over to Hays Hutton’s author page and keep reading. 
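As a closing addendum, here is the publishing side's contract with the subscriber above: write the hash fields the subscriber expects (category and reach) at some key, then publish that key on the channel so connected subscribers know to fetch it. The article's own stack is Node, where the publish itself is the one-liner shown earlier; the sketch below expresses the same flow in Go with the go-redis client purely as an illustration. The key name tweet:1234, the field values, and the v6-era method signatures (no context arguments) are my assumptions, not part of the original example.

package main

import (
	"log"

	"github.com/go-redis/redis" // assumption: go-redis v6-style API without context arguments
)

func main() {
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Store the full record as a hash; the subscriber fetches it later with HGETALL.
	key := "tweet:1234" // hypothetical key name
	if err := client.HSet(key, "category", "Red").Err(); err != nil {
		log.Fatal(err)
	}
	if err := client.HSet(key, "reach", 42).Err(); err != nil {
		log.Fatal(err)
	}

	// Publish only the key on the channel; subscribers receive the key and read the hash themselves.
	if err := client.Publish("yourChannelName", key).Err(); err != nil {
		log.Fatal(err)
	}
}

Keeping the published message down to a key keeps the pub/sub traffic small and means a subscriber always reads the latest state of the hash rather than whatever happened to be in the message.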
",Sockets are the high power pipeline of the realtime web and in this article we'll show how a minimal amount of code can bring database data to life in a web browser.,"Redis PubSub, Node, and Socket.io",Live,255 743,"XML2 1.0.0 July 5, 2016 in Packages We are pleased to announce that xml2 1.0.0 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library, and makes it easy to work with XML and HTML files in R. Install the latest version with: install.packages(""xml2"") There are three major improvements in 1.0.0: 1. You can now modify and create XML documents. 2. xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes. 3. Improved namespace handling when working with XPath. There are many other small improvements and bug fixes: please see the release notes for a complete list. MODIFICATION AND CREATION xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text(). The basic process of creating an XML document by hand looks something like this: root <- xml_new_document() %>% xml_add_child(""root"") root %>% xml_add_child(""a1"", x = ""1"", y = ""2"") %>% xml_add_child(""b"") %>% xml_add_child(""c"") %>% invisible() root %>% xml_add_child(""a2"") %>% xml_add_sibling(""a3"") %>% invisible() cat(as.character(root)) #> <?xml version=""1.0""?> #> <root><a1 x=""1"" y=""2""><b><c/></b></a1><a2/><a3/></root> For a complete description of creation and mutation, please see vignette(""modification"", package = ""xml2""). XML_FIND_FIRST() xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object. This makes it much easier to work with ragged/inconsistent hierarchies:
This makes it much easier to work with ragged/inconsistent hierarchies: x1 <- read_xml("" See Sea "") c <- x1 %>% xml_find_all("".//b"") %>% xml_find_first("".//c"") c #> {xml_nodeset (3)} #> [1] #> [2] See #> [3] Sea Missing nodes are replaced by missing values in functions that return vectors: xml_name(c) #> [1] NA ""c"" ""c"" xml_text(c) #> [1] NA ""See"" ""Sea"" XPATH AND NAMESPACES XPath is challenging to use if your document contains any namespaces: x <- read_xml(' ') x %>% xml_find_all("".//baz"") #> {xml_nodeset (0)} To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*() : x %>% xml_ns() #> d1 <-> http://foo.com #> d2 <-> http://bar.com x %>% xml_find_all("".//d1:baz"") #> {xml_nodeset (1)} #> [1] If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip() : xml_ns_strip(x) x %>% xml_find_all("".//baz"") #> {xml_nodeset (2)} #> [1] #> [2] SHARE THIS: * Reddit * More * * Email * Facebook * * Print * Twitter * * LIKE THIS: Like Loading...RELATED SEARCH LINKS * Contact Us * Development @ Github * RStudio Support * RStudio Website * R-bloggers CATEGORIES * Featured * News * Packages * R Markdown * RStudio IDE * Shiny * shinyapps.io * Training * Uncategorized ARCHIVES * July 2016 * June 2016 * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * April 2015 * March 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * July 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * December 2013 * November 2013 * October 2013 * September 2013 * June 2013 * April 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * September 2012 * August 2012 * June 2012 * May 2012 * January 2012 * October 2011 * June 2011 * April 2011 * February 2011 EMAIL SUBSCRIPTION Enter your email address to subscribe to this blog and receive notifications of new posts by email. Join 19,744 other followers RStudio is an affiliated project of the Foundation for Open Access Statistics 1 COMMENT July 10, 2016 at 2:04 am Petites choses pour les vacances | Polit’bistro : des politiques, du café […] Pour les nerds qui aiment la programmation statistique et le Web, il y a le XML, et pour le XML avec R, il y a xml2, désormais en version 1.0.0. […] « Join us at rstudio::conf 2017! httr 1.2.0 »Blog at WordPress.com. The Tarski Theme . Subscribe to feed. FollowFOLLOW “RSTUDIO BLOG” Get every new post delivered to your Inbox. Join 19,744 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:","We are pleased to announced that xml2 1.0.0 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library, and makes it easy to work with XML and HTML files in R. Install t…",xml2 1.0.0,Live,256 751,"After visiting a trade show and seeing a succession of dull stands with only leaflets to hand out, Chris Snow and I came up with the idea of building a Cloudant cluster of Raspberry Pis to put at IBM Cloudant’s booth. 
Cloudant’s NoSQL Database-as-a-Service clusters are hidden away in the depths of data centres around the world belonging to SoftLayer, Rackspace, Microsoft and Amazon, so there is little tangible product to display at a conference stand; no software to install, no drivers required, no sql! A working Cloudant cluster at the booth allows distributed databases to be seen in action with flashing lights indicating per-node activity.I started by building the Developer Preview of CouchDB 2.0 on a single Raspberry Pi running Debian Wheezy to make sure it was feasible. It wasn’t as simple as typing “sudo apt-get install couchdb” - that only gets you CouchDB 1.2. I needed CouchDB 2.0, which includes the multi-node clustering technology that Cloudant developed and has been donated back to Apache CouchDB’s open-source community. The process for installing CouchDB 2.0 is a bit more involved and involves building the project from source after installing its dependencies. The single-node worked and so plans were made to build a 12-node cluster. Here is my original sketch:The idea was to have 12 Pis arranged in ring to mimic the logical arrangement of the full-size servers in a real production Cloudant custer. Each machine was to be connected via wifi to router at the rear and a load balancer (a 13th Pi) would direct traffic around the cluster.* The hardware was ordered and boxes were unpacked. The blank SD cards were burned with fresh operating system images* LEDs were hand-soldered to resistors and GPIO connectors.Then came the tricky bit: building and installing the software on 12 devices. The automation tool I chose to help with this was Ansible which allows the scripting of tasks in YML ‘playbooks’ which can be executed in parallel via SSH on multiple host machines. The playbooks I created are published in two Github repositories:* https://github.com/glynnbird/ansible-cluster-tools - a grab-bag of scripts I used to configure the cluster, install services and customise the installation.A key feature of the project was to make each machine’s LED flash whenever that node was performing an action. To do this I created a Node.js script called ‘flasher’ which pulses an LED on and off whenever a line of text arrives on stdin. This allows output from log files to be piped to ‘flasher’ very simply e.g.tail -f node1.log | flasher > /dev/null &This brings me to head-scratching problem that caused me an hour or two of head scratching. It turns out thattail -f node1.log | grep 'FLASH'happily produces output, but not when its output is piped to another process. i.e.# this works - each line appearing in node1.log containing ‘FLASH’# appears on stdouttail -f node1.log | grep 'FLASH'# this doesn’t work - the LED doesn’t flash - the ‘flasher’ script# doesn’t see any input!tail -f node1.log | grep 'FLASH' | flasherWhy not? You have to do:tail -f node1.log | grep --line-buffered 'FLASH' | flasher > /dev/null &otherwise nothing happens. Silly me.I had to patch CouchDB’s “Fabric” Erlang code to ensure that log messages were created containing the word ‘FLASH’ whenever a node was asked to store or retrieve data e.g.all_docs(DbName, Options, #mrargs{keys=undefined} = Args0) ->couch_log:notice(""FLASH all_docs"", []),The effect of this is that when a document is added into the distributed database, three machines’ LEDs flash simultaneously, indicating the three nodes dealing with the shard that the data resides in. 
By sharding the data, Cloudant can store more data than could be held on one machine and divides read, write and indexing load into smaller chunks.When all of the database nodes were configured and a load-balancer running HAproxy was built, the cluster was up and running, shown here testing the flashing of LEDs:After that, the devices were sent away to be turned into something worthy of displaying at a conference booth:So if you see Cloudant represented at a developer conference near you, stop by and say hello and I’ll show you how it all works. See how the cluster shares the workload around the cluster, how it keeps multiple copies of the same data and how it can survive node failures automatically. The cluster’s data can be replicated to other instances of CouchDB, to live Cloudant accounts or to mobile devices running PouchDB or Cloudant Sync for iOS or Android.Buy SD cards with a pre-installed operating system image. Burning your own is very slow.Use “class 10” SD cards. It doesn’t make the Raspberry Pis any faster, but it does make dealing with images on your Mac/PC a good deal quicker.Automate everything. Ansible was invaluable for coordinating actions across all the nodes in parallel.Use a tagged release of CouchDB - if you build the “master” branch, then you will also get unstable “master” versions of its dependencies.Use the new Raspberry Pi 2 model - they are much quicker and cost the same as the older models.","Cloudant’s NoSQL Database-as-a-Service clusters are hidden away in the depths of data centres around the world belonging to SoftLayer, Rackspace, Microsoft and Amazon, so there is little tangible product to display at a conference stand; no software to install, no drivers required, no sql! A working Cloudant cluster at the booth allows distributed databases to be seen in action with flashing lights indicating per-node activity.",Building a Cloudant cluster of Raspberry Pis,Live,257 757,"Homepage Follow Sign in / Sign up 47 3 Oliver Cameron Blocked Unblock Follow Following I lead the self-driving car team at @udacity. Previously founder of a @ycombinator startup. yesterday 2 min read -------------------------------------------------------------------------------- OPEN SOURCING 223GB OF DRIVING DATA COLLECTED IN MOUNTAIN VIEW, CA BY OUR LINCOLN MKZ Data available on GitHubA necessity in building an open source self-driving car is data. Lots and lots of data. We recently open sourced 40GB of driving data to assist the participants of the Udacity Self-Driving Car Challenge #2 , but now we’re going much bigger with a 183GB release. This data is free for anyone to use, anywhere in the world. WHAT’S INCLUDED 223GB of image frames and log data from 70 minutes of driving in Mountain View on two separate days, with one day being sunny, and the other overcast. Here is a sample of the log included in the dataset. Note: Along with an image frame from our cameras, we also include latitude, longitude, gear, brake, throttle, steering angles and speed . Mountain View, CATo download both datasets, please head to our GitHub repo . -------------------------------------------------------------------------------- We can’t wait to see what you do with the data! Please share examples with us in our self-driving car Slack community , participate in Challenge #2 , or send a Tweet to @olivercameron . Enjoy! Self Driving Cars Autonomous Vehicles Open Source Machine Learning Big Data 47 3 Blocked Unblock Follow FollowingOLIVER CAMERON I lead the self-driving car team at @udacity . 
Previously founder of a @ycombinator startup. FollowUDACITY INC Be in Demand × Don’t miss Oliver Cameron’s next story Blocked Unblock Follow Following Oliver Cameron","A necessity in building an open source self-driving car is data. Lots and lots of data. We recently open sourced 40GB of driving data to assist the participants of the Udacity Self-Driving Car Challenge #2, but now we’re going much bigger with a 183GB release. This data is free for anyone to use, anywhere in the world.",Open Sourcing 223GB of Driving Data – Udacity Inc,Live,258 759,"Compose The Compose logo Articles Sign in Free 30-day trialETCD 2 TO 3: NEW APIS AND NEW POSSIBILITIES Published May 11, 2017 etcd etcd 2 to 3: new APIs and new possibilitiesThe change from version 2 to 3 of the distributed etcd database also sees massive changes in how the database works. To help you understand the what and why of the changes, read on... At Compose our engineering teams have been getting deep into etcd version 3.x, the follow-up to etcd 2.x that is currently deployable on Compose. Etcd has become an essential tool behind the scenes of many cloud computing projects and products as it offers a simple, reliable, consistent, key-value database that can be used as the source of truth for huge clusters of cloud-deployed applications and their configuration. A jump in major numbers always means that a lot of things change in any product, usually in response to the requirements of customers and users of the preceding version. In etcd 3.x, this is doubly so as fundamental concepts have been reworked to suit the demands of scale and efficiency and that means there's a new learning curve. FROM HTTP TO GRPC Let's start with a change that touches every point of the system; how applications communicate with etcd. The etcd 2.x system's API was built on JSON communicated HTTP endpoints. This was very accessible; all you needed was curl or similar and you could work with it. This is what is now called the etcd API version 2. It worked for the original scale of etcd but the developers were looking to handling ""tens of thousands of clients and millions of keys in a single cluster"". For that, they have moved over to gRPC which is built on top of Protocol Buffers . It's inspired by HTTP/REST but runs over HTTP/2 , uses static routes only rather than ones with parameters embedded in them and sends back API-centric results rather than HTTP status codes. It also builds in support for full-duplex streaming for long running connections. This is the etcd API version 3. An etcd 2.x server only understands the version 2 API. An etcd 3.x server can understand both version 2 and version 3 APIs but, and it's a huge but, anything you create with clients using one API version will be invisible to clients using the other API version. That's because around the back end, each API routes to a separate data store - they are so different that they are isolated from each other inside the server. ALL CHANGE IN ETCDCTL That split goes all the way up to the command line often your first port of call when working with etcd. Etcdctl , the command-line tool for etcd, is one binary but it now behaves like one of two programs depending on the ETCDCTL_API environment variable. Set it to 2, and it behaves like the etcdctl application from etcdv2 using HTTP/JSON communications and the familiar set of commands. Set it to 3 and pretty much every command is different as the applications works in terms of the newer API. 
To give you an idea, here's a screenshot of both versions of the command side by side. From this point on, when we say etcd2, we're referring to the API version 2 and etcd3 refers to the API version 3. GOODBYE HIERARCHY, HELLO FLAT KEYSPACE One of the interesting attributes of keys in etcd2 is the ability to also hold directories of more keys with values or more directories. This lets you create hierarchical file-system like structures for holding your data, like ""/clusters/node00/activity/xyz"". You could perform various operations with reference to this hierarchy too, so etcd2 allowed clients to wait for activity on a key or a directory (or any of its children) so, for example, you could monitor ""/clusters/node00"" for changes. Well, that's all gone. There's now a simple flat namespace for keys. The switch to flat namespaces makes things much easier to manage in terms of consistency and efficiency in clustered systems which is why most people want something like etcd in the first place. You can create a key that's ""/clusters/node00/activity/xyz"" but it's handled as a single string. There's no directories implied or created. That said, you can create your own hierarchy through how you name things and etcd3 is there with a prefix option to let you match anything that starts with a particular key value. So you can emulate directory structures; for example, given that key above, we could just look for changes for anything in ""node00"" with this command: ETCDCTL_API=3 etcdctl watch --prefix ""/cluster/node00/"" And get a similar effect. Prefixes mitigate the loss of directory structures in etcd3 for the more predictable flat namespace. If you are making extensive use of directory structures in etcd2, this is going to be the first thing you want to allow for in your migration to etcd3. COMPARE AND SWAP OUT, TRANSACTIONS IN In etcd2, much is made of the atomicity of particular options, such as compare-and-swap to ensure that no two clients interfere with each other and leave the data inconsistent. The problem with atomic actions is, though, as things get more complex more data needs to be consistently modified and an atomic action is by definition, limited in scope to protecting the action. Etcd3 still has atomic operations, but they are now joined by the more interesting transactions. These aren't transactions in the traditional ""giant lock"" sense, but a compact guarded ""if ... then ... else"" operation. Here's a small sample of Go code and the clientv3 library using a transaction: tx := cli.Txn(context.TODO()) txresp, err := tx.If( clientv3.Compare(clientv3.Value(""foo""), ""="", ""bar""), ).Then( clientv3.OpPut(""foo"", ""sanfoo""), clientv3.OpPut(""newfoo"", ""newbar""), ).Else( clientv3.OpPut(""foo"", ""bar""), clientv3.OpDelete(""newfoo""), ).Commit() In the If() section, a comparison is defined (checking key foo to see if it's equal to bar ). You can have multiple comparison operators here; the If is true if all the comparisons are true. If that is true, the operations in the Then() section are run. If not, the Else() sections operations are run. You can do multiple operations and all the changes will be handled as a single index increment in etcd's database. It's quite a powerful primitive and it's what you'll use to replace the Compare-and-swap and Compare-and-delete operations in etcd2 code. TTLS EXPIRED, LEASES OBTAINED The change with TTLs in etcd3 sees the per key TTLs of etcd2 turn into a more general Lease. Leases can be created and have keys attached to them. 
The Lease itself has a time to live and when that expires all the keys attached to the Lease get expired. You can keep the Lease alive with a KeepAlive request or make it go away with a Revoke request. What this gives you, practically, is much better-synchronized behavior. A server could create a set of property values with all the keys to those values under one Lease. If it is the server's responsibility to send KeepAlive requests to the Lease, when it stops doing that then all the related properties neatly disappear. Working with it is simple enough too: // Get a lease lease, err := cli.Grant(context.TODO(), 10) // Attach a key to it _, err = cli.Put(context.TODO(), ""foo"", ""bar"", clientv3.WithLease(lease.ID)) ... // Prod it to keep alive once... _, err = cli.KeepAliveOnce(context.TODO(), lease.ID) // Sleep time.Sleep(time.Second*5) // Read the time to live status, err = cli.TimeToLive(context.TODO(), lease.ID) fmt.Printf(""Status: %v\n"", status.TTL) WATCHING RATHER THAN WAITING Watching in etcd2 meant waiting for changes; opening an HTTP connection for each key you wanted to watch and waiting for it to return changes. For etcd3, and in keeping with getting everything to scale better, the way you watch is now handled by watcher RPCs. Create a watcher RPC and request watches on keys or ranges of keys from it and it'll return a stream of changes to those keys. You can ask for previous revisions too, back to when the server last compacted its data, and play back from there. In the Go client for etcd3, the Watcher RPC is managed for you and all you need to do is request a Watch which returns you a Go channel down which the changes arrive. That looks something like this: rch := cli.Watch(context.Background(), ""foo"", clientv3.WithPrefix()) go func(chn clientv3.WatchChan) { for wresp := range chn { for _, ev := range wresp.Events { fmt.Printf(""%s %q : %q\n"", ev.Type, ev.Kv.Key, ev.Kv.Value) } } }(rch) This snippet launches a goroutine which prints out incoming change events. I'm using the prefix option which was mentioned earlier. This uses the key value as the prefix we want to match with so I get changes for ""foo"", ""foo2"", ""foonicular"", ""foo/bar/ftang/ftang"" and whatever other keys start with ""foo"". PREVIOUS VALUES OR NOT Many etcd2 operations could return the previous value associated with a key so you could see what you'd deleted or what you'd replaced. By default, etcd3 doesn't do this. There is a WithPrevKV() option you can add to operations, but don't assume it'll always return anything. To optimize etcdv3, the server compacts the data regularly and if the compacted data isn't available, there's nothing for WithPrevKV() to return. If you can, stop relying on this behavior. If you can't though, an option is to create a transaction which reads the current value and returns it before changing it. It's fiddly, but it'll be atomic and reliable. SO ETCD3? Given all these changes, it is pragmatically worth considering etcd 3.x and the etcd's version 3 API as a new database in terms of developing your client and creating your ops workflows. It is built for efficient scaling up of workloads though and avoids the dangers of simple operations in complex environments with its use of leases and watchers. There's no simple migration path for applications and, currently, there are not as many client drivers for various languages as there are for etcd2. That said, gRPC is widely available and you can consider developing your own driver. 
If you want an enterprise-scaled, consistent, observable source of truth, then etcd 3.x and the etcd version 3 API are the way to go. We've only skimmed over the changes here and not touched any of the new features that have appeared; we'll have more on that when it gets closer to etcd3 being made available on Compose. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution HypnoArt Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan ’s author page and keep reading.RELATED ARTICLES May 5, 2017NEWSBITS - ELASTICSEARCH, REDIS, MONGODB, ETCD, GCC, GO, HOMEBREW AND MORE NewBits for the week ending 5th May - Elasticsearch goes to 5.4, Redis history revealed, MongoDB and etcd updates, GCC is 30… Dj Walker-Morgan Apr 28, 2017NEWSBITS - MYSQL, ELASTICSEARCH, MONGODB, ETCD, COCKROACHDB, SQL SERVER, CRICKET AND JUICE NewBits for the week ending 28th April - MySQL 8.0.1's preview demos better replication, Elasticsearch, MongoDB and etcd get… Dj Walker-Morgan Feb 17, 2017NEWSBITS: REDIS, ETCD AND ELASTICSEARCH UPDATES, GO 1.8, GITHUB GUIDES AND CHATOPS AND MORE NewsBits for the week ending 17th February - Redis gets a critical update, etcd's latest release, Elasticsearch gets a bump,… Dj Walker-Morgan Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company",The change from version 2 to 3 of the distributed etcd database also sees massive changes in how the database works. Let's understand the what and why of the changes.,etcd 2 to 3: new APIs and new possibilities,Live,259 765,"KDNUGGETS Data Mining, Analytics, Big Data, and Data Science Subscribe to KDnuggets News | Follow | Contact * SOFTWARE * NEWS * Top stories * Opinions * Tutorials * JOBS * Academic * Companies * Courses * Datasets * EDUCATION * Certificates * Meetings * Webinars KDnuggets Home » News » 2016 » Oct » Tutorials, Overviews » MLDB: The Machine Learning Database ( 16:n37 )LATEST NEWS, STORIES * MLDB: The Machine Learning Database Top 10 Data Science Videos on Youtube Data Science + Criminal Justice Deep Learning meets Deep Deployment Equifax: Strategic Data Performance Analyst More News & Stories | Top Stories MLDB: THE MACHINE LEARNING DATABASE Previous post Tweet Tags: Classification , Database , Machine Learning , TensorFlow , Transfer Learning -------------------------------------------------------------------------------- MLDB is an open­source database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs. By François Maillet, MLDB.ai . In this post, we’ll show how easy it is to use MLDB to build your own real­time image classification service. We will use different brand of cars in this example, but you can adapt what we show to train a model on any image dataset you want. 
We will be using a TensorFlow deep convolutional neural network, transfer learning, and everything will run off MLDB. TRANSFER LEARNING WITH THE INCEPTION MODEL At a high level, transfer learning allows us to take a model that was trained on one task and use its learned knowledge on another task. We use the Inception-­v3 model , a deep convolutional neural network, that was trained on the ImageNet Large Visual Recognition Challenge dataset. The task of that challenge was to classify images into a varied set of 1000 classes, like badger, freight car or cheeseburger . The Inception model was openly released as a trained TensorFlow graph. TensorFlow is a deep learning library that Google open­-sourced last year, and MLDB has a built-­in integration for it. As you’ll see, MLDB makes it extremely simple to run TensorFlow models directly in SQL. When solving any machine learning problem, one critical step is picking and designing feature extractors. They are used to take the thing we want to classify, be it an image, a song or a news article, and transform it into a numerical representation, called a feature vector, that can be given to a classifier. Traditionally, the selection of feature extractors was done by hand. One of the really exciting things about deep neural networks is that they can learn feature extractors themselves. Below is the architecture of the Inception model, where images go in from the left and predictions come out to the right. The very last layer will be of size 1000 and give a probability for each of the classes. However, the layers that come before are transformations over the raw image learned by the network because they were the most useful to solve the image classification task. Some layers are for example edge detectors. So the idea will be to run images through the network, but instead of getting the output of the last layer, that is specialised to the ImageNet task, getting the second to last, which will give us a conceptual numerical representation of the images. We can then use that representation as features that we can give to a new classifier that we will train on our own task. So you can think of the Inception model as a way to get from an image to a feature vector over which a new classifier can efficiently operate. We are leveraging hundreds of hours of GPU compute-­time that went into training the Inception model, but applying it to a completely new task. INCEPTION ON MLDB Let’s get started! The code below uses our pymldb library. You can read more about it on the MLDB Documentation . What did we do here? We made a simple PUT call using pymldb to create the ​ inception function, of type tensorflow.graph . It is parameterized using a JSON blob. The function loads a trained instance of the Inception model (note that MLDB can transparently load remote resources, as well as files inside of compressed archives; more on this here ). We specify that the input to the model will be the remote resource located at url , and the output will be the ​ pool_3 layer of the model, which is the second to last layer. Using the pool_3 layer will give us high level features, while the last layer called softmax is the one that is specialized to the ImageNet task. Now that the ​ inception function is created, it is available in SQL and as a REST endpoint. We can then run an image through the network with a simple SQL query. Here we’ll run Inception on the KDNuggets logo, and what we’ll get is the numerical representation of that image. 
Those 2048 numbers are what we can use as our feature vector: PREPARING A TRAINING DATASET WITH SQL Now we can import our data for training. We have a CSV file containing about 200 links to car images from 3 popular brands: Audi, BMW and Tesla. It’s important to remember that although we are using a car dataset, you could replace it with your own images of anything you want. We can import the CSV file in a dataset by running an ​ import.text procedure : We can generate some quick stats with SQL: We can now use a procedure of type transform to apply the ​ Inception model over all images and store the results in another dataset. A transform procedure simply executes an SQL query and saves the result in a new dataset. Running the code below is essentially doing feature extraction over our image dataset. TRAINING A SPECIALIZED MODEL Now that we have features for all of our images, we use a procedure of type classifier.experiment to train and test a random forest classifier. The dataset will be split 50/50 between train and test by default. Notice the contents of the ​ inputData key, that specifies what data to use for training and testing, is SQL. The {* EXCLUDING(label)} is a good example of MLDB’s row expression syntax that is meant to work with sparse datasets with millions of columns. Looking at the performance on the test set, this model is doing a pretty good job: DOING REAL­TIME PREDICTIONS Now that we have a trained model, how do we use it to score new images? There are two things we need to do for this: extract the features from the image and then run that in our newly trained classifier. This is essentially our scoring pipeline. What we do is create a function called brand_predictor of type sql.expression . This allows us to persist an SQL expression as a function that we can then call many times. When we trained our classifier above, the training procedure created a car_brand_cls_scorer_0 automatically, available in the usual SQL/Rest, that will run the model. It will be expecting an input column named ​ features . And just like that we’re now ready to score new images off the internet: { ""output"": { ""scores"": [ [ ""\""audi\"""", [ -8, ""2016-05-05T04:18:03Z"" ] ], [ ""\""bmw\"""", [ -7.333333492279053, ""2016-05-05T04:18:03Z"" ] ], [ ""\""tesla\"""", [ 0.2666666805744171, ""2016-05-05T04:18:03Z"" ] ] ] } } The image we gave it represented a Tesla, and that is the label that got the highest score. CONCLUSION The Machine Learning Database solves machine learning problems end­-to-­end, from data collection to production deployment, and offers world­-class performance yielding potentially dramatic increases in ROI when compared to other machine learning platforms. In this post, we only scratched the surface of what you can do with MLDB. We have a white-­paper that goes over all of our design decisions in details. If we’ve peaked your interest, here are a few links that may interest you: * try MLDB for free in 5 minutes by launching a hosted instance run a trial version of MLDB on your own hardware using Docker or Virtualbox Check out our demos and tutorials , especially DeepTeach which uses the same techniques as shown in this post, and MLPaint , a white­box real­time handwritten digit recogniser MLDB Github repository Don’t hesitate to get in touch! You can find us on Gitter , or follow us on Twitter . All the code from this article is available in the MLDB repository as a Jupyter notebook , and is also shipped with MLDB. 
Boot up an instance, go the the demos folder and you can run a live version. Happy MLDBing! Bio: François Maillet is a computer scientist specialising in machine learning and data science. He leads the machine learning team at MLDB.ai, a Montréal startup building the Machine Learning Database ​ (MLDB). François has been applying machine learning for almost 10 years to solve varied problems, like real­-time bidding algorithms and behavioral modelling for the adtech industry, automatic bully detection on web forums, audio similarity and fingerprinting, steerable music recommendation and playlist generation. Related: * Recycling Deep Learning Models with Transfer Learning Spark for Scale: Machine Learning for Big Data The Deception of Supervised Learning -------------------------------------------------------------------------------- Previous post -------------------------------------------------------------------------------- TOP STORIES PAST 30 DAYS Most Popular 1. The 10 Algorithms Machine Learning Engineers Need to Know 21 Must-Know Data Science Interview Questions and Answers How to Become a Data Scientist - Part 1 7 Steps to Mastering Machine Learning With Python Top Algorithms and Methods Used by Data Scientists 9 Key Deep Learning Papers, Explained 7 Steps to Mastering Apache Spark 2.0 Most Shared 1. Top Algorithms and Methods Used by Data Scientists Data Science for Internet of Things (IoT) : Ten Differences From Traditional Data Science 7 Steps to Mastering Apache Spark 2.0 Battle of the Data Science Venn Diagrams Top Data Scientist Claudia Perlich on Biggest Issues in Data Science Data Science Basics: Data Mining vs. Statistics Automated Data Science & Machine Learning: An Interview with the Auto-sklearn Team MORE RECENT STORIES * Equifax: Senior Statistical Modeler Equifax: Senior Director, Search-Match & Data-Linking Rexer Analytics Data Science Survey Highlights Equifax: Metadata Expert Artificial Intelligence, Deep Learning, and Neural Networks, E... Strata Hadoop 2016: Fast Data and Robots NYU Stern – Master of Science in Business Analytics K2 Data Science Bootcamp Data Preparation Tips, Tricks, and Tools: An Interview with th... EDISON Data Science Framework to define the Data Science Profe... Novel Tensor Mining Tool to Enable Automated Modeling Equifax: Employee Analytics Leader Equifax: Data Visualization Engineer Equifax: Data Strategy Leader How to Get Stuff Done at a Data Startup Apache: Big Data Europe (Nov. 14-16) – Leading Event for... The R Graph Gallery Data Visualization Collection Zaireo: Data Scientist Top tweets, Oct 05-11: Most Active #DataScientists on #Gith... Top 12 Interesting Careers to Explore in Big Data KDnuggets Home » News » 2016 » Oct » Tutorials, Overviews » MLDB: The Machine Learning Database ( 16:n37 ) © 2016 KDnuggets. About KDnuggets Subscribe to KDnuggets News | Follow @kdnuggets | | X","MLDB is an open­source database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.",The Machine Learning Database,Live,260 766,This video shows you how to replicate one of the sample databases on cloudant.com to your Cloudant account. Sign up for a Cloudant account here: https://cloudant.com/sign-up/. Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center,This video shows you how to replicate one of the sample databases on cloudant.com to your Cloudant account. 
,Replicate a Sample Database,Live,261 767,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCHECK OUT IBM’S “NEW BUILDERS” PODCASTMike Broberg / April 28, 2016We here on the CDS developer advocacy team sure like to code. But we also liketo talk. Whether we’re presenting at conferences, leading sessions athackathons, or recording demos of our apps — it’s a rare moment we’re notspreading the good word.Now, we have a new outlet for our motor mouths: The New Builders Podcast !Episodes at https://developer.ibm.com/tv/builders/The New Builders is a weekly podcast featuring developers from around the web,talking about the new languages, libraries, and infrastructure they’re using tobuild their apps. It’s a mix of perspectives from engineers outside and insideof IBM. Here’s a recap of the first three episodes: * The first episode features our own Bradley Holt in a roundtable discussion on web/mobile development, where he advocates for offline first design. On the other side of the conversation is Greg Avola , the CTO for social beer app Untappd . They talk about progressive web apps, HTML5, Ionic, PouchDB, and more. While Untappd doesn’t persist data locally for offline access, the app makes heavy use of cross-platform development with Apache Cordova™. * The second episode features leaders from private messaging app Cyber Dust : CEO & Co-Founder Ryan Ozonian and Lead Engineer Rohit Kotian . They talk about about scaling their stack, which for their core messaging platform is a lot of Java and GridGain. Their users’ messages are held in-memory and never persisted anywhere. Take comfort in the ephemeralness. * The third episode is a discussion with our own David Taieb and IBM Lead Data Scientist Jorge Castañón . Jorge and David built an analytics app for predicting flight delays at airports. But they faced a big challenge in connecting to on-premises data sources and moving data to the cloud, where it could be more efficiently analyzed. Listen in for their approach to data movement and machine learning in Apache Spark™.When I first started working with Cloudant in 2012, I learned the most fromtalking directly to engineers and customers. Often the output was mediainterviews or Q&A articles I’d post to our blog. My marketing colleagues Doug Flora and Jim Young , who are producing this podcast, are taking a similar approach, but doing waybetter than I ever did. I’ve really enjoyed the podcast so far. The New Buildersis worth a listen.The first episode is embedded here below. Enjoy!SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","The New Builders podcast features weekly interviews with developers from around the web, discussing code, infrastructure, and their overall stack.",IBM's New Builders podcast,Live,262 768,"* R Views * About this Blog * Contributors * Some Resources * * R Views * About this Blog * Contributors * Some Resources * DECEMBER ’16 RSTUDIO TIPS AND TRICKS by Sean Lopp Here is this month’s collection of RStudio Tips and Tricks. Thank you to those who responded to last month’s post ; many of your tips are included below! Be sure to subscribe to @rstudiotips on Twitter for more. This month’s tips fall into two categories: Keyboard Shortcuts and Easier R Markdown KEYBOARD SHORTCUTS The RStudio IDE is built upon “hooks”. Hooks are actions that the IDE can take. For instance, there is a hook to create a new file. Most users interact with hooks with point-and-click interactions. ( RStudio toolbar -> new file or File -> New File ). But, there is an alternative! All of these hooks have been surfaced to end users and can be bound to a keyboard shortcut. (Some of these actions are “secret” – they aren’t exposed through point-and-click options.) CUSTOM KEYBOARD SHORTCUTS To view the complete list of actions, the current keybindings, and to customize keybindings, go to: Tools -> Modify Keyboard Shortcuts . CODE CHUNK NAVIGATION Define shortcuts for code chunk navigation using the previous tip. For example, Alt+Cmd+Down for Next Chunk and Alt+Cmd+Up for Previous Chunk. ASSIGNMENT OPERATOR Use Alt+- (press Alt at the same time as pressing - ). This adds the assignment operator and spacing. PIPE OPERATOR Use Cmd+Shift+m (for Mac) or Ctrl+Shift+m (for Windows). This adds the pipe operator %>% and spacing. EASIER R MARKDOWN R MARKDOWN OPTIONS R Markdown output formats include arguments specified in the YAML header. Don’t worry about remembering all of the key-value pairs; in RStudio, you can access and change the most common through a user-interface: SPELL CHECKER Use the built-in spell checker when writing a R Markdown document. (Code chunks are automatically ignored.) SQL CODE CHUNKS Execute SQL queries against database connections directly in R Markdown chunks. R MARKDOWN WEBSITES Are you building a website with R Markdown ? Any RStudio project with an R Markdown website will include a Build Website option in the build pane. What’s your favorite RStudio Tip? seanlopp 2016-12-08T17:53:20+00:00 250 Northern Ave, Boston, MA 02210 844-448-1212 info@rstudio.com DMCA Trademark Support ECCN * Switch tabs w/o muscle cramps: New RStudio Desktop 1.0.136 switches w/ Ctrl+Tab. Lots of tabs? Ctrl+Shift+. to select tab by name! #rstats 6 days ago Copyright 2016 RStudio | All Rights Reserved | Legal Terms Twitter Linkedin Facebook Rss Email github Rss",Here is this month’s collection of RStudio Tips and Tricks. Thank you to those who responded to last month’s post; many of your tips are included below! 
Be sure to subscribe to @rstudiotips on Twitter for more.This month’s tips fall into two categories: Keyboard Shortcuts and Easier R MarkdownKeyboard ShortcutsThe RStudio,December '16 RStudio Tips and Tricks,Live,263 771,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectINTRODUCING SPARK-CLOUDANT, AN OPEN SOURCE SPARK CONNECTOR FOR CLOUDANT DATAmikebreslin / March 9, 2016We would like to introduce you to the spark-cloudant connector, allowing you touse Spark to conduct advanced analytics on your Cloudant data. Thespark-cloudant connector can be found on GitHub or the Spark Packages site and is available for all to use under the Apache 2.0 License . As with most things Spark, it’s available for Python and Scala applications.If you haven’t heard of Apache Spark™, it is the new cool kid on the block inthe analytics space. Spark is touted as being an order of magnitude faster andmuch easier to use than its analytic predecessors, and its popularity hasskyrocketed in the past couple of years. If you would like to learn more aboutSpark in general, I recommend checking out the Spark Fundamentals classes on Big Data University and the great tutorials on IBM developerWorks .Flexible JSON database plus in-memory analytics, ftw!START FAST WITH SPARK ON BLUEMIXSo how do you get going quickly in analyzing your Cloudant data in Spark?Luckily, IBM has a fully-managed Spark-aaS offering in IBM Bluemix that has the latest version of the spark-cloudant connectoralready loaded for you. Head on over to the Bluemix catalog to sign-up and create a Spark instance to get started. Since the spark-cloudantconnector is open source, you are also free to use it in your own stand-aloneSpark deployments with Cloudant or Apache CouchDB™. Next, check out the README on GitHub, the Bluemix docs on Spark-aaS , and the great video tutorials on the Learning Center showing how to use the connector in both a Scala and Python notebook.The integration with Spark opens the door to a number of new analytical usecases for Cloudant data. You can load whole databases into a Spark cluster foranalysis. Alternatively you can read from a Cloudant secondary index (a.k.a.“MapReduce view”) to pull a filtered subset or cleansed version of your CloudantJSON. Once you have the data in Spark, use SparkSQL for full adhoc queryingcapabilities in familiar SQL syntax. Spark can efficiently transform or filteryour data and write it back into Cloudant or another data source. Because Sparkhas a variety of connection capabilities, you can also use it to conductfederated analytics over disparate data sources such as Cloudant, dashDB andObject Storage.EXAMPLE: CLOUDANT ANALYTICS WITH SPARKTo provide another example of using the spark-cloudant connector, check out this example Python Notebook on GitHub and load it into your Spark service running on Bluemix. (It becomesinteractive once you upload it to a Spark notebook using the instructionsbelow.) This notebook does the following: * Loads a Cloudant database spark_sales from Cloudant’s examples account containing documents with sales rep, month, and amount fields.(Feel free to replicate the https://examples.cloudant.com/spark_sales database into your own Cloudant account and update the connection details if you prefer.) * Detects and prints the schema found in the JSON documents. * Counts the number of documents in the database. 
* Prints out a subset of the data and shows how to print out a specific field in the data. * Uses SparkSQL to perform counts, sums, and order by value queries on the data. * Prints a graph of the monthly sales. * Filters the data based on a specific sales rep and month. * Counts and shows the filtered data. * Saves the filtered data as documents into a Cloudant database in your own account.(You need to create the database in your Cloudant account and enter credentials for your account in the notebook before this final step will work.) Notes for new Bluemix users: 1. After provisioning the IBM Analytics for Apache Spark service, click on its service tile in the Bluemix dashboard and open the UI to manage Spark instances. 2. Create a new instance (if needed) and a new notebook within that instance. 3. On the Create Notebook page, choose “From URL” and use the URL for the raw IPython notebook data, which should look like https://raw.githubusercontent.com/cloudant-labs/spark-cloudant/master/examples/ipython/python_Cloudant2.ipynb 4. Run the code block-by-block using the triangular play button in the menu bar, but be sure to read the code comments before running block 10 and modify the snippet accordingly.We hope you find the Spark integration a powerful tool to conduct analytics onyour Cloudant data. If you have any feedback or encounter an issue with thespark-cloudant connector, please open an issue in GitHub.--------------------------------------------------------------------------------© “Apache”, “CouchDB”, “Spark”, “Apache CouchDB”, “Apache Spark”, and the Sparklogo are trademarks or registered trademarks of The Apache Software Foundation.All other brands and trademarks are the property of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Meet the new spark-cloudant connector, for adding powerful analytics to your Cloudant JSON. I also include a simple example that shows how to use SparkSQL to order Cloudant data by value.","Introducing spark-cloudant, an open source Spark connector for Cloudant data",Live,264 772,"Enterprise Pricing Articles Sign in Free 30-Day TrialREDIS CONFIGURATION CONTROLS - NEW AT COMPOSE Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jun 2, 2016At Compose, we're all about giving you control of your databases where we can and Redis users on Compose are about to get a whole lot more control. It's a story about iterating design. 
The team at Compose who work on Redis looked at their recently introduced Slow Query Logs feature and decided they could make it better. In the process they created the new Redis Configuration Controls. The Redis Configuration Controls allows experienced users to change a selection of Redis settings so they can tune their deployments to behave exactly as they want them. These aren't for new users to modify without thoroughly researching the consequences and taking care in the process; please consult the linked documentation for each setting before changing it. The design of our user interface follows the actual redis.conf configuration file, turning it into an interactive form. This means it'll be easier to take application recommended configurations and apply them to your own Redis deployment. Under the hood, we handle these new settings by automatically applying them to both nodes of your high availability deployment. There's no need to wait for synchronisation at the Redis level to ensure they are applied and saved. Over the coming months, we plan on refining the experience of Configuration Controls. USING THE CONFIGURATION CONTROLS We'll briefly introduce each setting in this article, but for full details of them, you'll find links to the Redis documentation for many of them by clicking on the blue circled question mark next to its name. As we mentioned, the sections and fields of the interface are modelled on the redis.conf configuration file and that means, where there isn't a link to the documentation, you can find more information there. Now, onwards to the Configuration Controls themselves. You'll find the them in the Compose console for Redis under the Settings tab. At the top of the Settings view, as before, are the version and upgrade controls, then the ""Redis as a Cache"" control and then, below that are the new Configuration Controls. They open with the warning we've just given, that these are expert Redis settings with a link to the example Redis configuration file. Then we're into the various settings groups and settings. Any changes made in these settings will only be applied when the Apply Configuration Changes button at the bottom is pressed. This will put any changes you have made to the configuration into practice on each of the servers. NETWORK The first group of settings concern the Network . This contains timeout and tcp-keepalive . TIMEOUT When it's set to 0, the timeout setting's default, idle client connections stay open until they are closed by the client. You may want to ensure idle clients are ejected after some number of seconds and setting this to a non-zero value will set that number of seconds. TCP-KEEPALIVE While some parts of the network will also step in to disconnect idle connections, use of a keepalive will send TCP ACKs at regular intervals to keep the connection open. That interval can be set here. Setting it to 0, which is the default, disables this feature. SECURITY This section is slightly different because its requirepass setting is set outside the Configuration Controls. REQUIREPASS The Redis authentication credential is a simple password and this is where it can be set. Clicking on Change will send you to the Overview page where that credential can safely be changed. Be aware that any other settings you may have made in the Configuration Controls will be discarded when you click Change . LIMITS MAXMEMORY-POLICY This setting replaces the old ""Change Maxmemory Policy"" control by letting you directly set the policy. 
The Redis documentation on eviction policies covers what the available settings - no eviction, LRU, volatile LRU, random, volatile random and volatile TTL - do. If you continue on reading that page you'll see there's a setting you can use to fine tune some of those policies which is... MAXMEMORY-SAMPLES This setting lets the user control how the sampling-LRU mechanism works in Redis by setting the number of samples used. It defaults to 5. LUA SCRIPTING LUA-TIME-LIMIT There's only one setting in Lua Scripting and it sets the lua-time-limit . That's the number of milliseconds that a Lua script can run before being kicked into touch by Redis for taking too long. It's a safety feature to stop the system being hogged by badly written loopy scripts. Important fact: This doesn't kill the script, it logs it and tells other clients the system is busy while waiting to be told to kill the script. The default is five seconds which is enormous when you consider a script is supposed to run in a millisecond. SLOW LOGS The Slow Query Log feature is where we began. It uses two configuration settings, slowlog-log-slower-than and slowlog-max-len . A brief reminder – read more in our slow log introduction . SLOWLOG-MAX-LEN The slow log is actually a queue of slow log events and you can control the size of that queue with slowlog-max-len . The bigger you make it, the more memory you will consume. Ideally, it should be big enough for you to catch your problematic slow commands, but not so big that it becomes an issue itself. The default is 128 and we recommend you run with that till you are certain you need to expand it. SLOWLOG-LOG-SLOWER-THAN The other way to capture that tricky slow event would be to filter out all the slow, but not that slow, log events . That's where slowlog-log-slower-than comes in. It sets the threshold on what qualifies as a slow event. It defaults to 10000 microseconds. The slow query log viewer has moved to the main tab bar of the console as part of the switch to the Configuration Control to make it more accessible and it retains it's own Settings dialog so you can quickly and safely adjust just the two values that matter. EVENT NOTIFICATION NOTIFY-KEYSPACE-EVENTS Another setting that was previously available is the Event Notification 's notify-keyspace-events . This setting gives you the ability to plug into the changes going inside the database. The feature is called ""Keyspace Notifications"" and you can read about it in the Redis documentation . The short version of that is you set a configuration variable, notify-keyspace-events , to a string which represents what events you want to hear about. Setting it to the string ""KEA"" says you want a stream of all events. You can listen to that stream by connecting to Redis and issuing a psubscribe command looking for messages with a key pattern of __key*__:* . WRAPPING UP That's it for the current settings available in the Redis Configuration Controls. Keep an eye on Compose Notes for updates on new settings being made available. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. 
Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Deployments AWS DigitalOcean SoftLayer© 2016 Compose",The team at Compose who work on Redis looked at their recently introduced Slow Query Logs feature and decided they could make it better. In the process they created the new Redis Configuration Controls.,Redis Configuration Controls,Live,265 774,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * IBM Data Science Experience Blocked Unblock Follow Following May 4 -------------------------------------------------------------------------------- DEVELOPING IBM STREAMS APPLICATIONS WITH THE PYTHON API (VERSION 1.6) The IBM Data Science Experience (DSX) platform now integrates Streaming Analytics services using version 1.6 of the Python Application API, which enables application development and monitoring entirely in Python. The currently supported Python version is Python 3.5. Python developers can use the streamsx package to: * Create IBM Streams applications in DSX Jupyter notebooks. * Create apps that are run in the Streaming Analytics service. * Access data streams from views defined in any app that is running on the service. Furthermore, Python developers can now monitor submitted jobs with the Python REST API. This is particularly interesting for developers who want to retrieve and visualize streaming data in Jupyter notebooks , for example, for debugging or extra logging. To develop streaming applications with Python 3.5 in DSX Jupyter notebooks, you can use the STREAMING_ANALYTICS_SERVICE context to submit a Python application to the IBM Streaming Analytics service. Sample DSX Jupyter notebooks for Python applications that process streams are available on the community page of DSX: * Hello World! : Create a simple Hello World! application to get started and deploy this application to the Streaming Analytics service. * Healthcare Demo : Create an application that ingests and analyzes streaming data from a feed, and then visualizes the data in the notebook. You finally submit this application to the Streaming Analytics service. * Neural Net Demo : Create a sample data set, create a model for the sample data, use that model in a streaming application, visualize the streaming data, and finally submit the streaming application to the Streaming Analytics service. EXAMPLE: THE NEURAL NET NOTEBOOK To illustrate the workflow of building a streaming application in DSX, we can walk through the Neural Net demo listed above. The workflow is comprised of three essential steps: 1. Use the Python API to compose the streaming application. 2. Submit the application to be run in a Streaming Analytics service. 3. Retrieve data back into the notebook for visualization. The purpose of the Neural Net notebook is to demonstrate how a data scientist can train a model on a set of data, and then immediately incorporate that model into a Streaming Application. 
CREATING A SAMPLE DATA SET First, we create a sample data set comparing the temperature of an engine to the probability that it will fail within the next hour: xvalues = np.linspace(20,100, 100) yvalues = np.array([((np.cos((x-50)/100)*100 + np.sin(x/100)*100 + np.random.normal(0, 13, 1)[0])/150.0 for x in xvalues]) yvalues = [y - np.amin(yvalues) for y in yvalues] create_plot(xvalues, yvalues, title=""Engine Temp Vs. Probability of Failure"", xlabel = ""Probability of Failure"", ylabel = ""Engine Temp in Degrees Celcius"", xlim = (20,100), ylim = (0,1)) For brevity, several imports and function definitions were removed, however the full code is shown in the notebook itself . TRAINING A MODEL Given the data set we created, we use the PyBrain library to train a Feed Forward Neural Network (FFN) as a model to predict failure probabilities given a temperature. # The neural net to be trained net = buildNetwork(1,100,100,100,1, bias = True, hiddenclass = SigmoidLayer, outclass = LinearLayer) # Construct a data set of the training data ds = SupervisedDataSet(1, 1) for x, y in zip(xvalues, yvalues): ds.addSample((x,), (y,)) # The training harness. Used to train the model. trainer = BackpropTrainer(net, ds, learningrate = 0.0001, momentum=0, verbose = False, batchlearning=False) # Train the model. for i in range(50): trainer.train() # Display the model in the plot. fig, ax = create_plot(xvalues, yvalues, title=""Engine Temp Vs. Probability of Failure"", xlabel = ""Probability of Failure"", ylabel = ""Engine Temp in Degrees Celcius"", xlim = (20,100), ylim = (0,1)) ax.plot(xvalues, [net.activate([x]) for x in xvalues], linewidth = 2, color = 'blue', label = 'NN output') The fully trained model, net , is a simple Python object, which, when provided with a temperature value, produces a probability reading. Above, we can see the output of the model (in blue) plotted against the data set. USING THE MODEL IN A STREAMING APPLICATION It isn’t enough to simply have the net model in the DSX notebook, we might want to send it into production to predict failures in real time. To insert the model into a real-time streaming application with the streamsx.topology Python API, you must use classes that create and manipulate streaming data. The following two classes represent such creation and manipulation of data, and are necessary components of the streaming application. The periodicSource class submits a random number between 20 and 100 every 0.1 seconds, and is used to simulate sample temperature readings. The NeuralNetModel class simply takes a data item, feeds it as input to the neural net, and returns the output onto a stream. # The source of our data. Every 0.1 seconds, a number between 20-100 will be inserted into the stream # INPUT: None # OUTPUT: A float with range [20,100] class PeriodicSource(object): def __call__(self): while True: time.sleep(0.1) yield random.uniform(20,100) # A class which runs the neural net on data it is passed. # INPUT: the input to the neural net, in this case a floating point number # OUTPUT: an array containing the output of the neural net, as well as the input to the neural net. class NeuralNetModel(object): def __init__(self, net): self.net = net def __call__(self, num): return [num, self.net.activate([num])[0]] BUILDING THE STREAMING APPLICATION The Application uses the periodicSource class to generate a stream temperature readings, which are then processed by an instance of the NeuralNetModel class to create a stream of probability readings. 
Since we are interested in viewing these probability readings, we allow the stream to be viewable with the view() method. # Define operator periodic_src = periodicSource() nnm = NeuralNetModel(net) # Build Graph top = topology.Topology(""myTop"") stream = top.source(periodic_src) # Run the temp readings through the neural net and mark the # output as viewable. view = stream.transform(nnm).view() Now that we have defined the application, we submit it to be run on a Streaming Analytics service on Bluemix using a call to submit . vs={'streaming-analytics': [{'name': service_name, 'credentials': json.loads (credentials)}]} cfg = {context.ConfigParams.VCAP_SERVICES : vs, context.ConfigParams.SERVICE_NAME : service_name} job = context.submit(context.ContextTypes.STREAMING_ANALYTICS_SERVICE, top, config=cfg) You’ll notice that the credentials and service_name values are used to define a cfg object used for authentication. Both of these can be obtained from the Streaming Analytics service management page on Bluemix. VIEWING STREAMING DATA Once the call to submit has completed successfully, the application is running. We can view its output in DSX using the view object that was created earlier. fig, ax = create_plot([], [], title=""Engine Temp Vs. Probability of Failure"", xlabel = ""Probability of Failure"", ylabel = ""Engine Temp in Degrees Celcius"", xlim = (20,100), ylim = (0,1)) xdata = [] ydata = [] try: queue = view.start_data_fetch() for line in iter(queue.get, None): xdata.append(line[0]) ydata.append(float(line[1])) ax.lines[0].set_xdata(xdata) ax.lines[0].set_ydata(ydata) fig.canvas.draw() except: raise finally: view.stop_data_fetch() Each dot in the above graph represents a live temperature reading used to predict likelihood of failure. Every time a new temperature reading is sent through the model, its output is reflected in the graph. CLOSING THE LOOP ON DSX Data visualization is becoming an increasingly important part of data science. After creating a model, a data scientist needs immediate visual feedback on its effectiveness both in and out of a production environment. Whether with static or real-time data, DSX is a tool that helps developers achieve this. -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on May 4, 2017 by William Marshall. * Machine Learning One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingIBM DATA SCIENCE EXPERIENCE FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","The IBM Data Science Experience (DSX) platform now integrates Streaming Analytics services using version 1.6 of the Python Application API, which enables application development and monitoring…",Developing IBM Streams applications with the Python API (Version 1.6),Live,266 780,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectSENTIMENT ANALYSIS OF REDDIT AMASChetna Warade / March 10, 2016Reddit recently announced a coffee-table book, Ask Me Anything: Volume One . 
It’s a collection of their favorite Ask Me Anything (AMA) web events, in which anyone can get online with luminaries like BillGates, Madonna, Chris Rock, Elon Musk, or President Obama and ask them anyquestion that comes to mind.While this is a gorgeous book, it’s missing one key element that makes AMA’s sovaluable and rich: actionable data. When an AMA is online, you can access andanalyze the text to glean insights from the discussion. The possibilities forinteresting analyses are endless. For instance, check out this interactive graph that measures how people use language on reddit. Search for a term to seetrends.The book organizes AMAs in categories like Inspiring , Informative , Provocative , Fascinating , Beautiful , Courageous , Humorous , and Ingenious . Which category would you land in? We wondered the same thing about ourselves.In the spirit of eating our own dogfood (in every sense), we’ll explore thisquestion using an AMA hosted by IBM developers and our home-grown analysistools. Watson Tone Analyzer helps you understand how you’re coming across toothers, so it’s perfect for this job.Here’s how we built our own reddit AMA sentiment analysis solution (and you cantoo). In this tutorial, we: 1. Take an IBM-hosted AMA . 2. Load its data with our handy Simple Data Pipe , which leverages Bluemix (IBM’s Cloud platform service) and runs Node.js to move JSON data from reddit (or another source), enriches the data with Watson Tone Analyzer, and lands results in Cloudant. 3. Run commands in an iPython notebook to analyze the Cloudant JSON output, using Apache Spark to analyze the Watson Tone Analyzer-enriched data to gauge positive or negative emotions measured across multiple tone dimensions, like anger, joy, openness, and more.The Spark-Cloudant Connector is the special sauce that makes this solution work.It lets you connect your Apache Spark instance to a Cloudant NoSQL database andanalyze the data.DEPLOY SIMPLE DATA PIPEThe fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too.If you would rather deploy manually , or have any issues, refer to the readme .When deployment is done, click the EDIT CODE button.INSTALL REDDIT CONNECTORSince we’re importing data from reddit, you need to establish a connectionbetween reddit and Simple Data Pipe.Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry . 1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button. 2. Click the package.json file to open it. 3. Edit the package.json file to add the following line to the dependencies list: ""simple-data-pipe-connector-reddit"": ""^0.1.2"" Tip: be sure to end the line above with a comma and follow proper JSON syntax. 4. From the menu, choose File Save . 5. Press the Deploy app button and wait for the app to deploy again. ADD SERVICES IN BLUEMIXTo work its magic, the reddit connector needs help from a couple of additionalservices. In Bluemix, we’re going analyze our data using the Apache Spark andWatson Tone Analyzer services. So add them now by following these steps:PROVISION IBM ANALYTICS FOR APACHE SPARK SERVICE 1. Login to Bluemix (or sign up for a free trial) . 2. On your Bluemix dashboard, click Work with Data . Click New Service . Find and click Apache Spark then click Choose Apache Spark Click Create .PROVISION WATSON TONE ANALYZER SERVICE 1. In Bluemix, go to the top menu, and click Catalog . 2. 
In the Search box, type Tone Analyzer , then click the Tone Analyzer tile. 3. Under app , click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. 4. In Service name enter only tone analyzer (delete any extra characters) 5. Click Create . 6. If you’re prompted to restage your app, do so by clicking Restage .LOAD THE REDDIT AMA DATA 1. Launch simple data pipe in one of the following ways: * If you just restaged, click the URL for your simple data pipe app. * Or, in Bluemix, go to the top menu and click Dashboard , then on your Simple Data Pipe app tile, click the Open URL button. 2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe . 3. Click the Type dropdown list, and choose Reddit AMA .When you added a reddit connector earlier, you added the Reddit option you’re choosing now. 4. In Name , enter ibmama . 5. If you want, enter a Description . 6. Click Save and continue . 7. Enter the URL for the AMA. We’ll use the sample IBM-hosted AMA we mentioned earlier: https://www.reddit.com/r/IAmA/comments/3ilzey/were_a_bunch_of_developers_from_ibm_ask_us 8. Click Connect to AMA . You see a You’re connected confirmation message. 9. Click Save and continue . 10. On the Filter Data screen, make the following 2 choices: * under Comments to Load , select Top comments only . * under Output format , choose JSON flattened . Then click Save and continue . Why flattened JSON? Flat JSON format is much easier for Apache Spark to process, so for this tutorial, the flattened option is the best choice. If you decide to use the Simple Data Pipe to process reddit data with something other than Spark, you probably want to choose JSON to get the output in its purest form. 11. Click Skip , to bypass scheduling. 12. Click Run now . When the data’s done loading, you see a Pipe Run complete! message. 13. Click View details . Tip: You can review the processed reddit comments in Cloudant along with theenriched Tone Analyzer metadata by clicking the run’s Details link and then clicking the Top comments only link. If prompted, enter your Cloudant password.ANALYZE AMA DATACREATE NEW PYTHON NOTEBOOK 1. In Bluemix, open your Apache Spark service. Go to your dasbhoard and, under Services , click the Apache Spark tile and click Open . 2. Open an existing instance or create a new one. 3. Click New Notebook . 4. Click the From URL tab. 5. Enter any name, and under Notebook URL enter https://github.com/ibm-cds-labs/reddit-sentiment-analysis/raw/master/notebook/Reddit-AMA-python.ipynb 6. Click Create Notebook 7. Copy and enter your Cloudant credentials.In a new browser tab or window, open your bluemix dashboard and click your Cloudant service to open it. From the menu on the left, click Service Credentials . If prompted, click Add Credentials . Copy your Cloudant host , username , and password into the corresponding places in cell 3 of the notebook (replacing XXXX’s). 8. Still in cell 3, at the end of the line, specify which cloudant database to load by making sure the following string includes name of the pipe you just created, ibmama .reddit_ibmama_top_comments_only Edit this string to include the name you gave your pipe in the preceding section. The naming convention here is reddit_PIPENAME_top_comments_only 9. Leave this notebook open. We’ll run this code in a minute.ABOUT THE SPARK-CLOUDANT CONNECTORBefore we run commands in the notebook, let’s peek under the hood. 
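As a rough sketch of what the notebook's connection cell boils down to (the host and credentials below are placeholders, the database name follows the reddit_PIPENAME_top_comments_only convention from the previous section, and it assumes the Spark-Cloudant connector package is available on the cluster), loading the pipe's output into a Spark DataFrame looks roughly like this:

# Illustrative sketch, not the notebook verbatim: read the pipe's Cloudant
# database into a Spark DataFrame with the Spark-Cloudant connector.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()   # in a Bluemix notebook, sc is usually predefined
sqlContext = SQLContext(sc)

df = (sqlContext.read.format('com.cloudant.spark')
      .option('cloudant.host', 'ACCOUNT.cloudant.com')   # placeholder
      .option('cloudant.username', 'USERNAME')           # placeholder
      .option('cloudant.password', 'PASSWORD')           # placeholder
      .load('reddit_ibmama_top_comments_only'))

df.printSchema()   # the same check you run in cell 4 of the notebook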
We use the Spark-Cloudant Connector , which lets you connect your Apache Spark instance to a Cloudant NoSQL DBinstance and analyze the data. This is a great way to leverage Spark’slightning-fast processing power directly on your Cloudant JSON data.RUN THE CODE AND GENERATE REPORTSNew to notebooks? If you’ve never used a Python notebook before, here’s how you run commands. Youmust run cells in order from top to bottom. To run a cell, click it (a boxappears around it) and in the menu above the notebook, click the Run button. While the command processes, an * asterisk appears (for a moment ora few minutes) in place of the number. When the asterisk disappears, and thenumber returns, processing is done, and you may move on to the next cell.Now you can run the code in each notebook cell. Here’s what you’re doing as yourun each command: 1. Run cells 1 and 2 to connect to a SparkContext.A SparkContext is the connection to a Spark cluster. It’s how you create RDDs and other items on that cluster. 2. Connect to your Cloudant database. Run cell 3 (which you just customized, adding your database credentials) to connect to Cloudant, where the AMA data resides. 3. Create the dataframe and get it in tabular format. In cell 4, run df.printSchema() then in cell 5, run df.show() . 4. Prep the dataframes for SQL commands. In cell 6, run df.registerTempTable(""reddit""); 5. Now start analyzing this data. Watson Tone Analyzer captures tones in the text, gauging: * emotions like Joy, Disgust, Anger, Fear, and Sadness * social traits like Agreeableness, Openness, Conscientiousness, Extraversion, and Emotional Range * language styles like Analytical, Tentative, and Confident First, run the following code to compute the distribution of tweets by sentiment scores greater than 70%. sentimentDistribution=[0] * 13 for i, sentiment in enumerate(df.columns[-23:13]): sentimentDistribution[i]=sqlContext.sql(""SELECT count(*) as sentCount FROM reddit where cast("" + sentiment + "" as String) > 70.0"")\ .collect()[0].sentCount 6. With the data stored in sentimentDistribution array, run the following code that plots the data as a bar chart. %matplotlib inline import matplotlib import numpy as np import matplotlib.pyplot as plt ind=np.arange(13) width = 0.35 bar = plt.bar(ind, sentimentDistribution, width, color='g', label = ""distributions"") params = plt.gcf() plSize = params.get_size_inches() params.set_size_inches( (plSize[0]*3.5, plSize[1]*2) ) plt.ylabel('Reddit comment count') plt.xlabel('Emotion Tone') plt.title('Histogram of comments by sentiments > 70% in IBM Reddit AMA') plt.xticks(ind+width, df.columns[-23:13]) plt.legend() plt.show() This bar chart shows the number of comments that scored above 70% for each tone. 7. In the last cell, run the following code to group by tone values:comments=[] for i, sentiment in enumerate(df.columns[-23:13]): commentset = df.filter(""cast("" + sentiment + "" as String) > 70.0"") comments.append(commentset.map(lambda p: p.author + ""\n\n"" + p.text).collect()) print ""\n--------------------------------------------------------------------------------------------"" print sentiment print ""--------------------------------------------------------------------------------------------\n"" for comment in comments[i]: print ""[-] "" + comment +""\n"" REVIEW RESULTSScroll through the resulting list. You’ll see comments grouped by tone. 
Remember that these are comments that scored greater than 70% for each value. Comments that scored high for Confident and Conscientiousness are listed and grouped under those tones. Some comments appear under multiple headings, because they scored high for more than one. For example, the following comment appears under the language style Analytical and also under the social trait Emotional Range (sensitivity to environment, moodiness):

How do you keep convincing people to pay for Lotus notes as an email solution?

Watson Tone Analyzer documentation says: “Tone analysis is less about analyzing how someone else feels, and more about analyzing how you are coming across to others.” So, how did IBMers come across within this AMA? Comments from IBMers take up most of the Agreeableness (tendency to be compassionate and cooperative toward others) section. They live there beside some “agreeable” questions from outsiders that come with a wink, like:

Is your favorite TV show Halt and Catch Fire? I really want it to be...

That comment also scored high under Extraversion and Emotional Range, maybe for its enthusiasm. No comments from IBMers appear under Emotional Range. These guys are a bunch of cool cats, perhaps, or just polite and friendly AMA hosts.

Note: No comments scored over 70% on emotions like Joy, Anger, Fear, Disgust, and Sadness. This conversation just didn’t get that heated. Try running another reddit AMA discussion through these same steps to see how results differ.

So, when reddit includes this IBM AMA in their next book, which category will they apply? Comments from non-IBMers may land this AMA in the Provocative or Humorous group. IBMers alone? Courageous, of course. ;-) Or perhaps Informative, which would put us in good company. Meanwhile, we’ll keep working hard and aspire to Ingenious.

OTHER OPTIONS

Now you know how to tweak the Simple Data Pipe to load data from a source you want, like reddit. Once you do so, the Cloudant-Spark Connector makes it easy to perform analysis on your Cloudant JSON. In this example, we used an iPython notebook to help us leverage Watson Tone Analyzer, but you can use the analysis tool of your choice. When you ran Simple Data Pipe, the reddit AMA landed in Cloudant. From there, it’s a breeze to send data on into dashDB. The dashDB data warehouse is also a great place to run analytics. Stay tuned for my next post, which will show you how to take reddit data, load it into dashDB, and analyze it with R. (Can’t wait? Watch a video on how these two work together.)

TRY THESE AMAS

Launch your Simple Data Pipe app again and return to the Load reddit AMA Data section. In step 7, swap in one of these AMA URLs and check out the results. * Matei Zaharia, creator of Spark https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/ * Chris Rock https://www.reddit.com/r/IAmA/comments/2pi16o/chris_rock_here_ama/ * Tim Berners-Lee https://www.reddit.com/r/IAmA/comments/2091d4/i_am_tim_bernerslee_i_invented_the_www_25_years/ * Neil deGrasse Tyson https://www.reddit.com/r/IAmA/comments/qccer/i_am_neil_degrasse_tyson_ask_me_anything/ * Bill Gates https://www.reddit.com/r/IAmA/comments/18bhme/im_bill_gates_cochair_of_the_bill_melinda_gates/ * Louis C. K.
https://www.reddit.com/r/IAmA/comments/n9tef/hi_im_louis_ck_and_this_is_a_thing/ * Amy Poehler https://www.reddit.com/r/IAmA/comments/2kp7w0/im_amy_poehler_amaa/ * IBM’s Chef Watson https://www.reddit.com/r/IAmA/comments/3id842/we_are_the_ibm_chef_watson_team_along_with_our/ * Barack Obama https://www.reddit.com/comments/z1c9z/i_am_barack_obama_president_of_the_united_states/SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Use Apache Spark, Cloudant, and Watson Tone Analyzer to perform sentiment analysis on a reddit Ask Me Anything web event.",Sentiment Analysis of Reddit AMAs,Live,267 787,"* R Views * About this Blog * Contributors * Some Resources * * R Views * About this Blog * Contributors * Some Resources * REPRODUCIBLE FINANCE WITH R: SECTOR CORRELATIONS SHINY APP by Jonathan Regenstein In a previous post , we built an R Notebook that pulled in data on sector ETFs and allowed us to calculate the rolling correlation between a sector ETF and the S&P 500 ETF, whose ticker is SPY. Today, we’ll wrap that into a Shiny app that allows the user to choose a sector, a returns time period such as ‘daily’ or ‘weekly’, and a rolling window. For example, if a user wants to explore the 60-day rolling correlation between the S&P 500 and an energy ETF, our app will show that. As is customary, we will use the flexdashboard format and reuse as much as possible from our Notebook. The final app is here , with the code available in the upper right-hand corner. Let’s step through this script. The first code chunk is where we do the heavy lifting in this app. We will build a function that takes as parameters an ETF ticker, a returns period, and a window of time, and then calculates the desired rolling correlation between that ETF ticker and SPY. That function uses getSymbols() to pull in prices and periodReturns() to convert to log returns, either daily, weekly or monthly. Then we merge into one xts object and calculate rolling correlations, depending on the window parameter. It should look familiar from the Notebook, but honestly, the transition from the previous Notebook to this code chunk wasn’t as smooth as would be ideal. I broke this into two functions in the Notebook, but thought it flowed more smoothly as one function in the app since I don’t need the intermediate results stored in a persistent way. Combining the two functions wasn’t difficult, but it did break the reproducible chain in a way that I don’t love. 
In the real world, I would (and, in my IDE, I did) refactor the Notebook to line up with the app better. Enough self-shaming, back to it. Next, we need to create a sidebar where our users can select a sector, a returns period and a rolling window. Nothing fancy here, but one thing to note is how we use selectInput to translate from the sector to the ETF ticker symbol. This means our users don’t have to remember those three-letter codes; they just choose the name of the desired sector from a drop-down menu. Have a close look at the last three lines of code in that chunk. These are a new addition that let the user determine if the mean, max and/or min rolling correlation should be included in the dygraph. We haven’t built any way of calculating those values yet, but we will shortly. This is the UI component. Those three lines of code create checkboxes and are set to default as FALSE, meaning they won’t be plotted unless the user chooses to do so. I wanted to force the user to actively click a control to include these, but that’s a purely stylistic choice. Perhaps you don’t want to give them a choice at all here? Next, we create our reactive values that will form the substance of this app. First, we need to calculate and store an object of rolling correlations, and we’ll use a reactive that passes user inputs to our sector_correlations function. Then, we build reactive objects to store mean, minimum and maximum rolling correlations. These values will help contextualize our final dygraph. At this point, we have done some good work: built a function to calculate rolling correlations based on user input, built a sidebar to take that user input, and coded reactives to hold the values and some helpful statistics. The hard work is done, and really we did most of the hard work in the Notebook, where we toiled over the logic of arriving at this point. All that’s left now is to display this work in a compelling way. Dygraphs plus value boxes has worked in past; let’s stick with it! That dygraph code should look familiar from the Notebook and previous posts, except we have added a little interactive feature. By including if(input$mean == TRUE) {avg()} , we allow the user to change the graph by checking or unchecking the ‘mean’ input box in the sidebar. We are going to display this same information numerically in a value box, but the lines make this graph a bit more compelling. Speaking of those value boxes, they rely on the reactives we built above, but, unlike the graph lines, they are always going to be displayed. The user doesn’t have a choice here. Again, this just adds a bit of context to the graph. Note that the lines and the value boxes take their value from the same reactives. If we were to change those reactives, both UI components would be affected. Our job is done! This a simple but powerful app: the user can choose to see the 60-day rolling correlations between the S&P 500 and an energy ETF, or the 10-month rolling correlations between the S&P 500 and a utility ETF, etc. I played around with this a little bit and was surprised that the 10-week rolling correlation between the S&P 500 and health care stocks plunged in April of 2016. Someone smarter than I can probably explain, or at least hypothesize, as to why that happened. A closing thought about how this app might have been different: we are severely limiting what the user can do here, and intentionally so. The user can choose only from the sector ETFs that we are offering in the selectInput dropdown. 
This is a sector correlations app, so I included only a few sector ETFs. But, we could just as easily have made this a textInput and allowed the users to enter whatever ticker symbol struck their fancy. In that case, this would not longer be a sector correlations app; it would be a general stock correlations app. We could go even further and make this a general asset correlations app, in which case we would allow the user to select things like commodity, currency and housing returns and see how they correlate with stock market returns. Think about how that might change our data import logic and time series alignment. Thanks for reading, enjoy the app, happy coding, and see you next time! Jonathan Regenstein 2017-02-02T19:43:19+00:00LEAVE A COMMENT CANCEL REPLY Comment 250 Northern Ave, Boston, MA 02210 844-448-1212 info@rstudio.com DMCA Trademark Support ECCN * Missed #rstudioconf ? Here are some tips from IDE engineer @kevin_ushey ! Slides from all talks forthcoming. #rstats twitter.com/bhaskar_vk/sta… 2 weeks ago Copyright 2016 RStudio | All Rights Reserved | Legal Terms Twitter Linkedin Facebook Rss Email github Rss","In a previous post, we built an R Notebook that pulled in data on sector ETFs and allowed us to calculate the rolling correlation between a sector ETF and the S&P 500 ETF, whose ticker is SPY. Today, we’ll wrap that into a Shiny app that allows the user to choose a",Sector Correlations Shiny App,Live,268 788,"☰ * Login * Sign Up * Learning Paths * Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * BLOG Welcome to the Big Data University Blog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (September 06, 2016) * This Week in Data Science (August 30, 2016) * This Week in Data Science (August 23, 2016) * This Week in Data Science (August 16, 2016) * This Week in Data Science (August 09, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (SEPTEMBER 06, 2016) Posted on September 6, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * How Tech Giants Are Devising Real Ethics for Artificial Intelligence – Researchers from tech companies have been meeting to discuss the impact of artificial intelligence on jobs, transportation and even warfare. * The Three Faces of Bayes – The three main uses of the term “Bayesian” are presented through the lens of a naïve Bayes classifier. * IBM Data Science Experience: First steps with yorkr – A user goes through his use of IBM’s Data Science Experience, an integrated delivery platform for analytics. * EPA challenges communities to develop air sensor data platforms – EPA’s chief data scientist believes that real-time pollution sensors are the way of the future. * Inside the ‘brain’ of IBM Watson: how ‘cognitive computing’ is poised to change your life – IBM’s Cognitive Computing revolution is changing how doctors, financial experts, and many other professions find and investigate key issues in their work. * The sneaky math that made the lottery more alluring – and harder to win – In recent years, popular lotteries have been re-engineered to make their contests more appealing, but also further decrease your odds of hitting the jackpot. 
* What Robots Can Learn from Babies – Researchers at the Allen Institute for Artificial Intelligence (Ai2) in Seattle have developed a computer program that shows how machines determine how the objects captured by a camera will most likely behave. * Majority of mathematicians hail from just 24 scientific ‘families’ – The evolution of mathematics is traced using a comprehensive genealogy database. * Enhanced DMV facial recognition technology helps NY nab 100 ID thieves – In January, the New York State DMV enhanced its facial recognition technology by increasing the measurement points of a driver’s license picture. * How to Become a Data Scientist – Part 1 – Check out this excellent (and exhaustive) article on becoming a data scientist, written by someone who spends their day recruiting data scientists. * Essentials of Machine Learning Algorithms (with Python and R Codes) – Sunil created a guide to simplify the journey of aspiring data scientists and machine learning enthusiasts across the world. * Could artificial intelligence help humanity? Two California universities think so – Two California universities separately announced new centers devoted to studying the ways in which AI can help humanity. * Y’all have a Texas accent? Siri (and the world) might be slowly killing it – Voice recognition tools such as Apple’s Siri still struggle to understand regional quirks and accents, and users are adapting the way they speak to compensate. * Big data salaries set to rise in 2017 – Starting salaries for big data pros will continue to rise in 2017 as companies jockey to hire skilled data professionals. * 10 Years of Color – Analysis on my Personal Photo Collection – See how Brett Kobold creates a data visualization that shows the most prominent color from every photo he took over the last 10 years. UPCOMING DATA SCIENCE EVENTS * IBM World of Watson 2016 – Unleash your company’s cognitive potential at IBM World of Watson 2016 this October. * Graph Processing with Spark GraphX – Learn about graph processing on September 8th. * IBM DataFirst Launch Event – Join data and analytics leaders and practitioners from the open source community, startups, and enterprises at the IBM DataFirst Launch Event on September 27th in NYC. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (September 06, 2016)",Live,269 791,"400 BAD REQUEST LIemNiB4/cqS644CH @ Tue, 04 Oct 2016 14:39:26 GMT SEC-43",Compilation of Youtube videos teaching Statistics using R and other languages,Learning Statistics on Youtube,Live,270 799,"CONFIGURING COMPOSE ENTERPRISE ON GOOGLE CLOUD PLATFORM Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Sep 1, 2016*Compose is now available on Google's Cloud Platform . 
Being a different platform, the configuration for launching your first Compose Enterprise cluster is also different and in this article, we'll walk you through what you need to do to create your own database powerhouse in your private cloud. Before creating a Compose Enterprise cluster, you will need to do some preparatory work in the Google Cloud. We'll assume at this point that you've created a project under your Google Cloud account and enabled billing on it - without that Google will not let you enable the APIs used by Compose to create the hosts needed for your cluster. GOOGLE CLOUD SEEDING You should be here at your Dashboard on Google Cloud. There's a ""Service Account"" we need to create so select the menu in the top left of the dashboard: This will slide in the main Google Cloud Platform menu. If you are wondering where something is in Google Cloud Platform, head to this menu and you should be able to filter it down. What we want is at the top though. And select IAM & Admin . Then select Service accounts ... Followed by clicking on the Create service account label. This will get you here: Enter a name for your service account and make sure you check the Furnish a new key check box. This will ensure that the keys you need to access your project and resources will be transferred to you as a JSON file. Watch out for that, it's a blink or you'll miss it download. Make sure that file is safe and we can move on. The next thing to be done is to give your new service account the ability to manage storage. Go back to the left hand menu and select IAM . Then look up that account you just named in the list of accounts displayed. You need to grant some roles, ""Storage Admin"", ""Storage Object Admin"" and ""Service Account Manager"", to that user. For the first two, click on the Editor drop down and then scroll the list till you see Storage . Click on that and a pop-up menu opens up. Click on ""Storage Admin"" and ""Storage Object Admin"" to add their roles. Next select ""Project"" in the menu: Click on ""Service Account Actor"" then click on Save . CREATING THE CLUSTER ON COMPOSE Head over to your Compose console and select the Enterprise button from your left hand side bar. Now click the Create Cluster button on the right. Enter a name for your new Enterprise cluster at the top and select ""Google Cloud Platform"" from the options. The page will then expand with this form: Most of the values for this form is in the JSON file we downloaded when the service account was created. Open that up in your preferred editor. The project_id field value should be copied, less the double-quotes, into the project_id field. The same for the private_key_id to private key id , private_key to private key , client_email to client_email and client_id to client id . Watch out for the private_key value, it's a long one. The last two fields are not in the JSON document. The region is the Google Cloud Region you want your hosts deployed to, such as us-east1 - find out more about Regions and Zones on the Google Cloud Platform help. The other field, bucketname is a name for the backups - enter a name which you prefer here. Finally, there is a slider which shows how much Compose will charge per month for provisioning a cluster that supports that much total RAM. Each cluster is made of 3 hosts so 24GB, for example, represents 8GB of RAM on each host. You will, of course, be provisioning and charged directly by Google for those hosts. In our walkthrough, we'll go with 24GB and then click Create Cluster . 
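Before moving on, if you want to sanity-check which values from the key file go into which form fields, a small throwaway script like this works (the file name is an assumption; use whatever name your browser saved the key under):

# Quick check of the downloaded service account key (file name assumed).
import json

with open('compose-service-account.json') as f:
    key = json.load(f)

for field in ('project_id', 'private_key_id', 'client_email', 'client_id'):
    print(field + ':', key[field])

# private_key is long; just confirm it is present rather than printing it all.
print('private_key present:', key['private_key'].startswith('-----BEGIN'))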
GETTING YOUR OWN DEPLOYMENT CONFIGURATION The cluster will be created on Compose's side at this point, but the hosts on the Google side have yet to be created and connected to Compose. That's what this next page is about: The first step is to create a Google Deployment Manager configuration which can do the creation work for you. This is a YAML file which Compose will create according to your needs. First, we select where we want to deploy these hosts. For example, if you entered us-east1 in the preceding Create Cluster form, you should select ""Eastern US"". You will need to select the size of host for your deployment. Google offers a number of predefined standard hosts and high-memory hosts . Select one that matches up with the memory that you selected when you created the cluster. For example, if we selected 24GB when creating the cluster, that equates to 8GB of RAM per host, and looking at the predefined machine types, the nearest match is the ""n1-standard-2"" with two virtual cores and 7.5GB of RAM. We'll select that on the page. The last item is the amount of storage to initially allocate to your deployments. The slider ranges from 512GB to 3TB. Select how much you'd like and we're ready to make the configuration file. Click on Download Configuration and a file will be downloaded to your system called compose-enterprise.yaml . ENABLE APIS Before you move on to the next step, you need to activate an API, the Google Cloud Deployment Manager V2 API. You can navigate to this through the API option in the Cloud dashboard, or simply visit the API page and click on Enable . STARTING THE CLUSTER DEPLOYMENT ON GOOGLE This file needs to be used with machine with Google Cloud SDK installed on it. This can be done on a workstation or you can make use of the Google Cloud Shell, a remote shell running on the Google cloud. In the case of having the SDK installed locally, follow the appropriate install instructions for your operating system . That process will include setting up your account and connection to Google Cloud so you will need your account details to hand. If you only have one project on Google Cloud, the SDK will automatically select that. When the process is complete, you should be able to run gcloud auth list to see which accounts are configured and active. In the case of using the Google Shell we use the built in shell in the Google Cloud Platform dashboard. Select the ""prompt"" icon in the top menu and it will connect, through the web browser, to a shell on a system preconfigured. What isn't in the shell is the configuration file we need. There's a number of routes to getting it over but the quickest way is just to cut and paste it into an editor. On Mac OS X, you can do cat compose-enterprise.yaml | pbcopy or on Linux, you can install xclip and then run cat compose-enterprise.yaml | xclip -selection clipboard . Then you can go to the Google Cloud Shell and run the nano editor with nano compose-enterprise.yaml , paste the clipboard into the editor and then exit with control-X then y then return. With the file in place we can continue. The command displayed in step two on the Compose Hosts page now needs to be executed. It assumes that you will be in the same directory as the file you downloaded (or copied over). If it isn't, change the file name that comes after the --config to point at your downloaded file. If you get an error like: ERROR: (gcloud.deployment-manager.deployments.create) ResponseError: code=403, message=Access Not Configured. 
Google Cloud Deployment Manager API has not been used in project 99999999999 before or it is disabled. Then go back to the Enable APIs step above, do that and retry the command. For illustration, what you should see is something like this, only with your own names in it: [~] gcloud deployment-manager deployments create exemplumcluster --config Downloads/compose-enterprise.yaml Waiting for create operation-1470301061838-5393b2480f4b1-e795a8d6-47a1fe4b...done. Create operation operation-1470301061838-5393b2480f4b1-e795a8d6-47a1fe4b completed successfully. NAME TYPE STATE ERRORS exemplumcluster-disk-0-data compute.v1.disk COMPLETED [] exemplumcluster-disk-0-swap compute.v1.disk COMPLETED [] exemplumcluster-disk-1-data compute.v1.disk COMPLETED [] exemplumcluster-disk-1-swap compute.v1.disk COMPLETED [] exemplumcluster-disk-2-data compute.v1.disk COMPLETED [] exemplumcluster-disk-2-swap compute.v1.disk COMPLETED [] exemplumcluster-image compute.v1.image COMPLETED [] exemplumcluster-instance-0 compute.v1.instance COMPLETED [] exemplumcluster-instance-1 compute.v1.instance COMPLETED [] exemplumcluster-instance-2 compute.v1.instance COMPLETED [] exemplumcluster-network compute.v1.network COMPLETED [] exemplumcluster-network-capsules compute.v1.firewall COMPLETED [] exemplumcluster-network-udp-4789 compute.v1.firewall COMPLETED [] [~] The cluster is now being deployed and after a few minutes, reloading the Hosts page will show you that the initialisation is taking place: It should take around 20 minutes for this process to complete as each element of the Google cluster meshes with Compose's cluster management. After 20 minutes, refreshing the page should show the cluster as ready to run: THE FIRST DATABASE DEPLOYMENT Your first database deployment can now be done. Click on Create Deployment and you'll see the Compose database selection page. Select any database and you'll see the form for deploying your database, with one difference: There's one difference from the default deployment page. Because we have a Enterprise cluster, the Create Deployment On option appears and will default to the Enterprise cluster. It's still possible to select Compose Hosted databases, but they are charged separately from the Enterprise cluster. The interface defaults to the Enterprise cluster to avoid that. Enter the name for your deployment, select your options and configure your initial deployment resources. On Enterprise, the current default minimum is a configuration with 1GB of RAM. Once done, click Create Deployment and Compose will provision your database. VPN STEPS It's at this point in the configuration process that you have a choice. At Compose, we understand that Compose Enterprise customers will have different security requirements and, rather than open up ports to your cloud infrastructure automatically, we give you the opportunity to apply your own security procedures and processes. Briefly, that Compose hosts will need to be accessible from wherever you are administering them. You can configure a VPN or SSH tunnelling to achieve this. Within the network, enable your access host to pass TLS traffic to and from the hosts and this should cover most databases requirements. Applications configured within your project will require that the firewall rules allow them to connect to the Compose database hosts. The internal IP addresses of the hosts are mapped to *.compose.direct DNS addresses. That said, we also know that users may just want to quickly configure a VPN to access their databases. 
In that case we offer the following guide to creating an IPSEC VPN with the least steps possible. CREATING THE VPN INSTANCE The first step in this process is to create a machine instance that will run your VPN software. Go to the Google Cloud Platform console and select Compute Engine from the products menu. Select VM Instances from the sidebar and then select Create Instance from the top list of options. Give the new instance a name, it's mostly decorative – we'll call ours vpngateway – then select a zone for this instance to live in; you can accept the default offered if you wish or you can set it to a zone in the region where you placed your Compose Enterprise cluster. Generally for administration you won't need a whole dedicated CPU to handle the VPN load, so in Machine Type select Micro to reduce the cost of this new node. For the boot disk, click Change and select Ubuntu 14.04 LTS. Then carry on down the page till you hit the Management, disk, networking, SSH keys link. Click that to reveal the options underneath. The first screen that is revealed will be Management . Click in the Tags field and enter vpn . We'll need that tag when we set up the firewall rules. Now select Networking . This is where we set up this instance to be our gateway between the outside world and our Compose cluster. The Network field should be set to the network that was created when we created the cluster – in our example, we named the cluster exemplumcluster so the network is exemplumcluster-network so we select that. Set the External IP to ""New static IP address"". A dialog will pop up asking you to reserve an IP address with a name – we'll use vpnip for a name – enter a name and click Reserve . Finally set IP Forwarding to On and click Create . The display will now return to the VM Instances dashboard with an extra entry and after a little while, our new node will be deployed and it'll show an SSH button next to it. It's time to log into our gateway to configure it. INSTALLING THE VPN SOFTWARE Click that SSH button and Google will start a session to the VPN gateway. There's a lot of ways you could enable this as a gateway and we're going to use one of the quickest and simplest ones we've found hwdsl2's setup-ipsec-vpn . This is script which automatically configures the system to run a IPSEC VPN and it can be run with no user intervention whatsoever - see the [installation instructions]9 https://github.com/hwdsl2/setup-ipsec-vpn#ubuntu--debian ) for alternative ways of setting it up. For our configuration needs, all we need to do is run this: wget https://git.io/vpnsetup -O vpnsetup.sh && sudo sh vpnsetup.sh Hit return and watch as the script downloads and builds the required code into a VPN. When it finishes, it'll display something like this: ================================================ IPsec VPN server is now ready for use! Connect to your new VPN with these details: Server IP: 104.196.169.215 IPsec PSK: BqSfZg8qcNFDjLAc Username: vpnuser Password: M7J6Bt3EyCmwPZbM Write these down. You'll need them to connect! Important notes: https://git.io/vpnnotes Setup VPN clients: https://git.io/vpnclients ================================================ That bit about writing them down, do it then exit from the SSH session. These are our IPSEC VPN credentials. The VPN is running, but there's still a step to go. OPENING THE FIREWALL We need to allow the traffic to flow from the outside to the VPN and to allow TLS traffic to go between the VPN and the hosts. This can be done from the GCP console. 
Go to the Networking product page and you'll see the general networking overview. There will be a default network, at least, and the network for the cluster – in our example exemplumcluster-network . Select the clusters network and you'll now see this: We can add firewall rules here simply by clicking on Add firewall rule which brings up this form: First, the incoming rule for the VPN. Give it the name vpn-rule and select ""Allow from any source (0.0.0.0/0)"" in the source filter. Then, in the Allowed protocols and ports field put tcp:1701; udp:4500; udp:500 This allows TCP/IP traffic on port 1701 and UDP traffic on ports 4500 and 500. In the Target tags field, enter vpn , the tag allocated to the instance when it was created earlier. This will lock down the rule to being between the outside word and the VPN host. Click Create and the rule will be applied. That lets the traffic from the VPN in. Now we need to enable TLS connections within the cluster. Click Add firewall rule again. Name this rule ""databrowser"". The Source Filter will need to be set to Subnetworks and when you do that, the form changes to allow you to enter those subnetworks: We need all the subnetworks in this case, so click Select all and Ok . In the Allowed protocols and ports enter: tcp: 443 We won't be setting any target tags as this rule will apply to all systems in the cluster. Click Create and that should make your clusters VPN connection ready to use. CONFIGURING THE INCOMING CONNECTION How you set up your incoming connection will entirely depend upon your operating system. Recall back when the connection credentials were generated, there were a few URLs included. Specifically, https://git.io/vpnclients , which gives directions for creating a client VPN connection on Windows, Linux, Mac OS X, iOS and Android. We'll use Mac OS X as an example here. As per the instructions at the previous link, go to System Preferences and then to the Network section, click on the + at the bottom of the interface list to add an interface and select VPN in the drp down that appear. Diverging slightly from the instructions, select Cisco IPsec as the VPN type. Click Create to make the network interface and you'll return to the Network screen with the new interface selected. Now we can fill the details for our VPN server connection. From the information we recorded earlier... * enter the Server IP into the Server Address field * enter the Username into the Account Name field * enter the Password into the *Password field * to use the IPsec PSK * click Authentication Settings * select Shared Secret * enter the IPsec PSK into the Shared Secret field * click Ok * click Apply * click Connect TESTING THE CONNECTION The VPN should have been configured and connected by now. If you want to see if it is configured you can either try selecting the data browser in any database that have a browser option. The data browser is integrated into your cluster and seamlessly blends with the Compose console; if it appears, the VPN is working. For any database with a HTTPS web ui (eg RethinkDB or RabbitMQ), you can also try connecting to their admin UI (details in the Compose console for deployed databases). BEYOND DEPLOYMENT You now have a Compose cluster running of the Google Cloud Platform, complete with VPN access. You can deploy new compute instances into the Google Cloud project to run your application and connect directly to those databases, or create secure tunnels or SSL connections to remote applications. 
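For example, if the deployment you created happened to be PostgreSQL, an application instance inside the project could connect over TLS roughly like this (the host, port, and credentials are placeholders; take the real values from the deployment's connection details in the Compose console):

# Sketch only: connect to a Compose PostgreSQL deployment from inside the project.
import psycopg2

conn = psycopg2.connect(
    host='portal-0.example-deployment.compose.direct',  # placeholder host
    port=10000,                                          # placeholder port
    dbname='compose',                                    # placeholder database
    user='admin',                                        # placeholder user
    password='PASSWORD',                                 # placeholder password
    sslmode='require'                                    # TLS connection
)

with conn.cursor() as cur:
    cur.execute('SELECT version()')
    print(cur.fetchone()[0])

conn.close()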
The choice is yours with Compose Enterprise.","Compose is now available on Google's Cloud Platform. Being a different platform, the configuration for launching your first Compose Enterprise cluster is also different and in this article, we'll walk you through what you need to do to create your own database powerhouse in your private cloud.",Configuring Compose Enterprise on Google Cloud Platform,Live,271 800,"SIMPLE DATA VISUALIZATION IN APACHE COUCHDB™ D3, RIGHT IN THE FAUXTON DASHBOARD, VIA USERSCRIPT Mike Broberg / May 9

Have you ever wanted to quickly visualize the results of CouchDB’s built-in reduce functions for some quick feedback, without leaving the context of its handy dashboard?

The JSON representation of a CouchDB map-reduce operation, aggregating movies by rating.

INTRODUCING CHANGO

Recently, my colleague and office neighbor va barbosa published an article on integrating a data-visualizing view directly into the Cloudant dashboard. Essentially, it’s a userscript that adds a new menu button to a database view when the results are aggregated and JSON is returned in a specific format. Clicking the Chart button will render a D3 chart. Now, it works in CouchDB, too:

Automatically visualize aggregated CouchDB JSON with Chango.

Because Cloudant and CouchDB now share the same codebase, updating Va’s userscript (we call it Chango, as a portmanteau combining “chart + Mango”) was pretty straightforward. While the spirit of the Mango query interface is to make querying CouchDB easier, we decided to riff on the name with “Chango,” since it aspires to make data visualization in CouchDB more convenient. Here is the Chango script, in its entirety (embedded gist in the source post, captioned “All the userscript for Chango”).

GENERATING YOUR FIRST CHANGO CHART

Chango currently works using the Firefox browser with the Greasemonkey extension. Once you have set up the browser, click the view raw button and install the script when prompted. Rather than write your own reduce functions, CouchDB comes with built-in reduce functions that run in Couch’s native Erlang. Make sure to specify your reduce when defining your database view, like so:

Using the built-in reduce function _sum to aggregate results on the Movie_rating field. _count would also work here, and without emitting the value 1 for each document in the index.

Then, include the reduce in your query options when using the dashboard:

Including the Reduce query option in the Fauxton dashboard.

You’ll be all set to generate your chart from there. Some Chango charts expect data in the same format. For example, pie-, bar-, and bubble-chart all expect to render data in the schema of [{ key: """", value: n }, ...].
When that happens, Chango will randomly select one of them. Just toggle the Chart button until you get the pie, bar, or bubble visualization you prefer. Through Chango’s dependency on Va’s simple-data-vis project, you can find the JSON schemas that SimpleDataVis expects . There’s more there than your basic charts covered here, so check it out. CHANGO UNCHAINED With that, we’re excited to see what the CouchDB community does with Chango and SimpleDataVis. Please let us know in the comments about any modifications you’ve made or questions you have. Thanks for checking out Chango, and please ♡ this article to recommend it to other Medium readers. Thanks to va barbosa . * Data Visualization * Couchdb * JavaScript * Web Development * Cloudant 1 Blocked Unblock Follow FollowingMIKE BROBERG Editor for the IBM Watson Data Platform developer advocacy team. OK person. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * 1 * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","Have you ever wanted to quickly visualize the results of CouchDB’s built-in reduce functions for some quick feedback, without leaving the context of its handy dashboard? Recently, my colleague and…",Simple data visualization in Apache CouchDB™ – IBM Watson Data Lab – Medium,Live,272 804,"* Videos & Webinars * About Me // Contact * Download The E-book! * Blog * Why Data? Menu Close * Videos & Webinars * About Me // Contact * Download The E-book! * Blog * Why Data? Hey, I'm Tomi Mester. This is my data blog, where I give you a sneak peek into online data analysts' best practices. You will find here articles and videos about data analysis, AB-testing, researches, data science and more...SUBSCRIBE FOR DATA ARTICLES HERE: Email Address * Name *© 2017 Data36 . Powered by WordPress . STATISTICAL BIAS TYPES EXPLAINED (WITH EXAMPLES) – PART1 Written by Tomi Mester on August 21, 2017Humans are stupid. We all are, because our brain has been made that way. The most obvious evidence to this built-in stupidity is the different biases, that our brain produces. Even if it’s so, at least we can be a bit smarter, than the average, if we are aware of them. This is a data blog, so in this article I’ll focus only on the most important statistical bias types – but I promise, that even if you are not an aspiring data professional (yet), you will profit a lot from this write up. For the ease of understanding for each statistical bias type I’ll provide two examples: an everyday one and a more online analytics related one! And just to make this clear: biased statistics are bad statistics. Everything I will describe here is to help you prevent the same mistakes, that some of the less smart “researcher” folks are doing time to time. THE MOST IMPORTANT STATISTICAL BIAS TYPES There is a long list of statistical bias types. I’ll cover those, that can affect your job as a data scientist or analyst the most. These are: 1. Selection bias 2. Self-selection bias 3. Recall bias 4. Observer bias 5. Survivorship bias 6. Omitted variable bias 7. Cause-effect bias 8. Funding bias 9. Cognitive bias STATISTICAL BIAS #1: SELECTION BIAS proper random sampling selection bias Selection bias occurs, when you are selecting your sample or your data wrong. Usually this means accidentally working with a specific subset of your audience instead of the whole, hence your sample is not representative of the whole population. 
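To see the effect in numbers, here is a tiny, purely made-up simulation (every number in it is invented for illustration) of what happens when you estimate an opinion from an easy-to-reach subgroup instead of a proper random sample:

# Toy simulation of selection bias; all numbers are made up.
import random

random.seed(42)

# Population of 100,000 people in two equally sized groups.
# Group A approves of something with 80% probability, group B with 20%.
population = ([('A', random.random() < 0.8) for _ in range(50_000)] +
              [('B', random.random() < 0.2) for _ in range(50_000)])

def approval_rate(sample):
    return sum(approves for _, approves in sample) / len(sample)

random_sample = random.sample(population, 1_000)                            # proper random sample
easy_sample = random.sample([p for p in population if p[0] == 'A'], 1_000)  # only the 'easy to reach' group

print('True rate:           {:.1%}'.format(approval_rate(population)))      # ~50%
print('Random sample:       {:.1%}'.format(approval_rate(random_sample)))   # ~50%
print('Group-A-only sample: {:.1%}'.format(approval_rate(easy_sample)))     # ~80%, badly biased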
There are many underlying reasons, but by far the most typical I see: collect and work only with data that is easy to access . Everyday example of selection bias:Please answer this question: What’s people’s overall opinion about Donald Trump’s presidency? Most people have an immediate and very “educated” answer for that. Unfortunately for many of them the top source of their information is their Facebook feed. Very bad and sad practice, because what they see there does not show the public opinion – it’s only their friends’ opinion. (In fact, it’s even narrower, because they see there only those friends’ opinion, who are active and posting to Facebook – so most probably 25-35 and extroverted people are overrepresented.) That’s a classic selection bias: easy-to-access data, but only for a very specific, unrepresentative subset of the whole population. Note 1: I do recommend blocking your Facebook feed for many reasons, but mostly not to get narrow-minded by it: FB News Feed Eradicator ! Note 2: If you want to read another classy selection bias story, check how Literary Digest did a similar mistake (also referred as undercoverage bias) ~80 years ago! Online analytics related example of selection bias:Another example for selection bias is, when you send out a survey for your newsletter subscribers – asking what new product would they pay for. Of course, interacting with your audience is important (I send out surveys to my Newsletter Subscribers sometimes too), but when you analyze these survey results, you should be aware, that your newsletter subscribers are not representing your potential paying audience. There might be a bunch of people, who are willing to pay for you, but they are not a part of your newsletter list. And on the other hand there might be a lot of people on your list, who would never spend money on your products, they are around just to get notified about your free stuff. And that’s only one reason yet (see the rest below), why surveying is just the simple worst research method. By the way, for this particular example, I’d suggest to do fake door testing instead! STATISTICAL BIAS #2: SELF-SELECTION BIAS Self-selection bias is a subcategory of selection bias. If you let the subjects of your analyses/researches select themselves, that means that less proactive people will be excluded. The bigger issue is that self-selection is a specific behaviour – that implies other specific behaviours – thus this sample does not represent the entire population. Everyday example of self-selection bias:Any type of polling/surveying. Eg. when you want to research successful entrepreneurs’ behaviour with surveys, your results will be skewed for sure. Why? Because successful people most probably don’t have time/motivation to answer or even take a look at random surveys. So the 99% of your answers will come from entrepreneurs, who thinks they are successful, but in fact they are not. In this specific case, I’d rather try to lure people who are proven to be successful into face-to-face interviews. Online analytics related example of self-selection bias:Say, you have an online product – and a knowledge base for that with 100+ how-to-use-the-product kind of articles in it. Let’s find out how good your knowledge base is and compare the users, who read at least 1 article from it to the users who didn’t. We find that the article-reader users are 50% more active in terms of product usage, than the non-readers. Knowledge base performs great! Or does it? 
In fact, we don’t know, because the article-readers are a special subset of your whole population, who might have a higher commitment to your product and this might be the reason of their interest in your knowledge base. With other words, they have “selected themselves” into the reader-group. This self-selection bias leads to a classy correlation/causation dilemma , that you can never solve by data research, just by A/B testing . STATISTICAL BIAS #3: RECALL BIAS Recall bias is another common error of interview/survey situations, when the respondent doesn’t remember correctly for things. It’s not bad or good memory – humans have selective memory by default. After a few years certain things stay, others fade. It’s normal, but it makes researches much more difficult. Everyday example of recall bias:How was that vacation 3 years ago? Awesome, right? Looking back we tend to forget the bad things and keep remembering to the good things only. Although it doesn’t help us to objectively evaluate different memories, I’m pretty sure our brain is like that for a good reason. Online analytics related example of recall bias:I’m holding data workshops from time to time. I usually send out feedback forms afterwards, so I can make the workshops better and better based on participants’ feedbacks. I usually send them the day after the workshop, but there was one particular case when I completely forgot it and sent it one week later. Looking at the comments I got, that was my most successful workshop of all time. Except that it’s not necessarily true. It’s more likely that recall bias might have kicked in pretty hard. One week after the workshop neither of the attendees would recall if the coffee were cold or if I was over-explaining a slide here or there. They remembered only to the good things. Not that I wasn’t happy for their good feedback, but if the coffee were cold, I would want to know about it – to get it fixed for the next time… STATISTICAL BIAS #4: OBSERVER BIAS Observer bias is happening, when the researcher subconsciously projects his/her expectations to the research. It can come in many forms. Eg. (unintentionally) influencing the participants (only at interviews and surveys) or doing some serious cherry picking (focusing rather on the statistics that support our hypothesis, than to the statistics, that doesn’t.) Everyday example of observer bias:Fake news! 🙂 It needs a very thorough and consequent investigative journalist to be OK with rejecting her own null-hypothesis at the publication phase. Eg. if a journalist spends 1 month on an investigation to prove that the local crime rate is high because of the careless police officers – most probably she will find a way to prove it – leaving aside the counter arguments and any serious statistical considerations. Extended by other common journalist-kind-of statistical biases, like funding bias (studies tend to support the financial sponsors’ interests) or publication bias (to fake or extremize the research results to get published) led me to the conclusion that reading any type of online media will never get me closer to any sort of truth about our world. So I’d rather suggest to consume trustful statistics than online media – or even better: find trustworthy raw data and do your own analyses to learn a “truer truth”. Online analytics related example of observer bias:Observer bias can affect online researches as well. Eg. when you are doing a Usability Tests . 
As a user researcher, you know your product very well (and maybe you like it too), so subconsciously you might have expectations. If you are a pro User Experience Researcher, you will know, how not to influence your testers by your questions – but if you are new to that field, make sure you spend enough time with preparing good, unbiased questions and scenarios. Maybe consider hiring a professional UX consultant to help. Note: in my workshop feedback example observer bias can occur if I send out the survey right after the workshop. Participants might be under the influence of the personal encounter – and this might indicate that they don’t want to “hurt my feelings” with negative feedbacks. Workshop feedback forms should be sent 1 day after the workshop itself. STATISTICAL BIAS #5: SURVIVORSHIP BIAS Survivorship bias is a statistical bias type, where the researcher is focusing only to that part of the data set, that already went through some kind of pre-selection process – and missing those data-points, that fell off during this process (because they are not visible anymore). Everyday example of survivorship bias:One of the most interesting stories of statistical biases: falling cats. There was a study written in 1987 about cats falling out from buildings. It stated that the cats who fell from higher have less injuries than cats who fell from lower. Odd. They explained the phenomenon with the terminal velocity, which basically means that cats falling from higher than six stories are reaching their maximum velocity during the fall, so they start to relax, prepare to landing and that’s why they don’t injure themselves that hard. As ridiculous as it sounds, as mistaken this theory turned out to be. 20 years later, the Straight Dope newspaper pointed out to the fact, that those cats who are falling from higher than six stories might have died with a higher chance, thus people don’t take them to the veterinarian – so they were simply not registered and didn’t become the part of the study. And the cats that fell from higher, but survived were simply falling more luckily, that’s why they had less injuries. Survivorship bias – literally. (I feel sorry for the cats though.) Online analytics related example of survivorship bias:Reading case studies. Case studies are super useful to give you inspiration and ideas to your new projects. But remind yourself all the time, that only success stories are published! You will never hear about the stories, where one used the exact same methods, but failed. Not so long ago I’ve read a bunch of articles about exit intent pop-ups. Every article declared that exit intent pop-ups are great and brought +30%, +40%, +200% in number of newsletter subscriptions. In fact it works pretty decent on my website too… But let’s take a break for a moment. Does it mean that exit-intent popups will work for everyone? Isn’t it possible that those guys, who have tested exit-intent pop-ups and found that it actually hurts the user experience, the brand or the page load time, they have just simply didn’t write an article about this bad experience? Of course, it’s possible – nobody likes to write about unsuccessful experiment results… The point is: if you read a case study, think about it, research it and test it – and decide based on hard evidence if it’s the right solution for you or not. 4 MORE STATISTICAL BIAS TYPES AND SOME SUGGESTIONS TO AVOID THEM… This is just the beginning! 
Next week I’ll continue this article with 4 more statistical bias types – that every data scientist and analyst should know about. And on the week after, I’ll give you some practical suggestions, how to overcome these! Stick with me and subscribe to my weekly Newsletters (no spam, just 100% useful data content)! And if you have any comments, let me know below! Cheers, Tomi * August 21, 2017 * In Analyze the Data * AB test analytics bias data data science learn data science metrics qualitative research research statistical bias types statistics tomi mester ← Previous post2 COMMENTS 1. MANUELPB August 22, 2017Good article. Waiting for reading second part. Manuel, from Spain Reply * TOMI MESTER August 22, 2017Thanks Manuel! Coming next week! 😉 Tomi Reply * 2. LEAVE A REPLY CANCEL REPLY Comment Name * Email * Website Get free data articles weekly: We use cookies to ensure that we give you the best experience on our website. Ok","Be aware of the different statistical bias types is inevitable, if you are about to learn data science and analytics. Here are the most important ones.",Statistical Bias Types explained (with examples),Live,273 806,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectHOW TO ANALYZE YOUR PIPE RUNS WITH BUNYANDavid Taieb / August 11, 2015INTRODUCTIONIn this post, I’ll discuss how our Simple Data Pipe sample app uses the Bunyan Node.js logging framework to capture detailed logging information about a pipe run. Then I’ll show youhow to analyze the report using the Bunyan viewer tool.If you’ve explored our Simple Data Pipe tutorial on Bluemix, you know thatmetadata about your pipe runs is stored in Cloudant as JSON. Cloudant’s supportfor binary attachments within JSON lets you attach the logs from Bunyan rightalongside their associated JSON document, which you can then access for furtheranalysis.A WORD ABOUT BUNYANBunyan is a simple and fast JSON logging library for Node.js services. It can beconfigured to output the data to streams that can be stored anywhere. SimpleData Pipe uses this library to capture log information about a particular run,then attach the report to the pipe run document stored in the pipe_db database in your Cloudant account.This logging framework supports many log levels: trace , debug , info , warn , and error . As you’ll see, Bunyan also provides a CLI utility to pretty-print its output,with the ability to filter by logging group and level. See https://github.com/trentm/node-bunyan for more information.HOW TO LOCATE THE LOG FOR A PARTICULAR RUNHere’s the scenario: You attempted a pipe run and something went wrong. You nowneed to locate the log, download it from the pipe_db Cloudant database, and analyze it for troubleshooting. 1. Go to Bluemix and click on your pipe app instance. 2. Click on the pipes-cloudant-service box to open the Cloudant dashboard. Pipes Cloudant Service Box 3. Click the Launch button. 4. In the Cloudant dashboard, click on the pipe_db database. 5. In the menu on the left, click _design/application , then Views , then all_runs . 6. Locate your last run.On the left-hand side of the all_runs view, you can see that the Map function logic that defines this view indexes pipe run documents sorted in chronological order. So, the run document you’re looking for is the last one in the view. (You may need to page through the results a few times if you have performed a lot of runs.) 7. Click on the pencil icon to open the run document. 
You should be able to see the JSON metadata for the run. 8. In the toolbar above the document, click the View Attachments dropdown button and right-click on run.log . Then click on the Save Link As… option. This will download the file to a directory of your choice. 9. The next step will be to use the Bunyan CLI tool to analyze run.log.ANALYZE RUN.LOGIf you have not already done so, install Bunyan on your local machine usingthese simple steps:npm install bunyan -gNote: If you are on a Linux-based system like Mac OS X, use sudo .You can view the entire log in pretty-printed format using this command: bunyan Note: To quit long output, use q .You can also filter the log to only view errors using the following command:bunyan -l errorNote: You can use any log level you want, e.g., error , info , warn , etc.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",How to use Bunyan to capture detailed logging of data migration runs through our Simple Data Pipes app.,How to analyze your pipe runs with Bunyan,Live,274 809,"DO MORE WITH COMPOSE POSTGRESQL USING ZAPIERShare on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Apr 21, 2016Zapier is a service which allows you to create custom integrations among avariety of applications, including PostgreSQL. Below we'll look at a couple ofexamples for how you can do more with Compose PostgreSQL by integrating it withother tools via Zapier.We first introduced our readers to Zapier when we showed how to Zapier your data to MongoDB and later we followed that up with an article about how to send alerts from our platform to your application of choice. Since then, Zapier has added PostgreSQL to the long list of integrations they offer.NOTIFY ME WHEN SOMETHING CHANGESOne of the most common business requirements with databases is to receivenotifications when something important has changed. DBAs may want to receive anotification when a new table is created or when a new column is added to atable. Business users may want to know when there's a new row added to afavorite table or when the data from a custom query changes.Among other things, at Compose we like to keep track of how many people aretrying out our service with our free 30-day trial so we've written a ""zap"" (a Zapier integration widget) to notify us in our main Slack channel when the number of trial customers changes. Here's how:Once you sign up for a Zapier account , you'll see a button to ""Make a Zap"":TRIGGERSThe first step in making a zap is setting the trigger. 
You'll be asked whichapplication you want to start from. For this scenario, that's PostgreSQL:Next comes the trigger type we want to use. For us, that's a custom query, butyou can see the other options we mentioned (table, column and row):In a previous article we explained how to setup a Segment warehouse using Compose PostgreSQL . We're going to run our custom query against our Segment warehouse where we'retracking trial events.If you haven't already created a connection in Zapier to your ComposePostgreSQL, you'll be asked to do so:If you already have a connection, you'll be asked if you want to use an existingone or create a new one.For this scenario, because we are using the custom query trigger, we're asked toprovide our query:After that, we're asked to test the query by fetching a row. Once we've run thetest, we can view the row, re-test, or choose to just continue:ACTIONSNow that we have our trigger setup from PostgreSQL, we'll set the resultingaction - the notification.For the action, we're asked what application to use. At Compose, we're usingSlack:We're going to send a channel message, but there are several other options tochoose from:Then, if you already have a Slack account setup in Zapier, you'll be asked ifyou want to use it or create a new one. For our example, we'll create a newaccount:The next step is to fill out the Slack template. As you can see we're sending amessage to our ""general"" channel. In the message text, we're using the trialscount from our query with some additional text:There are several other options in the template including bot, image, link, andmention settings.At the end, similar to the trigger section, we'll then get to test our Slackmessage action. We can also re-test or just finish.Finally, we'll give our ""zap"" a name (in this case we are using Effie, thefictional tributes' escort from The Hunger Games ) and we'll turn it ""on"":Hey! There's a Slack notification!Now that the ""zap"" is turned on, it will run every 5-15 minutes (depending onthe billing plan you select) and will only perform the action (called a ""task""in the billing plan) when there has been a change in the data. Zapier lets you try this out with a 14 day free trial so you can determine what billing plan best fits your situation.FILTERS AND INTERMEDIARY ACTIONSWhile we didn't use any filters or intermediary actions between our ""zap""trigger and action, you can add them to enhance the precision or functionalityof your own zaps. A filter might check if the data met a certain criteria. Forexample, we could apply a filter to check if our trials are greater than 300before moving on to our final action that sends a channel message in Slack. Anintermediary action might be posting the data to another application, such assetting the data as a metric in a dashboard tool like Leftronic , before moving on the the final action. In this way, you can hit multiple appsor take multiple steps in the same app, with your trigger data. That's prettypowerful stuff!Now, that we've seen how to create notifications from changes in our ComposePostgreSQL database, let's look at a couple other use cases.COPY THE DATA SOMEWHERE ELSEIn our example above where we're using a custom query to generate a data row orwhen a new row is added to our table, rather than sending a notification, we maywant to copy that data to another app. For example, we may want to copy thatdata to Google sheets for our marketing team to have easy access to it forcreating reports. 
We could also use this option in a polyglot persistencescenario where we need the same data in a different database. In that case, wecould copy data from our Compose PostgreSQL database to our Compose MongoDB orRethinkDB databases (or vice versa!), or even to your corporate SQL Serverinstance.The steps for this use case are similar to the ones we demonstrated above,though each application will have its own specifics, of course. The great thingwith Zapier is that it's built to be intuitive and to guide you along for eachintegration type. Since we've run through one case with you here and a coupleothers in our previous articles, we know you've already got the hang of how toget data from PostgreSQL to other apps.Now let's look at our final use case for this article... having anotherapplication trigger the insertion of a data row into PostgreSQL.ADD DATA FROM ANOTHER APPAt Compose, we use Help Scout for keeping in contact with our customers and helping them resolve supportissues. Let's say that we want to tally the customer support conversations fromHelp Scout so that we can tie them directly to our accounts database inPostgreSQL.So, we'll make a new ""zap"", choose Help Scout as the trigger application, andthen choose ""New conversation"" as the trigger:You'll be asked to setup the connection to Help Scout app via API key (which youcan generate in the settings for your Help Scout profile) if you don't have onecreated already.Next, we'll select the Help Scout mailbox we want and a status if that'sapplicable:We'll then test the request and continue on.Next, we'll move on to the action... Add data to PostgreSQL.We'll choose to add a new row for this example:Since our PostgreSQL connection already exists in Zapier, we'll choose to useit, though as we mentioned, you could add a new one if you need to.Next, we'll select the table in PostgreSQL and set how the fields from HelpScout map to the fields in our table. For this example, we just have a simpletable that will collect the timestamp at which the conversation was created andthe customer email:We then test our data row insertion into PostgreSQL and finish our ""zap"" bygiving it a name and turning it on.Now, what we can do is create a report from data in PostgreSQL that aggregates acustomer's conversations from Help Scout and joins that to the account recordthat already exists in our PostgreSQL database. Or we can create a query to tellus the most frequent days and times that conversations are created to make surewe have good coverage in support.This is just a simple example, but think of how powerful this use case can be.With Zapier, you can add data to your PostgreSQL database from otherapplications so that you can easily create reports and run analyses frommultiple sources in one convenient location - Compose PostgreSQL!WRAPPING UPZapier is a powerful tool that will help you get more from your ComposePostgreSQL database, either by using it to trigger data or notifications toother apps or by using it to generate new data rows in the database based ondata or events from other apps. Compose PostgreSQL, MongoDB and RethinkDB areall currently supported by Zapier as well as more than 500 other applicationsavailable for integration. If you don't already have a Compose account, signup to get started with PostgreSQL today.Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. 
Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Zapier is a service which allows you to create custom integrations among a variety of applications, including PostgreSQL. Below we'll look at a couple of examples for how you can do more with Compose PostgreSQL by integrating it with other tools via Zapier.",Do More with Compose PostgreSQL using Zapier,Live,275 818,,"Love to work in Microsoft Excel? Watch how to connect to IBM dashDB as the data source for Excel, and how to import tables into a spreadsheet. ",Integrate dashDB with Excel,Live,276 819,"Skip navigation Sign in SearchLoading... Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. WATCH QUEUE QUEUE Watch Queue Queue * Remove all * Disconnect The next video is starting stop 1. Loading... Watch Queue Queue __count__/__total__ Find out why CloseDATA SCIENCE EXPERIENCE: WORK WITH DATA CONNECTIONS developerWorks TVLoading... Unsubscribe from developerWorks TV? Cancel UnsubscribeWorking... Subscribe Subscribed Unsubscribe 17KLoading... Loading... Working... Add toWANT TO WATCH THIS AGAIN LATER? Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics * Add translations 2 views 1LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 2 0DON'T LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 1Loading... Loading... TRANSCRIPT The interactive transcript could not be loaded.Loading... Loading... Rating is available when the video has been rented. This feature is not available right now. Please try again later. Published on Oct 3, 2017Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning * CATEGORY * Science & Technology * LICENSE * Standard YouTube License Show more Show lessLoading... Autoplay When autoplay is enabled, a suggested video will automatically play next.UP NEXT * Tanmay Bakshi on building AskTanmay - Duration: 22:59. developerWorks TV 193,680 views 22:59 -------------------------------------------------------------------------------- * Big data and dangerous ideas | Daniel Hulme | TEDxUCL - Duration: 14:40. TEDx Talks 36,478 views 14:40 * Cabling a SoftLayer Data Center Server Rack - Duration: 4:09. IBM Bluemix 1,170,843 views 4:09 * Data Science Experience: Build SQL queries with Apache Spark - Duration: 3:29. developerWorks TV 2 views * New 3:29 * Tableau for Data Scientists - Duration: 35:23. Brent Tabl 138 views 35:23 * Data science and our magical mind: Scott Mongeau at TEDxRSM - Duration: 16:33. TEDx Talks 18,776 views 16:33 * JavaOne: Microservice hands-on - Duration: 5:22. developerWorks TV No views * New 5:22 * IBM Bluemix Data Connect - Self Service Data Preparation and Integration Demo - Duration: 16:11. carlo appugliese 1,705 views 16:11 * Data hacking - data science for entrepreneurs | Kevin Novak | TEDxWakeForestU - Duration: 17:11. TEDx Talks 18,901 views 17:11 * Data Science Hands on with Open source Tools - WHAT IS DATA SCIENTIST WORKBENCH? - Duration: 3:47. Cognitive Class 5,687 views 3:47 * Data Science Hands on with Open source Tools - Creating & Uploading Workflows - Duration: 4:42. 
Cognitive Class 2,069 views 4:42 * HURRICANE MARIA RECORD RAIN - FLOODING - Cosmic Ray Connection and the Grand Solar Minimum - Duration: 8:44. Oppenheimer Ranch Project 1,763 views 8:44 * My Journey to Data Scientist - Duration: 3:13. Story by Data 1,952 views 3:13 * Data Science Hands on with Open source Tools - What are Jupyter notebooks - Duration: 2:22. Cognitive Class 4,362 views 2:22 * IBM Watson Machine Learning: Build a Predictive Analytic Model - Duration: 4:06. developerWorks TV 21 views * New 4:06 * JavaOne: The excitement so far - Duration: 5:04. developerWorks TV 1 view * New 5:04 * IBM Big SQL: Analyze HDFS data with IBM Cognos Analytics - Duration: 6:54. developerWorks TV No views * New 6:54 * JavaOne: Optimize enterprise Java with Microprofile 1.2 - Duration: 5:49. developerWorks TV No views * New 5:49 * IBM Analytics Engine Overview - Duration: 7:21. developerWorks TV 7 views * New 7:21 * JavaOne: Meet a new Java face at developerWorks - Duration: 2:30. developerWorks TV 1 view * New 2:30 * Language: English * Content location: United States * Restricted Mode: Off History HelpLoading... Loading... Loading... * About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Test new features * Loading... Working... Sign in to add this to Watch LaterADD TO Loading playlists...",This video shows you how to set up connections to both Bluemix and external sources.,Work with Data Connections in DSX,Live,277 820,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * Jorge Castañón Blocked Unblock Follow Following applied mathematician and art lover | opinions are my own Jun 29, 2016 -------------------------------------------------------------------------------- DEEP LEARNING TRENDS AND AN EXAMPLE The Spark Summit 2016 took place on June 6–8 in San Francisco and it was a sold out event with more than 2,500 attendees. Not surprisingly, deep learning (DL) and artificial intelligence (AI) were the main dishes of the conference. On day one, most of the keynotes were on how DL and AI are making the world better. Don’t you think it’s amazing that you can teach a computer to distinguish between an image of a cat from an image of a dog? I do! And the mentioned example is nothing compared to others fantastic examples that were presented at the Summit. Based on google searches, starting around 2014, both terms Apache Spark and Deep Learning have had a dramatic increase. Jeff Dean, head of Google’s brain team, talked about how DL is used to verbally describe an image. Imagine a blind person using an app to understand an image without the help of other person! Jeff also talked about other use-cases where DL is useful like speech recognition and email smart reply, among others. Andrew Ng, chief scientist at Baidu and co-founder of Coursera , compared AI models with rockets: artificial neural networks to their engine and data to its fuel. At Baidu, DL and AI are being applied to train models for autonomous driving, fraud and malware detection, among other use-cases. Neural networks (NN) need more data than traditional algorithms, especially deep neural networks. The gain of NN algorithms trained with large amounts of data is in the quality of your predictions at a cost of more computational power (therefore the popularity of GPU’s used for training NN’s). Find the slide shown and more information about all the very interesting talks at the Spark Summit here . 
Hopefully I have convinced you that DL and AI is quiet something to look at. This is why I build a notebook on Data Science Experience to run a very well-known and simple DL example for classifying handwritten digits. Please check my notebook here . The best way to contact me for questions, feedback or just to say hi is @castanan . -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on June 29, 2016. * Artificial Intelligence * Data Science Experience * Dsx A single golf clap? Or a long standing ovation?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingJORGE CASTAÑÓN applied mathematician and art lover | opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","The Spark Summit 2016 took place on June 6–8 in San Francisco and it was a sold out event with more than 2,500 attendees. Not surprisingly, deep learning (DL) and artificial intelligence (AI) were…",Deep learning trends and an example,Live,278 825,"Compose The Compose logo Articles Sign in Free 30-day trialHOW TO TALK RAW REDIS Published Feb 22, 2017 redis Development How to talk raw RedisFind out how to talk to a Redis database with nothing more than echo and the netcat command and get a deeper understanding of why developers love Redis. Redis has, as we've shown in the past, many many drivers . One of the reasons for situation is that, by design, Redis has a very simple protocol, RESP, for communicating with the server. Building on that simple protocol has allowed people to create these many drivers and their various levels of abstraction or idiomatic appropriateness. But let's talk about getting down to basics here; sometimes resource constraints demand you create the smallest possible connection code. OUR FIRST COMMAND For this example, we're going to get some status information from the Redis server; the INFO command returns lots of useful information so we will use that. Now, to send strings with Redis's RESP protocol, you need to say how long the string is. That's done by preceding your string with a $ and the number of characters in the string so that's 4 characters. After the number and after the string should be the carriage return and newline, \r\n . Let's build our string to send: $4\r\nINFO\r\n Redis's RESP also wants to know how many strings are in a command. It has a ""bulk-strings"" indicator which is like the $ operator, except uses a * and the number following it is the number of strings in the command. For this command, that's... 1. So we need to precede the string with *1 . *1\r\n\$4\r\nINFO\r\n We can use the nc - net cat - command to send this string to our server. That server, for this example is at sl-eu-lon-2-portal.1.dblayer.com and on port 10030. If we echo our string and pipe it into nc we should get information: $ echo ""*1\r\n$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 -ERR Protocol error: expected '$', got ' ' $ The problem here is the old classic shell thing of unexpected expansion. The shell sees that $ and wants to expand it from an environment variable $4 which is, of course, blank. We need to escape that and try again. 
$ echo ""*1\r\n\$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 -NOAUTH Authentication required. $ TIME TO LOG ON Because who would put a completely open Redis server up on the internet. The Compose Redis servers start up with authentication on and a 16 character password set. If you are doing this against a server that you own and it worked, check now that that server isn't externally visible. Other people do not value your data. Anyway, back to getting authenicated. We need to construct another command, the AUTH command and follow it with our password. This time, there's two strings in the command, the AUTH command itself and the password, let's say it's FLIBBERTIGIBBETS for now. That gives us *2\r\n\$4\r\nAUTH\r\n\$16\r\nFLIBBERTIGIBBETS\r\n which we can put into the front of out command now: $ echo ""*2\r\n\$4\r\nAUTH\r\n\$16\r\nFLIBBERTIGIBBETS\r\n*1\r\n\$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 +OK $2306 # Server redis_version:3.2.6 redis_git_sha1:00000000 redis_git_dirty:0 redis_build_id:dafadbf6141a77d5 redis_mode:standalone os:Linux 3.19.0-39-generic x86_64 arch_bits:64 multiplexing_api:epoll gcc_version:4.8.4 process_id:36 run_id:275d6af5f6ccf9be4efde1dbcb8223483386cf42 tcp_port:6379 ... $ That goes on for a little while... There are 2,306 characters of information in total. We know that because that's the first thing Redis told us. In the same way, as we tell it how long strings are, it does the same back, so the second line, $2306 is telling us how many characters are coming back. That's the response to the INFO command. Immediately preceding that is +OK , the response to the AUTH command. The + is the signal that this is a simple non-binary safe string; a minimal OK. LET'S MAKE IT TIDY Anyway now we can pipe those results to any command we want for post-processing. Add a | tail -n +3 to chop off the Redis RESP responses and we have a clean output, just like entering INFO at the redis-cli command line. We're good people and good people don't leave passwords in shell scripts. Let's pop the password into an environment variable... export REDISAUTH=""FLIBBERTIGIBBETS"" And change the command to use that. This time we want that shell expansion to happen. echo ""*2\r\n\$4\r\nAUTH\r\n\$16\r\n$REDISAUTH\r\n*1\r\n\$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 | tail -n +3 Now the command is sharable without giving away your password. There's one last thing to do for this example. That INFO data is quite a lot to work through and we only really, say, want the STATS section. That's not a problem, we just need to add that to the INFO command, first by bumping the string count at the start of the command to 2 and then appending \$5\r\nSTATS\r\n to the end. $ echo ""*2\r\n\$4\r\nAUTH\r\n\$16\r\n$REDISAUTH\r\n*2\r\n\$4\r\nINFO\r\n\$5\r\nSTATS\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 | tail -n +3 # Stats total_connections_received:1700495 total_commands_processed:40375802 instantaneous_ops_per_sec:7 total_net_input_bytes:1984826397 total_net_output_bytes:2497942423646 instantaneous_input_kbps:0.41 instantaneous_output_kbps:1.10 rejected_connections:0 sync_full:5 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 evicted_keys:0 keyspace_hits:194 keyspace_misses:6 pubsub_channels:1 pubsub_patterns:0 latest_fork_usec:469 migrate_cached_sockets:0 $ Now we just have our statistics. And it's still one shell command. WHAT THIS GETS US You'll note that the Redis RESP protocol is remarkably simple. 
You can find out more about it on the RESP specification page . There are some other response types ( - for an error, which we saw when we got the NOAUTH message and : for integers) and some other rules to take note of but it's also pretty simple to code for. If you have a new language on your hands and no Redis driver, it is good to know that the protocol is so simple and readily implementable in even the most constrained of languages. As long as you can open a TCP socket to a port and read/write to it, you are good to go. This should also help explain the types of Redis drivers that are around. The minimalist drivers basically provide enough to make sending commands to and receiving data from Redis; the user of the driver sends the strings for the commands. Also, it's got its uses in the Internet of Things. In a future article, we'll be looking at that when we add Redis stats gathering (among other things) to a very resource-constrained device. Until then, have fun going low with Redis. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Patrick Hendry Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan ’s author page and keep reading.RELATED ARTICLES Feb 21, 2017WHY IT CONSULTING AND DEVELOPER SERVICES COMPANIES LOVE COMPOSE One of the great constants of software consulting is this: You need reliable, stable, and repeatable databases and database s… Arick Disilva Feb 17, 2017NEWSBITS: REDIS, ETCD AND ELASTICSEARCH UPDATES, GO 1.8, GITHUB GUIDES AND CHATOPS AND MORE NewsBits for the week ending 17th February - Redis gets a critical update, etcd's latest release, Elasticsearch gets a bump,… Dj Walker-Morgan Feb 10, 2017NEWSBITS: RETHINKDB LIVES, REDIS AND POSTGRESQL FUTURES, FOSDEM, RUST AND WUZZ NewsBits for the week ending 10th February - RethinkDB has a new home, Redis's future is being mapped out, PostgreSQL 10's fe… Dj Walker-Morgan Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company",Find out how to talk to a Redis database with nothing more than echo and the netcat command and get a deeper understanding of why developers love Redis.,How to talk raw Redis,Live,279 826,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * Connect FULL TEXT SEARCH FROM WITHIN APACHE COUCHDB™ Mike Broberg / October 20, 2015A few months back, IBM Cloudant open-sourced the repos that power our integration with the Apache Lucene™ text search engine library. See this excellent blog from Robert Newson, who outlines the projects Clouseau and Dreyfus and explains how they interact with Cloudant’s CouchDB-based system. Use Lucene with the current release of CouchDB y’all. 
The Lucene Search integration will become part of the forthcoming CouchDB 2 release, but if you can’t wait, our own Robert Kowalski published instructions on how to recompile the current 1.6.1 release of CouchDB to use the new search features . See his blog at https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/ for more. In addition to their work at IBM Cloudant, both Roberts are deeply involved in Apache CouchDB as members of its Project Management Committee. A big thank you to both for their work and for making CouchDB an awesome place to store JSON data :D © “Apache”, “CouchDB”, “Lucene”, “Apache CouchDB”, “Apache Lucene”, and the CouchDB and Lucene logos are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Learn how to connect the 1.6.1 release of CouchDB to Cloudant's recently open-sourced Lucene integration.,Using Lucene search from within CouchDB,Live,280 831,"Compose The Compose logo Articles Sign in Free 30-day trialANALYZING PET NAME TRENDS WITH POSTGRESQL'S CROSSTABVIEW Published Jul 13, 2017 postgresql crosstabviews Analyzing Pet Name Trends with PostgreSQL's crosstabviewPostgreSQL 9.6 comes with a number of updates and new features to explore. One very useful addition is the \crosstabview command, which gives you the power to rearrange how your data is viewed without the difficulty of writing complex SQL queries. Since the release of PostgreSQL 9.6.2 on Compose, we've been playing with some of the new additions to the database. One addition we found interesting and quite useful is the new psql meta-command \crosstabview , which was released with PostgreSQL 9.6. This command allows query results to be shown in a representation, similar to a spreadsheet pivot table, without needing to write complex SQL queries. Here, we'll look at how it works and show you some of the use cases where it might be beneficial to use. The dataset we'll use is the Current Pet Licenses for the City of Tacoma and Fircrest for 2017 , which is a CSV file that contains a list of 15,555 names of cats and dogs in the two cities. To follow along, download the dataset from the link and let's look at how the \crosstabview command works. IMPORTING THE DATASET AND QUERYING PET NAMES After downloading the dataset, we created a database pets and a table names , then we imported the CSV data. 
CREATE TABLE names ( name TEXT, animaltype TEXT, primarybreed TEXT, tpdsector INT, latlon TEXT, animalcount INT ); \COPY names (name, animaltype, primarybreed, tpdsector, latlon, animalcount) FROM '/Downloads/Current_Pet_License-City_of_Tacoma___Fircrest.csv' CSV HEADER; Now that the pet names have been inserted, let's look for the names that both cats and dogs share. A simple query using count and a GROUP BY clause will do the trick. SELECT name, animaltype, count(name) FROM names GROUP BY name, animaltype ORDER BY 1; A sample of the results of that query is below. As you can see, some names are shared between cats and dogs (e.g. ""ABBY""). However, since the names are divided between CAT and DOG , the names are grouped accordingly and we don't have one row dedicated to a single name. name | animaltype | count ----------------------+------------+------- 2P2 | DOG | 1 A BARKSDALE | DOG | 1 A509966 | CAT | 1 AARON | DOG | 1 AB | DOG | 1 AB ""ABBY"" | DOG | 1 ABBBY | CAT | 1 ABBEY | DOG | 5 ABBI | DOG | 2 ABBIE | DOG | 10 ABBIGAIL | CAT | 1 ABBOTT | CAT | 1 ABBY | DOG | 40 ABBY | CAT | 12 ... Trying to look at every row to find each cat and dog with identical names will be tedious, especially if the dataset is much larger than this. One way that we might overcome the problem is to design a new query that would put the count of cats and dogs in their own column. SELECT name, count(CASE WHEN animaltype='CAT' THEN 1 END) AS CAT, count(CASE WHEN animaltype='DOG' THEN 1 END) AS DOG FROM names GROUP BY name ORDER BY 1; which produces ... name | cat | dog ----------------------+-----+----- 2P2 | 0 | 1 A BARKSDALE | 0 | 1 A509966 | 1 | 0 AARON | 0 | 1 AB | 0 | 1 AB ""ABBY"" | 0 | 1 ABBBY | 1 | 0 ABBEY | 0 | 5 ABBI | 0 | 2 ABBIE | 0 | 10 ABBIGAIL | 1 | 0 ABBOTT | 1 | 0 ABBY | 12 | 40 ... But creating an entirely new query to reorganize our data might be overkill, especially if you only want to rearrange the columns. That's where PostgreSQL's \crosstabview will help. QUERY WITH \CROSSTABVIEW The first query we ran grouped together and counted all the cats and dogs having names with identical spellings and placed them into separate rows. \crosstabview can transform the data automatically by placing CAT and DOG in separate, horizontal columns, merging together the pet names in the vertical column, and using the count values to fill in the grid where cells are shared between the horizontal and vertical headers. All that's required for \crosstabview to work is that you have at least three columns that it can select data from. It does this by finding the distinct values within the query's results and uses them as horizontal and vertical headers. The data shared between the header values are then projected into the grid of cells. To see it in action, all you have to do is run \crosstabview after your SQL query. SELECT name, animaltype, count(name) FROM names GROUP BY name, animaltype ORDER BY 1 \crosstabview Once the \crosstabview command is executed, it sends the query input buffer to the server then shows the results of that query in a crosstab grid. That means crosstabview will only use the last query executed to populate the crosstab grid. If \crosstabview is appended to the SQL query like in the query above, don't use the semicolon after \crosstabview , otherwise, you'll get an error: Invalid command \crosstabview; . That's because \crosstabview works similarly to ; at the end of an SQL query. 
Alternatively, if you execute the query first with a semicolon ; then, afterward, execute \crosstabview , it will give you the same results because it uses the query buffer. When \crosstabview is executed in the above query, we'll get the following table with individual columns for DOG and CAT , which are the distinct values taken from the animalType column. As you can see, each name is grouped together like we want and the count column values are then used to fill in the table grid. We got a similar result using the second query we wrote above but using \crosstabview allowed users to use the query buffer and saved us from building and executing a new query that produces a similar result. name | DOG | CAT ----------------------+-----+----- 2P2 | 1 | A BARKSDALE | 1 | A509966 | | 1 AARON | 1 | AB | 1 | AB ""ABBY"" | 1 | ABBBY | | 1 ABBEY | 5 | ABBI | 2 | ABBIE | 10 | ABBIGAIL | | 1 ABBOTT | | 1 ABBY | 40 | 12 ... REARRANGING TABLES WITH \CROSSTABVIEW Behind the scenes, PostgreSQL's \crosstabview will determine how to set up your table. However, if you want to rearrange how your data is viewed, PostgreSQL gives you that option, too. If you want to tell \crosstabview to rearrange the table, for example, you may want to flip horizontal and vertical headers by placing DOG and CAT vertically and name horizontally, you can do that by specifying the vertical and horizontal headers, respectively, like: \crosstabview animaltype name This tells \crosstabview to place type as the vertical header and name as the horizontal header. You need to put a space between the column names. For this example, the count column will automatically be used as the data that fills in the grid. If we wanted to specify the count column as the data that \crosstabview will use, we'd place count as the third argument. \crosstabview animaltype name count However, this is not really necessary here since PostgreSQL will automatically deduce that count is the data shared by the values in the vertical and horizontal headers. There are some limitations if you decide to specify the order of the headers. For example, running the query above with the name column as the horizontal header will give you: \crosstabview: maximum number of columns (1600) exceeded This error occurs because we've put all our names in the horizontal header, which PostgreSQL has limited to 1600 columns. Therefore, we can run the query again with \crosstabview animaltype name , but limit the query to get the first ten results, which would return something like: animaltype | 2P2 | A BARKSDALE | A509966 | AARON | AB | AB ""ABBY"" | ABBBY | ABBEY | ABBI | ABBIE ------------+-----+-------------+---------+-------+----+-----------+-------+-------+------+------- DOG | 1 | 1 | | 1 | 1 | 1 | | 5 | 2 | 10 CAT | | | 1 | | | | 1 | | | TRENDING PET NAMES Looking beyond getting pet names and animal types, we could use \crosstabview to find out what breed of dogs, for instance, tend to have certain names and whether there is a correlation between animal breeds and pet names that pet owners prefer. To do that, we could construct a query that analyzes the breeds of DOG and the names associated with them. SELECT primarybreed, name, count(name) FROM names WHERE animaltype = 'DOG' GROUP BY primarybreed, name ORDER BY 3 DESC; This query will give us a list of dog breeds, the names of dogs associated with a breed, and the number of dogs that have a specific name that is a certain breed. 
primarybreed | name | count -----------------+----------------------+------- LABRADOR RETR | BELLA | 23 LABRADOR RETR | MAX | 20 LABRADOR RETR | SADIE | 15 LABRADOR RETR | CHARLIE | 14 LABRADOR RETR | DAISY | 14 LABRADOR RETR | MAGGIE | 13 LABRADOR RETR | RILEY | 13 CHIHUAHUA SH | BUDDY | 12 CHIHUAHUA SH | CHICO | 12 LABRADOR RETR | MOLLY | 12 CHIHUAHUA SH | BELLA | 12 LABRADOR RETR | BEAR | 10 LABRADOR RETR | LUCY | 10 GOLDEN RETR | CHARLIE | 9 LABRADOR RETR | BAILEY | 9 LABRADOR RETR | STELLA | 9 LABRADOR RETR | COCO | 8 GERM SHEPHERD | MAGGIE | 8 LABRADOR RETR | DUKE | 8 LABRADOR RETR | LUNA | 8 GERM SHEPHERD | MAX | 8 ... From the results, it seems that there are a lot of Labrador Retrievers named Bella, but we also have a high number of short hair Chihuahua's with the same name. Bella is not the only name that is shared between breeds, but looking at the entire list of all the occurrences of Bella, or any dog for that matter is not efficient. In fact, it's the same problem that we ran into in the first query where we have a repetition of names on separate rows, but this time it's because the names are listed with different breeds. The problem with this query is that if we decided to run \crosstabview , we'd exceed the number of columns allowed since the name column would be placed in the horizontal header. We could try to go around this by specifying that we want name in the vertical column and primarybreed in the horizontal column like \crosstabview name primarybreed , but we'd get a table that is extremely difficult to read. In order to overcome this, we might want to select the top 10 names of dogs and then use those names to see what breeds tend to have those names. To do that, we'll use the following query, which is a modified version of the first query we ran in the article that selects only the animaltype = 'DOG' and is ordered in descending order according to the animal name : SELECT name, animaltype, count(name) FROM names WHERE animaltype = 'DOG' GROUP BY name, animaltype ORDER BY 3 DESC LIMIT 10 \crosstabview This gives us the following table with the top ten dog names: name | DOG ---------+----- BELLA | 117 LUCY | 103 BUDDY | 102 MAX | 92 DAISY | 87 CHARLIE | 77 MOLLY | 77 SADIE | 64 JACK | 60 MAGGIE | 56 Now that we know the top ten dog names, we can create a second query that narrows down the search and selects the number of dogs with those top ten names and the breeds that they belong to. SELECT primarybreed, name, count(primarybreed) FROM names WHERE animaltype = 'DOG' AND name LIKE ANY('{BELLA,LUCY,BUDDY,MAX,DAISY,CHARLIE,MOLLY,SADIE,JACK,MAGGIE}') GROUP BY primarybreed, name ORDER BY 3 DESC; This will return a table that looks something like this: primarybreed | name | count -----------------+---------+------- LABRADOR RETR | BELLA | 23 LABRADOR RETR | MAX | 20 LABRADOR RETR | SADIE | 15 LABRADOR RETR | CHARLIE | 14 LABRADOR RETR | DAISY | 14 LABRADOR RETR | MAGGIE | 13 CHIHUAHUA SH | BELLA | 12 CHIHUAHUA SH | BUDDY | 12 ... 
Now, using \crosstabview the results will be arranged according to the name of the dogs in the horizontal column and the primarybreed in the vertical column like: primarybreed | BELLA | MAX | SADIE | CHARLIE | DAISY | MAGGIE | BUDDY | MOLLY | LUCY | JACK -----------------+-------+-----+-------+---------+-------+--------+-------+-------+------+------ LABRADOR RETR | 23 | 20 | 15 | 14 | 14 | 13 | 8 | 12 | 10 | 7 CHIHUAHUA SH | 12 | 4 | 1 | 2 | 6 | 5 | 12 | 1 | 6 | 7 GOLDEN RETR | 4 | 4 | 5 | 9 | 6 | 4 | 5 | 6 | 2 | 3 GERM SHEPHERD | 4 | 8 | 4 | 4 | 3 | 8 | 1 | 3 | 3 | 2 POMERANIAN | 4 | | 1 | | | | 3 | | 5 | 1 SHIH TZU | 5 | 4 | 2 | 3 | 3 | 1 | 5 | 4 | 2 | 2 PIT BULL | 4 | 5 | 4 | 2 | 5 | | 4 | 3 | 4 | 1 AUST SHEPHERD | 1 | 4 | 2 | 3 | 1 | 2 | 5 | 1 | 1 | DACHSHUND | 2 | 2 | 1 | 3 | 2 | 1 | 5 | 4 | 5 | 2 ... Using the first table to get the top ten dog names, we can already assume the order of the most popular dogs. However, the other question that we wanted to answer is whether there are particular breeds of dogs that have these top ten names. Instead of creating another query for this, we simply used \crosstabview to organize the name of dogs and the breeds in horizontal and vertical headers. The count was then dispersed throughout the grid forming what we have above. From the data that's presented, we can determine that not only is Bella the most popular name, but it's the most popular name for Labrador Retrievers. At the same time, it's a pretty popular name for Chihuahuas, too. The table also tells us the most popular breed of dog for among the top ten names are Labrador Retrievers overwhelmingly, which might conclude that the inhabitants of Fircrest and Tacoma like their so-called family dogs. Other interesting questions that might be answered with further data is whether pet owners prefer female over male dogs, and what names and breeds are preferred for males and females. According to the limited data presented here, it appears that female dogs are preferred over males just by looking at the top ten names. However, to make that claim we'd have to categorize the gender of all the pets according to their name, which may be easy to do with Pippy Long Stockings, Clarice, and Han Solo, but a little more difficult with Fluffy, Snickerdoodle, and Boo Boo. There is a lot more that we could conclude from these results, but \crosstabview has provided, nonetheless, a way to easily take rows with figures and get meaningful result that would otherwise appear jumbled across a number of rows that we'd have to sift through, or create more complex queries to get similar results. SUMMING UP The c\rosstabview command only works in the psql shell. It's not a command that you can use in your application; for that, you will have to write a query that will produce the table structure you need, or use the crosstab function, which is included in the tablefunc extension. This extension is easy to add in Compose PostgreSQL by selecting the extension from the Compose console. However, if you simply want another view of your data from within the psql shell, then \crosstabview is a fantastic alternative that will make your life easier when trying to disect complicated datasets and the best part is that it comes out of the box with PostgreSQL 9.6. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. 
attribution Ricardo Gomez Angel Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA. Love this article? Head over to Abdullah Alger ’s author page and keep reading.RELATED ARTICLES Jul 12, 2017INTEGRATION TESTING AGAINST REAL DATABASES Integration testing can be challenging, and adding a database to the mix makes it even more so. In this Write Stuff contribu… Guest Author Jul 7, 2017NEWSBITS: ELASTICSEARCH UPDATE ADDS IP RANGES AND MORE These are the NewsBits from Compose for the week ending 7th July: Elasticsearch and Kibana updated A release date for Redis 4… Dj Walker-Morgan Jul 3, 2017DATALAYER EXPOSED: JOSHUA DRAKE & POSTGRESQL: THE CENTER OF YOUR DATA UNIVERSE Start your Monday on a high note and catch up on videos from this year's DataLayer Conference. This week we're highlighting J… Thom Crowe Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","Let's explore the `\crosstabview` command, which gives you the power to rearrange how your data is viewed without the difficulty of writing complex SQL queries.",Analyzing Pet Name Trends with PostgreSQL's crosstabview,Live,281 835,"Enterprise Pricing Articles Sign in Free 30-Day TrialDRONE DEPLOY CONQUERS THE DATA LAYER Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jan 25, 2017Compose has quite a few unique customers. One of the more unique that we've visited with is DroneDeploy , a company that automates drone flight and lets users explore map data from within an app. Nick Pilkington, DroneDeploy's CTO, tells us that they are, ""taking the existing drone hardware and combining it with a very powerful piece of software to make that drone into a useful tool... something that's repeatable, something that's reliable, something that's safe, and something that provides a huge amount of value."" Pretty cool, huh? So, we visited with Nick to talk about their mapping, app and how they're using Compose. Check out the video to see how Drone Deploy conquered their data layer. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article? Head over to Thom Crowe’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. 
© 2017 Compose","We visited Nick Pilkington, DroneDeploy's CTO, to talk about their mapping, app and how they're using Compose.",Customer: Drone Deploy Conquers the Data Layer,Live,282 838,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register * Projects * Blogs * About * Contribute * OpenTech * Tutorials * Events * Videos Search Brunel Visualization | More Brunel Visualization posts < Previous / Next >TWELVE WAYS TO COLOR A MAP OF AFRICA USING BRUNEL Graham Wills / Follow @GrahamWills / November 23, 2015The two main new features of Brunel 0.8 are an enhanced UI for building and a thorough re-working of our code for mapping data to color. This post is going to talk about the latter — with a lot of examples! The data set we are using is from http://opengeocode.org . We took a subset of the countries and data columns ( CSV data ) for this exercise. These examples are using some prototype code for geographic maps that we are going to introduce into a later version of Brunel (probably v1.0, slated for January), but maps looks so nice, we wanted to use them for this article. Please do not depend on the currently functionality — consider this an “advance preview” and highly subject to change. Because there are a lot of maps, these are not live versions, but static images — click on them to open up a Brunel editor window where you can see it live and make changes. The Brunel language reference describes the improvements to the color command in detail. Here we just show examples! CATEGORICAL COLORS The above two images are created by the following Brunel: * map(‘africa’) x(name) color(language) label(iso) tooltip(#all) style(‘text-shadow:none}’) * map(‘africa’) x(name) color(language:[white, nominal]) label(iso) tooltip(#all) style(‘text-shadow:none}’) For all our examples, the only changes are the color statement, so from now on we’ll just refer to the color command. If you use a simple color command, as in the first example, Brunel chooses a suitable palette. In this case “language” is a categorical field, so it chooses a nominal palette. This is a palette of 19 colors chosen to be visually distinct. The second example specifies which colors we want in the output space. The first category in the “language” field is special, so we ask for a palette consisting of white, then all the usual colors from the nominal palette. Because we know the data well, we can hand-craft a color mapping here that reflects the language patterns better. I used color(language:[white, red, yellow, green, cyan, green, green, blue, blue, blue, blue, gray, gray, gray, gray, gray]) to use red for lists containing Arabic, green when they contain English, and blue when they contain French. I mixed the colors to show lists where the languages are mixed. The geographical similarities in languages can be seen pretty easily in the chart, but the colors are a bit bright. Which leads to the following adjustment … For areas and “large” shapes, Brunel automatically creates muted versions of colors, so names like “red” and “green” are less visually dominant and distracting. This can be altered by adding a “=” to the list of colors, which means “leave the colors unmuted”, or a series of asterisks, which means “mute them more”. Here are a couple of examples, using the same basic palette as the previous one If you have a smaller fixed number of categories in your field, you can use palettes carefully designed to work well for that number. 
Rather than provide them in Brunel, our suggestion is to go directly to a site that allows you to select them (Cynthia Brewer’s site ColorBrewer is the standout recommendation) and copy the array of color codes and paste them directly into the Brunel code. For the example on the right, we did exactly that, using en:[‘#beaed4′, ‘#7fc97f’]) as our colors (the quotes are optional in this list). COLOR RANGES For numeric data, we want to map the data values to a smoothly changing range of values. So, instead of defining individual values, we define values which are intermediate points on a smoothly changing scale of colors. We do this using the same syntax pattern as for categorical data. We are using the latitude of the capital city to color by, rather than a more informative variables, so the color changes can be seen more clearly. On the left we specified color as color(capital_lat) so we get Brunel’s default blue-red sequential scale. This uses a variety of hues, again taken from ColorBrewer, to provide points along a linear scale of color. On the right we use an explicit color mapping from ColorBrewer, color(capital_lat:[‘#8c510a’, ‘#bf812d’, ‘#dfc27d’, ‘#f6e8c3′, ‘#f5f5f5′, ‘#c7eae5′, ‘#80cdc1′, ‘#35978f’, ‘#01665e’]) , where we simply went to the site, found a scale we liked and used the export>Javascript method. Note that Brunel will adapt to to the number of colors in the palette automatically. The above two charts show the difference between asking for color(capital_lat:reds) and color(capital_lat:red) . When a plural is used, it gives a palette that uses multiple hues, with the general tone of the color being requested. With a singular color request, you only gets shades of that exact hue . Generally we would recommend the former unless you have some specific reason to need the single-hue version. We can specify multiple colors in the same way as we do for categorical data, using capital_lat:[purpleblues, reds]) on the left and capital_lat:[blue, red]) on the right. When we have exactly two colors defined, we stitch them together, running through a neutral central color, to make a diverging color scale that highlights the low and high values of the field. SUMMARY Mapping data to color is a tricky business, and in version 0.8 of Brunel our goal is twofold: * Ensure that if you only specify a field, a suitable mapping is generated * Allow the output space of colors to be customized for user needs In future versions of Brunel we will add mapping for the input space, so, for example, we could tie the value mapped to white in the last example to be the equator, not simply midway through the data range. Look for that in a few months! * Click to share on Twitter (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Google+ (Opens in new window) * Tagged: brunel / brunelvis / color / d3 / dashboard / datavis / geo / infovis / mapping / maps / open source / perception / vis / visualizationLEAVE A COMMENT Click here to cancel reply. Tell us who you are Name (required) Email (required) Comment text Notify me of follow-up comments by email. Notify me of new posts by email. RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM",Brunel Visualization now has thoroughly re-worked code to provide improved options for mapping data to color. 
These maps of Africa show the results.,Twelve ways to color a map of Africa using Brunel,Live,283 839,"* Home * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ SPARK.TC ☰ * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ MACHINE LEARNING MACHINE LEARNING IN APACHE SPARK 2.0: UNDER THE HOOD AND OVER THE RAINBOW Now that the dust has settled on Apache Spark™ 2.0 , the community has a chance to catch its collective breath and reflect a little on what was achieved for the largest and most complex release in the project's history. One of the main goals of the machine learning team here at the Spark Technology Center is to continue to evolve Apache Spark as the foundation for end-to-end, continuous, intelligent enterprise applications. With that in mind, we'll briefly mention some of the major new features in the 2.0 release in Spark's machine-learning library, MLlib, as well as a few important changes beneath the surface. Finally, we'll cast our minds forward to what may lie ahead for version 2.1 and beyond. For MLlib, there were a few major highlights in Spark 2.0: * The older RDD-based API in the mllib package is now in maintenance mode, and the newer DataFrame-based API (in the ml package), with its support for DataFrames and machine learning pipelines, has become the focus of future development for machine learning in Spark * Full support for saving and loading pipelines in Spark's native format, across languages (with the exception of cross-validators in Python) * Additional algorithm support for Python and R While these have already been well covered elsewhere, the STC team has worked hard to help make these initiatives a reality — congratulations! Another key focus of the team has been feature parity — both between mllib and ml , and between the Python and Scala APIs. In the 2.0 release, we're proud to have contributed significantly to both areas, in particular reaching close to full parity for PySpark in ml . UNDER THE HOOD Despite the understandable attention paid to major features in such a large release, what happens under the hood in terms of bug fixes and performance improvements can be equally important (if not more so!). While the team has again been involved across the board in this area, here we'd like to highlight just one example of a small (but subtle) issue that has dramatic implications for performance. WE NEED TO WORK ON OUR COMMUNICATION... Linear models, such as logistic regression, are the work-horses of machine learning. They're especially useful for very large datasets, such as those found in online advertising and other web-scale predictive tasks, because they are relatively less complex than, say, deep learning, and so are easier to train and more scalable. As such, they are among the most-used algorithms around, and were among the earliest algorithms added to Spark ml . In distributed machine learning, the bottleneck for scaling large models (that is, where there are a large number of unique variables in the model) is often not computing power, as one might think, but communication across the network. This is because these algorithms are iterative in nature, and tend to send a lot of data back and forth between nodes in a cluster in each iteration. Therefore, it pays to be as communication-efficient as possible when constructing such an algorithm. 
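For readers who have not yet used the DataFrame-based ml API described above, here is a minimal, hypothetical sketch of training a logistic regression inside a spark.ml Pipeline and saving the fitted model in Spark's native format. The toy data, column names and save path are assumptions for illustration only, not anything taken from the benchmark discussed next.

# Minimal sketch of the DataFrame-based spark.ml API discussed above.
# Column names ("f1", "f2", "label") and the save path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

# A tiny toy DataFrame standing in for real training data.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.5), (1.0, 3.4, 1.8), (0.0, 0.7, 0.1), (1.0, 2.9, 2.2)],
    ["label", "f1", "f2"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=20, featuresCol="features", labelCol="label")

# Fitted pipelines can be saved and reloaded in Spark's native format,
# which is the cross-language persistence highlighted earlier.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("/tmp/lr-pipeline-model")

model.transform(df).select("label", "prediction").show()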
While working on adding multi-class logistic regression to Spark ML (part of the ongoing push towards parity between ml and mllib ), STC team member Seth Hendrickson realized that, due to the way that Spark automatically serializes data when inter-node communication is required (e.g. during a reduce or aggregation operation), the aggregation step of the logistic regression training algorithm resulted in 3x more data being communicated than necessary. This is illustrated in the chart below, where we compare the amount of shuffle data per iteration as the feature dimension increases. Once fixed , this resulted in a decrease in per-iteration time of over 11% (shown in the chart below), as well as a decrease in overall execution time of over 20%, mostly due to lower shuffle read time and less data being broadcast at each iteration. We would expect the performance difference to be even larger as data and cluster size increases 1 . Subsequently, various Spark community members rapidly addressed the same issue in linear regression and AFT survival regression (these patches will be released as part of version 2.1). So there you have it - Spark 2.0 even improves your communication skills! OVER THE RAINBOW What does it mean when we refer to Apache Spark as the ""foundation for end-to-end, continuous, intelligent enterprise applications""? In the context of Spark's machine learning pipelines, we believe this means usability, scalability, streaming support, and closing the loop between data, training and deployment to enable automated, intelligent workflows - in short the ""pot of gold"" at the end of the rainbow! In line with this vision, the focus areas for the team for Spark 2.1 and beyond include: * Achieving full feature parity between mllib and ml * Integrating Spark ML pipelines with the new structured streaming API to support continuous machine-learning applications * Exploring additional model export capabilities including standardized approaches such as PMML * Improving the usability and scalability of the pipeline APIs, for example in areas such as cross-validation and efficiency for datasets with many columns We'd love to hear your feedback on these areas of interest — email me at NickP@za.ibm.com, and we look forward to working with the Spark community to help drive these initiatives forward. -------------------------------------------------------------------------------- 1. Tests were run on a relatively small cluster with 4 worker nodes (each with 48 cores, 100GB memory). Input data ranged from 6GB to 200GB, with 48 partitions, and was sized to fit in cluster memory at the maximum feature size. The quoted performance improvement figures are for the maximum feature size. ↩ SHARE ON * * Share NICK PENTREATH DATE 30 August 2016TAGS machine learning, spark performanceSPARK TECHNOLOGY CENTER * Community * Projects * Blog * About The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided on this website, which is managed by IBM. Apache®, Apache Spark™, and Spark™ are trademarks of the Apache Software Foundation in the United States and/or other countries.","Now that the dust has settled on Apache Spark 2.0, the community has a chance to catch its collective breath and reflect a little on what was achieved for the largest and most complex release in the project's history.",Apache Spark 2.0: Machine Learning. 
Under the Hood and Over the Rainbow.,Live,284 844,"Compose The Compose logo Articles Sign in Free 30-day trialMETRICS MAVEN: CROSSTAB REVISITED - PIVOTING WISELY IN POSTGRESQL Published Apr 4, 2017 metrics maven postgresql Metrics Maven: Crosstab Revisited - Pivoting Wisely in PostgreSQLIn our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll take another look at crosstab to help you pivot wisely. In this article, we'll look again at the crosstab function, focusing this time on the option that does not use category sql. We'll explain how and when (not) to use it. We'll also compare it to the option that does use category sql, which we covered in our previous article on pivot tables using crosstab . You can also find some discussion of both options in the official Postgres documentation for tablefunc . To use crosstab with Compose PostgreSQL, refer to the previous article for how to enable tablefunc for your deployment. PIVOTING YOUR DATA Pivoting your data can sometimes simplify how data is presented, making it more understandable. PostgreSQL provides the crosstab function to help you do that. The simplest option for crosstab , which we'll focus on in this article, is referred to as crosstab(text sql) in the documentation. We're going to call it the ""basic option"" in this article. It differs from the crosstab(text source_sql, text category_sql) option in a couple of significant ways, which we'll cover a little later in this article. If you want to learn how the crosstab(text source_sql, text category_sql) option works before diving into the basic option we're going to look at here, check out our article Creating Pivot Tables in PostgreSQL Using Crosstab . OUR DATA As we did in the previous article on crosstab , we'll use the product catalog from our hypothetical pet supply company. id | product | category | product_line | price | number_in_stock --------------------------------------------------------------------------- 1 | leash | dog wear | Bowser | 15.99 | 48 2 | collar | dog wear | Bowser | 10.99 | 76 3 | name tag | dog wear | Bowser | 5.99 | 204 4 | jacket | dog wear | Bowser | 24.99 | 12 5 | ball | dog toys | Bowser | 6.99 | 27 6 | plushy | dog toys | Bowser | 8.99 | 30 7 | rubber bone | dog toys | Bowser | 4.99 | 52 8 | rubber bone | dog toys | Tippy | 4.99 | 38 9 | plushy | dog toys | Tippy | 6.99 | 16 10 | ball | dog toys | Tippy | 2.99 | 47 11 | leash | dog wear | Tippy | 12.99 | 34 12 | collar | dog wear | Tippy | 6.99 | 88 13 | name tag | dog wear | Tippy | 5.99 | 165 14 | jacket | dog wear | Tippy | 20.99 | 50 15 | rope chew | dog toys | Bowser | 7.99 | 27 We've got one additional item in the catalog than we had last time - a rope chew toy in the Bowser line. As tends to be the case in a relational database, the data in our table extends downward, repeating values for product_line, category, and product in different combinations for each price and inventory value. We want to create a pivot table to get a simpler view of our catalog. Let's get started. AGGREGATING A VALUE Let's start by getting the average price of each product category for each of the product lines. This was the same example we used in our previous article, but this time we'll use the basic crosstab option which does not use category sql. 
Here's what that looks like: -- using the basic option SELECT * FROM crosstab( 'select distinct product_line, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by 1,2') AS catalog(product_line character varying, dog_toys numeric, dog_wear numeric ) ; Let's look at the sub-query first. The first thing to notice is that the sub-query is encapsulated in single quotes. The query is passed to the crosstab function as a string that it will run. Next, we're using round with the avg function to get the average price rounded to two decimal places for each product line and category combination. If you need a refresher on either of these functions, we covered rounding in our Make Data Pretty article and avg in our article on mean . To get the average aggregate value, we're using group by with the other two columns: product line and category. Finally, we're ordering our results first by product line then by category. The ordering is important because in the outer query, we have to explicitly name the columns we want to see and need to know what order the data will be populated into them. The outer query calls the crosstab function on the results from the sub-query and then specifies the column names and data types for presenting that pivoted data. In effect, this creates a new table that is presented as the result of the query. Here's what the result looks like: product_line | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------- Bowser | 7.24 | 14.49 Tippy | 4.99 | 11.74 If you compare this result to the result we got in the previous article , which used the category sql option for crosstab , you'll find they are exactly the same. The only difference here is that the Bowser line of dog toys has increased slightly since then due to the addition of the new rope chew toy. If that's the case, then you may be wondering what the difference is between the two crosstab options... Let's look into that. COMPARING CROSSTAB OPTIONS Before we look at the key differences between the two optons, let's cover a couple caveats that apply to both options. SIMILAR CAVEATS As we mentioned above and in the previous article, both options require you to indicate an explicit order for the resulting columns. If you don't order the data, you will have a hodge-podge in your pivoted columns. PostgreSQL has no way of being ""smart"" here. It does not know how your pivoted columns map to the data you're querying on. You have to know that and, to do that, you need to order the data. The next probably goes without saying, but let's just go ahead and be extra clear here. The resultant pivoted rows must have only one value for each row. If there can be multiple values, then PostgreSQL will return you one from the list. For example, if we did not average the price in the query above (which aggregates the price to a single value), but instead simply requested the price column, we could get any one of the prices associated with each product category and product line. The point of pivoting the data is to present a single value for each possible combination of attributes. The pivoted columns' data types must match the data types expected from the source data. For example, we would get an error if we had our pivoted column ""avg_price"" specified as an int instead of numeric . The result of the avg function on our price values will not produce an int . If we wanted the pivoted column to be an int , we'd need to cast the value accordingly in the sub-query. 
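If you would rather run that pivot from application code than from psql, a minimal sketch with Python and psycopg2 follows. The connection details are placeholders, and it assumes the tablefunc extension is already enabled on the deployment as described earlier.

# Hypothetical sketch: running the basic crosstab query above from Python.
# Connection parameters are placeholders; tablefunc is assumed to be enabled.
import psycopg2

conn = psycopg2.connect(host="host", port=5432, user="user",
                        password="password", dbname="petsupplies")
cur = conn.cursor()

cur.execute("""
    SELECT * FROM crosstab(
      'select distinct product_line, category, round(avg(price),2) as avg_price
       from catalog
       group by product_line, category
       order by 1,2'
    ) AS catalog(product_line character varying,
                 dog_toys_avg_price numeric,
                 dog_wear_avg_price numeric);
""")

for product_line, dog_toys_avg, dog_wear_avg in cur.fetchall():
    print(product_line, dog_toys_avg, dog_wear_avg)

cur.close()
conn.close()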
Now the differences... BIG DIFFERENCES The reason our previous article used the category sql option of crosstab is that it is more flexible than the basic option we covered here. We recommend using the category sql option over the basic option. Here's why: The category sql option allows you to include ""extra columns"" in your pivot table result. The extra columns are not used for pivoting. The common use for these columns is to provide additional descriptors of the data in each row. You can have as many extra columns as you want; however, there can only be one extra column value for each. As mentioned above in the caveat section, multiple possible values will result in any one of the values being displayed. Here's an example to make this easier to understand: -- using category sql option SELECT * FROM crosstab ( 'select distinct product_line, case when product_line = 'Bowser' then 'Fashion and fun for big dogs.' when product_line = 'Tippy' then 'Small dog fashion and fun.' else null end as description, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by product_line', 'select distinct category from catalog order by 1' ) AS ( product_line character varying, description text, dog_toys_avg_price numeric, dog_wear_avg_price numeric ) ; In this case, we've added an ""extra column"" called ""description"". For this example, we've provided the values manually in a case statement, but another column from the table could also be used if there was a column that contained the additional descriptive data. Note the escaped single quotes (leaving us with two single quotes around each text value) since the sub-query for crosstab needs to be encapsulated in single quotes. We'll get a result like this: product_line | description | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------------------------------------------ Bowser | Fashion and fun for big dogs. | 7.24 | 14.49 Tippy | Small dog fashion and fun. | 4.99 | 11.74 If you try to add an extra column using the basic crosstab option, you'll get this error: ""The provided SQL must return 3 columns: rowid, category, and values."" No extra columns allowed. The next difference is the more compelling one to use the category sql crosstab option: it places data in the correct columns when one of the rows is missing a particular value for the specified attribute. Remember our new dog toy, the rope chew? The Tippy line does not have that toy. If we wanted to pivot by toy products instead of by product categories, we would only be able to get an accurate result using the category sql option of crosstab . 
Check it out: -- using category sql option SELECT * FROM crosstab( 'select distinct product_line, category, product, price from catalog where category = ''dog toys'' order by 1,2', 'select distinct product from catalog where category = ''dog toys'' order by 1' ) AS ( product_line character varying, category character varying, ball_price numeric, plushy_price numeric, rope_chew_price numeric, rubber_bone_price numeric ) ; We'll get the result we expect (a null value for the rope chew toy on the Tippy product line row): product_line | category | ball_price | plushy_price | rope_chew_price | rubber_bone_price ------------------------------------------------------------------------------- Bowser | dog toys | 6.99 | 8.99 | 7.99 | 4.99 Tippy | dog toys | 2.99 | 6.99 | | 4.99 Notice in the query above that we did not need to use an aggregation for the price because there is one price per product per product line. We also added the product category as an ""extra column"" since our pivoted rows were limited to only the category for dog toys - just an additional example of using extra columns for you to ""chew"" on (pun intended). If we use the basic option of crosstab to present dog toy prices per product line, not only can we not use any extra columns as we learned above, but worse, we'll get a bad result... Here's the SQL: -- using the basic option SELECT * FROM crosstab( 'select distinct product_line, product, price from catalog where category = ''dog toys'' order by 1,2') AS catalog(product_line text, ball_price numeric, plushy_price numeric, rope_chew_price numeric, rubber_bone_price numeric ) ; And here's the result: product_line | ball_price | plushy_price | rope_chew_price | rubber_bone_price ------------------------------------------------------------------------------- Bowser | 6.99 | 8.99 | 7.99 | 4.99 Tippy | 2.99 | 6.99 | 4.99 | WHAT?! The rubber bone price for the Tippy line shifted over to populate the rope chew column! That's because, without the category sql, the basic option does not know how many columns to expect and simply populates the data top-to-bottom, left-to-right until there are no more values. So, you can only use the basic option if your data values have exactly the same number and type. That's a pretty big limiter in our book. WRAPPING UP Hopefully you now have a much more thorough understanding of crosstab in PostgreSQL, including the differences between the two options that are presented in the documentation. You are now armed with the knowledge that will help you pivot wisely. Image by: herbert2512 Lisa Smith - keepin' it simple. Love this article? 
Head over to Lisa Smith’s author page and keep reading. © 2017 Compose, an IBM Company","we'll look again at the crosstab function, focusing this time on the option that does not use category sql. We'll explain how and when (not) to use it. We'll also compare it to the option that does use category sql ...",Metrics Maven: Crosstab Revisited - Pivoting Wisely in PostgreSQL,Live,285 847,"USING CLOUDANT TO ENHANCE UPLOADS FOR IBM GRAPH Prachi Shirish Khadke / 12/8/16 Backend Developer for IBM Graph and a Ballroom Junkie! Hi. I am Prachi, a backend developer for IBM Graph, a fully-managed, enterprise-grade graph database service built on the cloud. Our development team works via a continuous delivery pipeline to regularly add new features, enhance existing ones and deliver bug fixes. Several weeks ago, I was working on the backend code to improve the graph upload experience, adding REST API methods for asynchronous graph uploads. When the service receives an asynchronous graph upload request, it notifies the user that the request has been accepted and generates an upload Id. The upload Id can be used to query the status of the upload via the service’s REST API, as in the following commands. This setup provides a nice user experience, since users are not blocked by a wait time that depends on how big the upload is or by slowness in the service.
# Session auth curl -X GET -H 'Content-Type:application/json' -u 'cffb672f-fe5e-4810-a5da-a6ce182014e2:2eafd208-841d-4afd-aa35-6bdb2214d84b' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/_session {""gds-token"":""Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=""} # Asynchronous graph upload curl -X POST -H 'Content-Type:multipart/form-data' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' -F 'graphml=@./air-routes-small.graphml' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/graphml {""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""operation"":""bulkload"",""status"":""ACCEPTED"",""code"":202} # Graph upload status using uploadId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/502f4f57-f60c-4e92-ae9a-63eca980817a/status {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} Part of this effort required storing state in a Cloudant database. Initially, I added three indexes to query upload status in different ways – using the Service Id, the Graph Id and the Upload Id. 
These queries looked like this: # Graph upload status using uploadId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/502f4f57-f60c-4e92-ae9a-63eca980817a/status�/pre� {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} # Graph upload status using graphId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/status {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} # Graph upload status using serviceId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/uploads/status {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} At first, I used Cloudant map-reduce views for index creation, but code review feedback recommended Cloudant Queries instead. This meant rewriting a lot of code, which was painful to contemplate when the existing logic already worked. On the plus side, we’d gain a performance improvement. So I rewrote index creation using Cloudant Queries. But it was still slow. The problem was that I had created only one Cloudant design document, sequentially creating the indexes, to keep things organized properly. A colleague suggested that separate design documents may help. At first, that approach seemed unorganized and sloppy, until I realized: backend development is like general surgery. As Dr. Richard Webber said in Grey’s Anatomy: I don’t need pretty. And I don’t need perfect. What I need is for this to work. 
And what’s gonna make it works is for me to take out that tumor and put these healthy organs inside my very sick patient. It won’t be pretty, but it will work, and it will keep my patient alive. In engineering school, they teach us the importance of performance and agility. This real-world example shows how prioritizing engineering concerns over organization and prettiness is smart and effective. I ended up invoking 3 index creation requests in parallel, which was so fast! It’s learning moments like this that just make me smile. The fact that Cloudant Query is a REST API – stateless, predictable and easy to use, just added to my joy. :)",How I solved a graph development issue with parallel Cloudant Index creation requests.,Using Cloudant to enhance uploads for IBM Graph,Live,286 848,"Enterprise Pricing Articles Sign in Free 30-Day TrialCOMPOSE FOR MYSQL - A DEVELOPER'S VIEW Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jan 26, 2017In this interview with Chris Winslett, Compose developer and lead on the Compose for MySQL, we talk about why MySQL is on the Compose platform, what makes it different on Compose and how the Compose for MySQL beta is going. Q: So why Compose for MySQL? Chris Winslett: We already had nine databases and we already had a SQL database. The question is though is ""Why MySQL?"" and the answer is that it's simple yet powerful. You can get the relational database model without the high administration cost you see with other SQL databases. You get all the SQL capabilities, SELECT statements with GROUPing and JOINing and so on. It also can query across databases and concatenate the results together with UNION ALL. With other databases, like Postgres, you have to make choices which tend to increase complexity; MySQL has this feature out of the box. There's also a large ecosystem of libraries - every programming language can connect to MySQL. PHP, Python, C, all of then have extensive libraries and a lot of these have been around so long that there are two or more versions. PHP had an old model and a new model. Ruby has a MySQL library, and also a later MySQL2 library. They've all gone through these iterations reacting to the needs of the database systems leading to some very mature experiences when running on top of MySQL. That means that it's easy for a new user to spin up a database and get working with it. A large ecosystem of tools is another reason. Some common tools include Wordpress, Drupal, SugarCRM and other open-source CRMs, along with GUIs for creating queries and reports. The size of the MySQL environment is compelling. It's been the largest database since the late 90s when web databases and the open-source movement began growing. Which leads to the last reason - Customers were asking for it and wanted it. They had other databases on Compose and enjoyed the autoscaling and the automatic backups and they wanted a MySQL which was easy to deploy and highly available like all our databases are. Q: So what do users get when they deploy Compose for MySQL ? CW: We start with availability on AWS, Softlayer and Google Cloud; you can deploy on all those platforms. Then there's the Compose process for what we expect from databases. High availability, automated disaster recovery backups, failover support and simple routing all delivered from a private VLAN and manageable from the web. Q: How do you create a MySQL database on Compose? 
Same as any other Compose database: sign up - we have free thirty day trials, and get to the Compose web front end then click on the Create Deployments button. You'll see all the databases we do at Compose there. Browse down to the Beta section, and you'll find Compose for MySQL in there. Click it, enter a name for your database, pick where you want it, pick a size - remembering we have auto-scaling - and click Create . Your Compose for MySQL database will be with you shortly; it takes about two or three minutes. Q: What does the Beta mean? CW: A Beta database is a database that Compose has just begun offering. We've been offering MySQL since late October and during this beta period we monitor the database. With MySQL, what we are doing is watching the metrics, monitoring the uptime, seeing how we can improve the uptime, how we can improve self-healing tasks and seeing what kind of questions customers have about MySQL. We fully expect this to be a production-grade database and we have high expectations during this beta. However, we want customers to know it's a new database on Compose so it may not best fit some use cases. That's where we gather data in the beta. Q: So what MySQL are you running? CW: We're running MySQL 5.7.17 currently with Group Replication. We don't modify MySQL in anyway, so you can use all your standard MySQL drivers and tools with it. The one caveat is that because we use Group Replication to run the MySQL cluster, all the tables in the database require primary keys. A primary key is a unique identifier for a row; it can be an integer, UUID or string. It just has to be unique for the clustering. Q: Where would you not have unique ids? CW: One example would be a join table, where you are creating a table which joins users records and group records together. The table created to represent that join would typically not be designed to have a unique id. So what you need to do is alter the table, add an id column and make that id column an auto-incrementing integer. Q: Why do you need to do this? CW: The unique id caveat lets us run multiple nodes with replication and high availability. Having a unique id means it's easier for replication to see what's new and what has changed and keep things consistent. That means we can replicate data over three nodes. Q: Why three nodes? CW: Three replicated nodes allow us to take a node offline without bringing the database down. That means that we can do zero-downtime maintenance. If you've run databases before, you'll know the number one reason for a database outage is not because a host has gone down, but because you need to do maintenance on that host; update the kernel, update how the system is tuned or reset some parameters. Maintenance is the number one reason for database downtime. We also get zero-downtime backups. MySQL backups are best if you can shut down the database on a node, so what we do is shut down a data node, do the backups and bring that node back up. That gives us the best, most consistent backups. Finally, we get failover during a server outage. While the number one reason for an outage is maintenance, the number one reason for an unplanned outage is server failure. Three nodes give us a lot of advantages during these unplanned outages. That's why we were ok with the requirement to have primary keys on tables. The tradeoff for high availability is something we think – and we expect customers will think – is worth it. Q: So, how do you pick which node to connect to? CW: We look to make it as simple as possible. 
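As an aside on the join-table point above: a minimal sketch of adding the required auto-incrementing primary key from Python with PyMySQL might look like the following, where the table name, new column and connection details are illustrative assumptions. The same ALTER TABLE statement can of course be run from any MySQL client.

# Hypothetical sketch: adding an auto-incrementing primary key to a join table
# so it satisfies the Group Replication requirement described above.
# Table name, column name and connection details are illustrative assumptions.
import pymysql

conn = pymysql.connect(host="host", port=3306, user="user",
                       password="password", db="appdb")
try:
    with conn.cursor() as cur:
        cur.execute("""
            ALTER TABLE users_groups
            ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
        """)
    conn.commit()
finally:
    conn.close()

Back to how connections are routed once the cluster is in place: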
Customers applications connect to a haproxy and that haproxy talks to the master data node. We try and take a lot of the magic out of the process of connecting. The haproxy knows which node is currently the master data node. Q: How do you know what's in your cluster? CW: Look at the Topology in the Compose console overview. What you can see there is the result of health checks being run on the cluster. You can see the clusters own private infrastructure with the three data nodes on them and you can see the proxy which is routing to the master among the data nodes. You don't need to know that, though, all you need is the to know is the address of the proxy. Q: Do you have any advice for someone bringing an application to Compose for MySQL and the cloud? CW: Remember to create your cloud database as close as possible, network-wise, to your application as possible. Q: You mention how Compose runs beta databases; Any insights from the MySQL Beta so far? CW: We'll be blogging about the Compose for MySQL beta and doing some deep dives into how group replication works and how we recover from failure. Look out for them appearing soon. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by Maxime Daquet Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2017 Compose","In this interview with Chris Winslett, Compose developer, we talk about why MySQL is on the Compose platform, what makes it different on Compose and how the Compose for MySQL beta is going.",Compose for MySQL - A developer's view,Live,287 850,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (January 31, 2017) * This Week in Data Science (January 24, 2017) * This Week in Data Science (January 17, 2017) * This Week in Data Science (January 10, 2017) * This Week in Data Science (December 27, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (JANUARY 31, 2017) Posted on January 31, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * How will we cope with the AI Chatbot takeover? – How the capabilities of AI will impact the development of chatbots. * 6 areas of AI and Machine Learning to watch closely – A breakdown of six major areas defined by the term Artificial Intelligence. 
* IBM adds TensorFlow support to its PowerAI – IBM adds support for Google’s TensorFlow in a move highlighting the collaboration between the AI tech giants. * Social media data and the customer-centric strategy – How to utilize social media data in improving customer relations. * The Top Predictive Analytics Pitfalls to Avoid – Missteps to avoid when performing predictive analysis in order to obtain expected results from your models. * Trusting AI with important decisions: capabilities and challenges – The importance of considering the concrete benefits of AI while ensuring safety to property and human life. * What developers actually need to know about Machine Learning –A deviation from the traditional way of exposure to and learning Machine Learning. * Applied Data Science – Excerpts from a whitepaper on data science teams and the application of insights gained through analytics to the real world. * Apple joins Amazon, Facebook, Google, IBM and Microsoft in AI initiative –Apple joins the Partnership on AI to Benefit People and Society. * How Employers Judge Data Science Projects – 6 criteria that influence how potential employers evaluate applicants strength. * Introduction to Natural Language Processing, Part 1: Lexical Units – An exploration to the core concepts of Natural Language Processing. * What is Data Engineering? – The distinction between the wide fields of data science and data engineering. * Becoming a Data Scientist – An overview of the many skills and tools used by data scientists. * The Data Science Puzzle, Revisited – A discussion of how the key concepts related to data science and data science itself are unified. * Why It Matters That Artificial Intelligence Is About to Beat the World’s Best Poker Players – How a new AI system is contributing to advancement in the field. * Get Up to Speed with Data Science in 7 Easy Steps – 7 steps for beginners to get up-to-date with data science. UPCOMING DATA SCIENCE EVENTS * IBM Event: Big Data and Analytics Summit – February 14, 2017 @ 7:15 am – 4:45 pm COOL DATA SCIENCE VIDEOS * Deep Learning with Tensorflow – Recursive Neural Tensor Networks – An overview of Recursive Neural Tensor Networks and the Natural Language Processing problems that they are able to solve. * Deep Learning with Tensorflow – The Long Short Term Memory Model – An overview of the Long Short Term Memory Model. * Deep Learning with Tensorflow – The Recurrent Neural Network Model – An overview of the Recurrent Neural Network Model. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data.,"This Week in Data Science (January 31, 2017)",Live,288 853,"Slack’s Integration API and Cloudant’s HTTP API make it simple to store data directly into a Cloudant database without breaking a sweat. 
This tutorial shows how to create a custom slash command in Slack and how to post it directly to Cloudant.Slack is a messaging and team-working application that is used widely to allow disparate teams of people to chat, share files, and interact on desktop, tablet, and mobile platforms. We use Slack in IBM Cloud Data Services to coordinate our activities, to work in an open collaborative environment, and to cut down on email and meetings.One of the strengths of Slack is that it integrates with other web services, so events happening in Github or Stack Overflow can be surfaced in the appropriate Slack channels. Slack also has an API that lets you create custom integrations. The simplest of these is slash commands: when a user starts a Slack message with a forward slash followed by a command string, Slack can be configured to POST that data to an external API. Say you create the slash command /lunch. A user could type:","Slack's integration API allows external services to be plugged in with ease. Even if your service isn't listed in the off-the-shelf integrations, you can still push data to other HTTP services. This tutorial shows how a Slack 'slash command' can be configured to push data to a Cloudant or CouchDB database in a few easy steps.",Writing Data Directly to Cloudant from Slack,Live,289 858,"Outside of the core Elasticsearch toolset, there's a world of tools that make the search and analytics database even more useful and accessible. In this article we'll look at some and show what you do to get them working with Compose's Elasticsearch deployments. We'll start with a command line tool, move on to a simple search tool and finish with an all purpose client for searching and manipulating your Elasticsearch database...Let us start the tool tour with Es2unix, from the Elasticsearch developers. Es2unix is a version of the Elasticsearch API that you can use from the command line. It doesn't just make the API calls though, it also converts the returned results into a line-oriented, tabular format like many other Unix tools output. That makes it ideal for integrating Elasticsearch into your awk, grep and sort using shell scripts.Es2unix will need Java installed, Java 7 at least, and the binary version can be simply downloaded with a curl command and enabled with chmod as per the installation instructions:curl -s download.elasticsearch.org/es2unix/es >~/bin/eschmod +x ~/bin/esNote this assumes you have a bin directory in your $HOME and it's on your path.Now, when you run es it'll assume that Elasticsearch is running locally. When you are using Compose Elasticsearch, that isn't the case. If you've got the HTTP/TCP access portal enabled, you'll have to give the es command a URL to locate your Elasticsearch deployment. You can get the URL from your Compose dashboard - remember to substitute in the username and password of a Elasticsearch user (from the Users tab) into the URL. This URL is then passed using the -u option:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ versiones 20140723711d4f9elasticsearch 1.3.4The es command is followed by one of a selection of subcommands. There we've used the version subcommand to get the version of the es command and the version of Elasticsearch it is talking to. 
The health of the cluster can be established with the health subcommand:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ health -vtime cluster status nodes data pri shards relo init unassign11:14:39 EsExemplum green 3 3 3 6 0 0 0Drop the -v to get unlabelled results, ideal for passing into monitoring software - adding -v on many es subcommands is a signal that more extensive labelling of returned data is desired.The es command has the ability to count all documents or the number of documents that meets a simple query, and to search all indices and return matching ids:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ count ""one species or variety""11:44:02 16 ""one species or variety""shows a count of documents matching the parts of that phrase to different extents. Using the search command we can dig deeper:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ -v search ""one species or variety""score index type id0.16337 darwin-origin chapter II0.12559 darwin-origin chapter IX0.10360 darwin-origin chapter IV0.10141 darwin-origin chapter I0.09734 darwin-origin chapter XI0.09326 darwin-origin chapter V0.09226 darwin-origin chapter XV0.08744 darwin-origin chapter XIV0.08069 darwin-origin chapter VIII0.07525 darwin-origin chapter IIITotal: 16Now we can see the matching score along with the id, index and type of the document. Although here, 16 documents match, Elasticsearch returns only the top ten results by default. If we wanted to be more precise we could quote the string (remembering we're in the shell so back-slash escapes are needed) and select a field for matching:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ -v search """"one species or variety"""" textscore index type id text0.03073 darwin-origin chapter I [""CHAPTER I. VARIATI0.03073 darwin-origin chapter IX [""CHAPTER IX. HYBRIDTotal: 2Other subcommands in es2unix include indices, for listing indexes, ids for retrieving all ids from an index and a variety of management reporting commands such as nodes, heap and shards.You'll have probably noticed that the es command is a little laborious when you have to specify the URL every time. Es2unix doesn't have any short cuts when it comes to passing that URL like environment variables. There is another way though to shorten things and thats by using an SSH access portal instead. If you configure an SSH access portal for your Elasticsearch deployment then the default command for creating your SSH tunnels makes a node of the cluster appear to be at localhost:9200 which is the default. Once you have an SSH tunnel set up, you can drop the entire -u [URL] part and use tools as if you had Elasticsearch locally configured.Sometime you just want to set up a quick search for your Elasticsearch database with the minimum of effort. The Calaca project is very useful in that regard. It's an all JavaScript search front end for Elasticsearch which connects up to Elasticsearch. To get up and running, you'll want to download and unpack the zip file available from the Github page. Calaca's configuration can be found in the file js/config.js which looks like this:var indexName = ""name""; //Ex: twittervar docType = ""type""; //Ex: tweetvar maxResultsSize = 10;var host = ""localhost""; //Ex: http://ec2-123-aws.comvar port = 9200;As you can see, it comes configured to use the database on localhost port 9200, so you could use the SSH shortcut above. 
But we're here anyway so we need to change the host variable to ""https://user:pass@haproxy1.dblayer.com"" to match the URL we're given in the Compose dashboard and don't forget to copy in the username and password. The port number also needs to be copied from the dashboard URL to the port variable. The rest of the configuration is selecting what to search and what to show. Set the indexName and docType variables to index and data type you want to search. So, for our example here we have a config.js that reads:var indexName = ""darwin-origin"";var docType = ""chapter"";var maxResultsSize = 10;var host = ""https://user:pass@haproxy1.dblayer.com"";var port = 10361;Then it's a matter of editing the index.html file to set what results are shown. In the middle of the file is a section which says:Edit the result.name and result.description to display what fields you want to display from your document:We have a particularly long block of text in our document which we truncates down and we use the id and title together to create a heading. Save that, open index.html in your browser – there's no need to deploy to a server – and you'll see Calaca's search field. Enter a term and you'll see results like so:It's a quick way to get a pretty search query front end up locally without wrestling with forming Curl/JSON requests or deploying a full on server.Where Calaca's great for a super simple search client, you might want something a little more potent for your searching. For that, try ESClient, which not only has an extensive search UI but adds the ability to display those results in a table or as raw JSON results and then edit and delete selected documents. Like Calaca, ESClient needs no server, just download the zip or clone the Github respository. Configuring it means just editing the config.js file and putting in the URL from the Compose dashboard:var Config = {'CLUSTER_URL':'https://user:pass@haproxy1.dblayer.com:10361',Then you open esQueryClient.html in your browser and before you know it, there's the ESClient configuration screen - click the Connect button and a connection to the Elasticsearch database will be made and you'll be moved to the Search tab where you can select index, type, fields, sort fields, specify a Lucene or DSL query and click Search to see the results in a table below the query.Double clicking on a result will let you edit the documents that make up the result or you can use the results as a guide for a delete operation. If you set to ""Raw JSON"" switch in the Configuration tab, you'll also be able to view the complete raw returned results in the JSON Results tab.It's all rather usefully functional and there's only one slight problem. If you look at the top of the ESClient page, you'll see it's displaying the username and password as part of the URL for the database you are connecting to. Not really ideal that, but the SSH access portal can help out there too. If you set up and activate the tunnel, then you can return the CLUSTER_URL value in the config.js file to http://localhost:9200 and there'll be no username or password to display on screen.We've touched on three tools in this article, but more importantly we've shown the practical differences between using the HTTP/TCP and SSH access portals on componse. With HTTP/TCP access, there will be usernames and passwords embedded in the URL you use and this will leave any scripts or tools you configure susceptible to shoulder surfers and the like. 
That said, for occasionally launched tools it is quick and simple.With the SSH access portal, the configuration and authentication is done when you set up the tunnel in a separate process and the tunnel means you can use Elasticsearch as if the node was installed locally. The downside is you do need to make sure the SSH tunnel is up before you run any command and it may be easier to go through the HTTP/TCP access portal. But then thats why we give you both options at Compose so you can choose what suits you and your applications best.",There's a world of tools that make the Elasticsearch even more useful and accessible. In this article we'll look at some and show what you do to get them working with Compose's Elasticsearch deployments. ,Elasticsearch Tools & Compose,Live,290 859,"Homepage Follow Sign in Get started * Home * ✍️ Contribute * * 🔥 ML Newsletter * Dang Ha The Hien Blocked Unblock Follow Following PhD student at UiO, Data Scientist at eSmart Systems Apr 5, 2017 -------------------------------------------------------------------------------- A GUIDE TO RECEPTIVE FIELD ARITHMETIC FOR CONVOLUTIONAL NEURAL NETWORKS The receptive field is perhaps one of the most important concepts in Convolutional Neural Networks (CNNs) that deserves more attention from the literature. All of the state-of-the-art object recognition methods design their model architectures around this idea. However, to my best knowledge, currently there is no complete guide on how to calculate and visualize the receptive field information of a CNN. This post fills in the gap by introducing a new way to visualize feature maps in a CNN that exposes the receptive field information, accompanied by a complete receptive field calculation that can be used for any CNN architecture. I’ve also implemented a simple program to demonstrate the calculation so that anyone can start computing the receptive field and gain better knowledge about the CNN architecture that they are working with. To follow this post, I assume that you are familiar with the CNN concept, especially the convolutional and pooling operations. You can refresh your CNN knowledge by going through the paper “ A guide to convolution arithmetic for deep learning [1]”. It will not take you more than half an hour if you have some prior knowledge about CNNs. This post is in fact inspired by that paper and uses similar notations. Note: If you want to learn more about how CNNs can be used for Object Recognition, this post is for you.THE FIXED-SIZED CNN FEATURE MAP VISUALIZATION The receptive field is defined as the region in the input space that a particular CNN’s feature is looking at (i.e. be affected by) . A receptive field of a feature can be fully described by its center location and its size. Figure 1 shows some receptive field examples. By applying a convolution C with kernel size k = 3x3 , padding size p = 1x1 , stride s = 2x2 on an input map 5x5 , we will get an output feature map 3x3 (green map). Applying the same convolution on top of the 3x3 feature map, we will get a 2x2 feature map (orange map). The number of output features in each dimension can be calculated using the following formula, which is explained in detail in [ 1 ]. Note that in this post, to simplify things, I assume the CNN architecture to be symmetric, and the input image to be square. So both dimensions have the same values for all variables. If the CNN architecture or the input image is asymmetric, you can calculate the feature map attributes separately for each dimension. 
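The output-size formula referenced here appears to have been an image in the original post and does not survive in this text version, so here it is as a small Python helper, checked against the Figure 1 example that follows (a 5x5 input with k=3, p=1, s=2 gives 3 output features, and applying the same convolution again gives 2).

# The output-size relation from [1]: n_out = floor((n_in + 2*p - k) / s) + 1.
# Sanity-checked against the Figure 1 example below.
def num_output_features(n_in, k, p, s):
    return (n_in + 2 * p - k) // s + 1

assert num_output_features(5, k=3, p=1, s=2) == 3
assert num_output_features(3, k=3, p=1, s=2) == 2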
Figure 1: Two ways to visualize CNN feature maps. In all cases, we uses the convolution C with kernel size k = 3x3, padding size p = 1x1, stride s = 2x2. (Top row) Applying the convolution on a 5x5 input map to produce the 3x3 green feature map. (Bottom row) Applying the same convolution on top of the green feature map to produce the 2x2 orange feature map. (Left column) The common way to visualize a CNN feature map. Only looking at the feature map, we do not know where a feature is looking at (the center location of its receptive field) and how big is that region (its receptive field size). It will be impossible to keep track of the receptive field information in a deep CNN. (Right column) The fixed-sized CNN feature map visualization, where the size of each feature map is fixed, and the feature is located at the center of its receptive field.The left column of Figure 1 shows a common way to visualize a CNN feature map. In that visualization, although by looking at a feature map, we know how many features it contains. It is impossible to know where each feature is looking at (the center location of its receptive field) and how big is that region (its receptive field size). The right column of Figure 1 shows the fixed-sized CNN visualization, which solves the problem by keeping the size of all feature maps constant and equal to the input map. Each feature is then marked at the center of its receptive field location. Because all features in a feature map have the same receptive field size, we can simply draw a bounding box around one feature to represent its receptive field size. We don’t have to map this bounding box all the way down to the input layer since the feature map is already represented in the same size of the input layer. Figure 2 shows another example using the same convolution but applied on a bigger input map — 7x7. We can either plot the fixed-sized CNN feature maps in 3D (Left) or in 2D (Right). Notice that the size of the receptive field in Figure 2 escalates very quickly to the point that the receptive field of the center feature of the second feature layer covers almost the whole input map. This is an important insight which was used to improve the design of a deep CNN. Figure 2: Another fixed-sized CNN feature map representation. The same convolution C is applied on a bigger input map with i = 7x7. I drew the receptive field bounding box around the center feature and removed the padding grid for a clearer view. The fixed-sized CNN feature map can be presented in 3D (Left) or 2D (Right).RECEPTIVE FIELD ARITHMETIC To calculate the receptive field in each layer, besides the number of features n in each dimension, we need to keep track of some extra information for each layer. These include the current receptive field size r , the distance between two adjacent features (or jump) j, and the center coordinate of the upper left feature (the first feature) start . Note that the center coordinate of a feature is defined to be the center coordinate of its receptive field, as shown in the fixed-sized CNN feature map above. When applying a convolution with the kernel size k , the padding size p , and the stride size s , the attributes of the output layer can be calculated by the following equations: * The first equation calculates the number of output features based on the number of input features and the convolution properties. This is the same equation presented in [ 1 ]. 
* The second equation calculates the jump in the output feature map, which is the jump in the input map multiplied by the number of input features you jump over when applying the convolution (the stride size); in symbols, j_out = j_in * s. * The third equation calculates the receptive field size of the output feature map, which is the area covered by k input features, (k - 1) * j_in, plus the extra area covered by the receptive field of the input features on the border; in symbols, r_out = r_in + (k - 1) * j_in. * The fourth equation calculates the center position of the receptive field of the first output feature, which is the center position of the first input feature plus the distance from the first input feature to the center of the first convolution, (k - 1)/2 * j_in, minus the padding space p * j_in; in symbols, start_out = start_in + ((k - 1)/2 - p) * j_in. Note that we need to multiply by the jump of the input feature map in both cases to get the actual distance/space. The first layer is the input layer, which always has n = image size, r = 1, j = 1, and start = 0.5. Note that in Figure 3, I used a coordinate system in which the center of the first feature of the input layer is at 0.5. By applying the four equations above recursively, we can calculate the receptive field information for every feature map in a CNN. Figure 3 shows an example of how these equations work. Figure 3: Applying the receptive field calculation to the example given in Figure 1. The first row shows the notation and general equations, while the second and last rows show the process of applying them to calculate the receptive field of the output layer given the input layer information. I've also created a small Python program that calculates the receptive field information for all layers in a given CNN architecture. It also lets you input the name of any feature map and the index of a feature in that map, and returns the size and location of the corresponding receptive field. The following figure shows an output example when we use AlexNet. The code is provided at the end of this post.",The receptive field is perhaps one of the most important concepts in Convolutional Neural Networks (CNNs) that deserves more attention from the literature. This post will introduce a new way to visualize feature maps in a CNN that exposes the receptive field information.,A guide to receptive field arithmetic for Convolutional Neural Networks,Live,291 861,This video will help you to understand how Cloudant replication works. Visit http://www.cloudant.com/sign-up to sign up for a free Cloudant account.
Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center,Understand how Cloudant database replication works,Understand how replication works,Live,292 863,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectOPEN DATA DAY, ECONOMIC JUSTICE, AND CIVIC ENGAGEMENTRaj R Singh / March 8, 2016I spent International Open Data Day at the NYC School of Data , New York City's civic technology an open data conference. It was aninspirational, battery-recharging experience that reminded me what's trulyimportant in life. Along the way, I learned many things: 1. In 2012, New York city passed the first sweeping open data law that switched the burden of information sharing from the public (think 1970s-era Freedom of Information Act policies), to the government. In short, NYC government departments are legally required to publish data online, for free, whenever possible! (I need to see how my local City of Boston open data policy compares…) 2. IBM has a Chief Data Strategist , Steven Adler. He's on the board of the NYCLU and got me involved in this event. Thanks Steve! Looking forward to moving the needle on data issues with you in the future. 3. Most importantly, I learned that the increasing availability of government open data sets around the country are providing powerful new ways for communities to engage on civic issues. Not only can we surface issues, we can also partner with government in operationalizing the monitoring and analysis of problems and solutions.As Jennifer Pahlka put it today, government needs to know whether policies are working in days ormonths, not decades . What an inspiring idea!Jennifer Pahlka presenting her work.One issue the group began to tackle, spurred by the NYCLU , is around economic justice. How can we tell if government policies areplaying out fairly in society and having the intended results? An example of apowerful data-driven story is that of "" million-dollar blocks ."" These are city blocks where states are spending in excess of a milliondollars a year to incarcerate their residents. Are you surprised million-dollarblocks exist? Is that a good way to spend public funds? Only by surfacing thesefacts with real data can we begin to have a truly informed public debate.Map of “million-dollar blocks” which show state incarceration spending byhousehold.If you're reading this, you're probably in the tech sector and doing pretty wellcompared to the rest of the world. A lot of that is luck. Your embryonic-cellself replicated and grew without mutation. Then you were born into a first-worldsociety, were well-nourished, and it was pretty easy for you to get a lot ofeducation without being interrupted by famine, drought, or war. Noteveryone—even in the US—is that lucky.So my message today is: give something back. Even if you only have an hour amonth, or a day a week, or just some cash, get involved. Join a local civichack, find a Code for America project , or update OpenStreetMap . Happy International Open Data Day!SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: dashdb / geospatial / opendata / Python / R / Spark Please enable JavaScript to view the comments powered by Disqus. 
blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Exciting civic and open data projects discussed Saturday at NYC School of Data.,"Open Data Day, Economic Justice, and Civic Engagement",Live,293 866,"I've consulted with hundreds of people who use CouchDB, and the same sorts of questions keep coming up. Come to this talk if you want to know more about the kinds of mistakes many users make when thinking about how to use the database in their application. I'll talk a bit about the ""rough edges"" of CouchDB, and how to work them to your advantage.",Joan Touzet talks about ten common misconceptions about CouchDB and lends insight into best practices and design patterns.,10 Common Misconceptions about CouchDB,Live,294 877,"Skip to content * Features * Business * Explore * Marketplace * Pricing This repository Sign in or Sign up * Watch 1,180 * Star 11,226 * Fork 1,736 TERRYUM / AWESOME-DEEP-LEARNING-PAPERS Code Issues 6 Pull requests 1 Projects 0 Insights Pulse Graphs Permalink Branch: master Switch branches/tags * Branches * Tags master Nothing to show Nothing to show Find file Copy path awesome-deep-learning-papers / README.md a667046 Jun 28, 2017 terryum Update README.md 21 contributorsUSERS WHO HAVE CONTRIBUTED TO THIS FILE * terryum * miguelballesteros * Jeet1994 * jdoerrie * sunshinemyson * rtlee9 * flukeskywalker * pra85 * mbchang * mendelson * lserafin * ltrottier * liyaguang * lamblin * jeremyschlatter * rajikaimal * hosang * eddiepierce * dcastro9 * dan2k3k4 * bamos Raw Blame History 384 lines (320 sloc) 43.1 KBAWESOME - MOST CITED DEEP LEARNING PAPERS A curated list of the most cited deep learning papers (since 2012) We believe that there exist classic deep learning papers which are worth reading regardless of their application domain. Rather than providing overwhelming amount of papers, We would like to provide a curated list of the awesome deep learning papers which are considered as must-reads in certain research domains. BACKGROUND Before this list, there exist other awesome deep learning lists , for example, Deep Vision and Awesome Recurrent Neural Networks . Also, after this list comes out, another awesome list for deep learning beginners, called Deep Learning Papers Reading Roadmap , has been created and loved by many deep learning researchers. Although the Roadmap List includes lots of important deep learning papers, it feels overwhelming for me to read them all. As I mentioned in the introduction, I believe that seminal works can give us lessons regardless of their application domain. Thus, I would like to introduce top 100 deep learning papers here as a good starting point of overviewing deep learning researches. To get the news for newly released papers everyday, follow my twitter or facebook page ! AWESOME LIST CRITERIA 1. A list of top 100 deep learning papers published from 2012 to 2016 is suggested. 2. 
If a paper is added to the list, another paper (usually from *More Papers from 2016"" section) should be removed to keep top 100 papers. (Thus, removing papers is also important contributions as well as adding papers) 3. Papers that are important, but failed to be included in the list, will be listed in More than Top 100 section. 4. Please refer to New Papers and Old Papers sections for the papers published in recent 6 months or before 2012. (Citation criteria) * < 6 months : New Papers (by discussion) * 2016 : +60 citations or ""More Papers from 2016"" * 2015 : +200 citations * 2014 : +400 citations * 2013 : +600 citations * 2012 : +800 citations * ~2012 : Old Papers (by discussion) Please note that we prefer seminal deep learning papers that can be applied to various researches rather than application papers. For that reason, some papers that meet the criteria may not be accepted while others can be. It depends on the impact of the paper, applicability to other researches scarcity of the research domain, and so on. We need your contributions! If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and pull a request. (Please read the contributing guide for further instructions, though just letting me know the title of papers can also be a big contribution to us.) (Update) You can download all top-100 papers with this and collect all authors' names with this . Also, bib file for all top-100 papers are available. Thanks, doodhwala, Sven and grepinsight ! * Can anyone contribute the code for obtaining the statistics of the authors of Top-100 papers? CONTENTS * Understanding / Generalization / Transfer * Optimization / Training Techniques * Unsupervised / Generative Models * Convolutional Network Models * Image Segmentation / Object Detection * Image / Video / Etc * Natural Language Processing / RNNs * Speech / Other Domain * Reinforcement Learning / Robotics * More Papers from 2016 (More than Top 100) * New Papers : Less than 6 months * Old Papers : Before 2012 * HW / SW / Dataset : Technical reports * Book / Survey / Review * Video Lectures / Tutorials / Blogs * Appendix: More than Top 100 : More papers not in the list -------------------------------------------------------------------------------- UNDERSTANDING / GENERALIZATION / TRANSFER * Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf] * Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015), A. Nguyen et al. [pdf] * How transferable are features in deep neural networks? (2014), J. Yosinski et al. [pdf] * CNN features off-the-Shelf: An astounding baseline for recognition (2014), A. Razavian et al. [pdf] * Learning and transferring mid-Level image representations using convolutional neural networks (2014), M. Oquab et al. [pdf] * Visualizing and understanding convolutional networks (2014), M. Zeiler and R. Fergus [pdf] * Decaf: A deep convolutional activation feature for generic visual recognition (2014), J. Donahue et al. [pdf] OPTIMIZATION / TRAINING TECHNIQUES * Training very deep networks (2015), R. Srivastava et al. [pdf] * Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Loffe and C. Szegedy [pdf] * Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), K. He et al. [pdf] * Dropout: A simple way to prevent neural networks from overfitting (2014), N. Srivastava et al. 
[pdf] * Adam: A method for stochastic optimization (2014), D. Kingma and J. Ba [pdf] * Improving neural networks by preventing co-adaptation of feature detectors (2012), G. Hinton et al. [pdf] * Random search for hyper-parameter optimization (2012) J. Bergstra and Y. Bengio [pdf] UNSUPERVISED / GENERATIVE MODELS * Pixel recurrent neural networks (2016), A. Oord et al. [pdf] * Improved techniques for training GANs (2016), T. Salimans et al. [pdf] * Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf] * DRAW: A recurrent neural network for image generation (2015), K. Gregor et al. [pdf] * Generative adversarial nets (2014), I. Goodfellow et al. [pdf] * Auto-encoding variational Bayes (2013), D. Kingma and M. Welling [pdf] * Building high-level features using large scale unsupervised learning (2013), Q. Le et al. [pdf] CONVOLUTIONAL NEURAL NETWORK MODELS * Rethinking the inception architecture for computer vision (2016), C. Szegedy et al. [pdf] * Inception-v4, inception-resnet and the impact of residual connections on learning (2016), C. Szegedy et al. [pdf] * Identity Mappings in Deep Residual Networks (2016), K. He et al. [pdf] * Deep residual learning for image recognition (2016), K. He et al. [pdf] * Spatial transformer network (2015), M. Jaderberg et al., [pdf] * Going deeper with convolutions (2015), C. Szegedy et al. [pdf] * Very deep convolutional networks for large-scale image recognition (2014), K. Simonyan and A. Zisserman [pdf] * Return of the devil in the details: delving deep into convolutional nets (2014), K. Chatfield et al. [pdf] * OverFeat: Integrated recognition, localization and detection using convolutional networks (2013), P. Sermanet et al. [pdf] * Maxout networks (2013), I. Goodfellow et al. [pdf] * Network in network (2013), M. Lin et al. [pdf] * ImageNet classification with deep convolutional neural networks (2012), A. Krizhevsky et al. [pdf] IMAGE: SEGMENTATION / OBJECT DETECTION * You only look once: Unified, real-time object detection (2016), J. Redmon et al. [pdf] * Fully convolutional networks for semantic segmentation (2015), J. Long et al. [pdf] * Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015), S. Ren et al. [pdf] * Fast R-CNN (2015), R. Girshick [pdf] * Rich feature hierarchies for accurate object detection and semantic segmentation (2014), R. Girshick et al. [pdf] * Spatial pyramid pooling in deep convolutional networks for visual recognition (2014), K. He et al. [pdf] * Semantic image segmentation with deep convolutional nets and fully connected CRFs , L. Chen et al. [pdf] * Learning hierarchical features for scene labeling (2013), C. Farabet et al. [pdf] IMAGE / VIDEO / ETC * Image Super-Resolution Using Deep Convolutional Networks (2016), C. Dong et al. [pdf] * A neural algorithm of artistic style (2015), L. Gatys et al. [pdf] * Deep visual-semantic alignments for generating image descriptions (2015), A. Karpathy and L. Fei-Fei [pdf] * Show, attend and tell: Neural image caption generation with visual attention (2015), K. Xu et al. [pdf] * Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf] * Long-term recurrent convolutional networks for visual recognition and description (2015), J. Donahue et al. [pdf] * VQA: Visual question answering (2015), S. Antol et al. [pdf] * DeepFace: Closing the gap to human-level performance in face verification (2014), Y. Taigman et al. 
[pdf] : * Large-scale video classification with convolutional neural networks (2014), A. Karpathy et al. [pdf] * Two-stream convolutional networks for action recognition in videos (2014), K. Simonyan et al. [pdf] * 3D convolutional neural networks for human action recognition (2013), S. Ji et al. [pdf] NATURAL LANGUAGE PROCESSING / RNNS * Neural Architectures for Named Entity Recognition (2016), G. Lample et al. [pdf] * Exploring the limits of language modeling (2016), R. Jozefowicz et al. [pdf] * Teaching machines to read and comprehend (2015), K. Hermann et al. [pdf] * Effective approaches to attention-based neural machine translation (2015), M. Luong et al. [pdf] * Conditional random fields as recurrent neural networks (2015), S. Zheng and S. Jayasumana. [pdf] * Memory networks (2014), J. Weston et al. [pdf] * Neural turing machines (2014), A. Graves et al. [pdf] * Neural machine translation by jointly learning to align and translate (2014), D. Bahdanau et al. [pdf] * Sequence to sequence learning with neural networks (2014), I. Sutskever et al. [pdf] * Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf] * A convolutional neural network for modeling sentences (2014), N. Kalchbrenner et al. [pdf] * Convolutional neural networks for sentence classification (2014), Y. Kim [pdf] * Glove: Global vectors for word representation (2014), J. Pennington et al. [pdf] * Distributed representations of sentences and documents (2014), Q. Le and T. Mikolov [pdf] * Distributed representations of words and phrases and their compositionality (2013), T. Mikolov et al. [pdf] * Efficient estimation of word representations in vector space (2013), T. Mikolov et al. [pdf] * Recursive deep models for semantic compositionality over a sentiment treebank (2013), R. Socher et al. [pdf] * Generating sequences with recurrent neural networks (2013), A. Graves. [pdf] SPEECH / OTHER DOMAIN * End-to-end attention-based large vocabulary speech recognition (2016), D. Bahdanau et al. [pdf] * Deep speech 2: End-to-end speech recognition in English and Mandarin (2015), D. Amodei et al. [pdf] * Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf] * Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012), G. Hinton et al. [pdf] * Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G. Dahl et al. [pdf] * Acoustic modeling using deep belief networks (2012), A. Mohamed et al. [pdf] REINFORCEMENT LEARNING / ROBOTICS * End-to-end training of deep visuomotor policies (2016), S. Levine et al. [pdf] * Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016), S. Levine et al. [pdf] * Asynchronous methods for deep reinforcement learning (2016), V. Mnih et al. [pdf] * Deep Reinforcement Learning with Double Q-Learning (2016), H. Hasselt et al. [pdf] * Mastering the game of Go with deep neural networks and tree search (2016), D. Silver et al. [pdf] * Continuous control with deep reinforcement learning (2015), T. Lillicrap et al. [pdf] * Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf] * Deep learning for detecting robotic grasps (2015), I. Lenz et al. [pdf] * Playing atari with deep reinforcement learning (2013), V. Mnih et al. [pdf] ) MORE PAPERS FROM 2016 * Layer Normalization (2016), J. Ba et al. 
[pdf] * Learning to learn by gradient descent by gradient descent (2016), M. Andrychowicz et al. [pdf] * Domain-adversarial training of neural networks (2016), Y. Ganin et al. [pdf] * WaveNet: A Generative Model for Raw Audio (2016), A. Oord et al. [pdf] [web] * Colorful image colorization (2016), R. Zhang et al. [pdf] * Generative visual manipulation on the natural image manifold (2016), J. Zhu et al. [pdf] * Texture networks: Feed-forward synthesis of textures and stylized images (2016), D Ulyanov et al. [pdf] * SSD: Single shot multibox detector (2016), W. Liu et al. [pdf] * SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 1MB model size (2016), F. Iandola et al. [pdf] * Eie: Efficient inference engine on compressed deep neural network (2016), S. Han et al. [pdf] * Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1 (2016), M. Courbariaux et al. [pdf] * Dynamic memory networks for visual and textual question answering (2016), C. Xiong et al. [pdf] * Stacked attention networks for image question answering (2016), Z. Yang et al. [pdf] * Hybrid computing using a neural network with dynamic external memory (2016), A. Graves et al. [pdf] * Google's neural machine translation system: Bridging the gap between human and machine translation (2016), Y. Wu et al. [pdf] -------------------------------------------------------------------------------- NEW PAPERS Newly published papers (< 6 months) which are worth reading * Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour (2017), Priya Goyal et al. [pdf] * TACOTRON: Towards end-to-end speech synthesis (2017), Y. Wang et al. [pdf] * Deep Photo Style Transfer (2017), F. Luan et al. [pdf] * Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017), T. Salimans et al. [pdf] * Deformable Convolutional Networks (2017), J. Dai et al. [pdf] * Mask R-CNN (2017), K. He et al. [pdf] * Learning to discover cross-domain relations with generative adversarial networks (2017), T. Kim et al. [pdf] * Deep voice: Real-time neural text-to-speech (2017), S. Arik et al., [pdf] * PixelNet: Representation of the pixels, by the pixels, and for the pixels (2017), A. Bansal et al. [pdf] * Batch renormalization: Towards reducing minibatch dependence in batch-normalized models (2017), S. Ioffe. [pdf] * Wasserstein GAN (2017), M. Arjovsky et al. [pdf] * Understanding deep learning requires rethinking generalization (2017), C. Zhang et al. [pdf] * Least squares generative adversarial networks (2016), X. Mao et al. [pdf] OLD PAPERS Classic papers published before 2012 * An analysis of single-layer networks in unsupervised feature learning (2011), A. Coates et al. [pdf] * Deep sparse rectifier neural networks (2011), X. Glorot et al. [pdf] * Natural language processing (almost) from scratch (2011), R. Collobert et al. [pdf] * Recurrent neural network based language model (2010), T. Mikolov et al. [pdf] * Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion (2010), P. Vincent et al. [pdf] * Learning mid-level features for recognition (2010), Y. Boureau [pdf] * A practical guide to training restricted boltzmann machines (2010), G. Hinton [pdf] * Understanding the difficulty of training deep feedforward neural networks (2010), X. Glorot and Y. Bengio [pdf] * Why does unsupervised pre-training help deep learning (2010), D. Erhan et al. [pdf] * Learning deep architectures for AI (2009), Y. Bengio. 
[pdf] * Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009), H. Lee et al. [pdf] * Greedy layer-wise training of deep networks (2007), Y. Bengio et al. [pdf] * Reducing the dimensionality of data with neural networks, G. Hinton and R. Salakhutdinov. [pdf] * A fast learning algorithm for deep belief nets (2006), G. Hinton et al. [pdf] * Gradient-based learning applied to document recognition (1998), Y. LeCun et al. [pdf] * Long short-term memory (1997), S. Hochreiter and J. Schmidhuber. [pdf] HW / SW / DATASET * OpenAI gym (2016), G. Brockman et al. [pdf] * TensorFlow: Large-scale machine learning on heterogeneous distributed systems (2016), M. Abadi et al. [pdf] * Theano: A Python framework for fast computation of mathematical expressions, R. Al-Rfou et al. * Torch7: A matlab-like environment for machine learning, R. Collobert et al. [pdf] * MatConvNet: Convolutional neural networks for matlab (2015), A. Vedaldi and K. Lenc [pdf] * Imagenet large scale visual recognition challenge (2015), O. Russakovsky et al. [pdf] * Caffe: Convolutional architecture for fast feature embedding (2014), Y. Jia et al. [pdf] BOOK / SURVEY / REVIEW * On the Origin of Deep Learning (2017), H. Wang and Bhiksha Raj. [pdf] * Deep Reinforcement Learning: An Overview (2017), Y. Li, [pdf] * Neural Machine Translation and Sequence-to-sequence Models(2017): A Tutorial, G. Neubig. [pdf] * Neural Network and Deep Learning (Book, Jan 2017), Michael Nielsen. [html] * Deep learning (Book, 2016), Goodfellow et al. [html] * LSTM: A search space odyssey (2016), K. Greff et al. [pdf] * Tutorial on Variational Autoencoders (2016), C. Doersch. [pdf] * Deep learning (2015), Y. LeCun, Y. Bengio and G. Hinton [pdf] * Deep learning in neural networks: An overview (2015), J. Schmidhuber [pdf] * Representation learning: A review and new perspectives (2013), Y. Bengio et al. [pdf] VIDEO LECTURES / TUTORIALS / BLOGS (Lectures) * CS231n, Convolutional Neural Networks for Visual Recognition, Stanford University [web] * CS224d, Deep Learning for Natural Language Processing, Stanford University [web] * Oxford Deep NLP 2017, Deep Learning for Natural Language Processing, University of Oxford [web] (Tutorials) * NIPS 2016 Tutorials, Long Beach [web] * ICML 2016 Tutorials, New York City [web] * ICLR 2016 Videos, San Juan [web] * Deep Learning Summer School 2016, Montreal [web] * Bay Area Deep Learning School 2016, Stanford [web] (Blogs) * OpenAI [web] * Distill [web] * Andrej Karpathy Blog [web] * Colah's Blog [Web] * WildML [Web] * FastML [web] * TheMorningPaper [web] APPENDIX: MORE THAN TOP 100 (2016) * A character-level decoder without explicit segmentation for neural machine translation (2016), J. Chung et al. [pdf] * Dermatologist-level classification of skin cancer with deep neural networks (2017), A. Esteva et al. [html] * Weakly supervised object localization with multi-fold multiple instance learning (2017), R. Gokberk et al. [pdf] * Brain tumor segmentation with deep neural networks (2017), M. Havaei et al. [pdf] * Professor Forcing: A New Algorithm for Training Recurrent Networks (2016), A. Lamb et al. [pdf] * Adversarially learned inference (2016), V. Dumoulin et al. [web] [pdf] * Understanding convolutional neural networks (2016), J. Koushik [pdf] * Taking the human out of the loop: A review of bayesian optimization (2016), B. Shahriari et al. [pdf] * Adaptive computation time for recurrent neural networks (2016), A. 
Graves [pdf] * Densely connected convolutional networks (2016), G. Huang et al. [pdf] * Region-based convolutional networks for accurate object detection and segmentation (2016), R. Girshick et al. * Continuous deep q-learning with model-based acceleration (2016), S. Gu et al. [pdf] * A thorough examination of the cnn/daily mail reading comprehension task (2016), D. Chen et al. [pdf] * Achieving open vocabulary neural machine translation with hybrid word-character models, M. Luong and C. Manning. [pdf] * Very Deep Convolutional Networks for Natural Language Processing (2016), A. Conneau et al. [pdf] * Bag of tricks for efficient text classification (2016), A. Joulin et al. [pdf] * Efficient piecewise training of deep structured models for semantic segmentation (2016), G. Lin et al. [pdf] * Learning to compose neural networks for question answering (2016), J. Andreas et al. [pdf] * Perceptual losses for real-time style transfer and super-resolution (2016), J. Johnson et al. [pdf] * Reading text in the wild with convolutional neural networks (2016), M. Jaderberg et al. [pdf] * What makes for effective detection proposals? (2016), J. Hosang et al. [pdf] * Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks (2016), S. Bell et al. [pdf] . * Instance-aware semantic segmentation via multi-task network cascades (2016), J. Dai et al. [pdf] * Conditional image generation with pixelcnn decoders (2016), A. van den Oord et al. [pdf] * Deep networks with stochastic depth (2016), G. Huang et al., [pdf] * Consistency and Fluctuations For Stochastic Gradient Langevin Dynamics (2016), Yee Whye Teh et al. [pdf] (2015) * Ask your neurons: A neural-based approach to answering questions about images (2015), M. Malinowski et al. [pdf] * Exploring models and data for image question answering (2015), M. Ren et al. [pdf] * Are you talking to a machine? dataset and methods for multilingual image question (2015), H. Gao et al. [pdf] * Mind's eye: A recurrent visual representation for image caption generation (2015), X. Chen and C. Zitnick. [pdf] * From captions to visual concepts and back (2015), H. Fang et al. [pdf] . * Towards AI-complete question answering: A set of prerequisite toy tasks (2015), J. Weston et al. [pdf] * Ask me anything: Dynamic memory networks for natural language processing (2015), A. Kumar et al. [pdf] * Unsupervised learning of video representations using LSTMs (2015), N. Srivastava et al. [pdf] * Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding (2015), S. Han et al. [pdf] * Improved semantic representations from tree-structured long short-term memory networks (2015), K. Tai et al. [pdf] * Character-aware neural language models (2015), Y. Kim et al. [pdf] * Grammar as a foreign language (2015), O. Vinyals et al. [pdf] * Trust Region Policy Optimization (2015), J. Schulman et al. [pdf] * Beyond short snippents: Deep networks for video classification (2015) [pdf] * Learning Deconvolution Network for Semantic Segmentation (2015), H. Noh et al. [pdf] * Learning spatiotemporal features with 3d convolutional networks (2015), D. Tran et al. [pdf] * Understanding neural networks through deep visualization (2015), J. Yosinski et al. [pdf] * An Empirical Exploration of Recurrent Network Architectures (2015), R. Jozefowicz et al. [pdf] * Deep generative image models using a laplacian pyramid of adversarial networks (2015), E.Denton et al. [pdf] * Gated Feedback Recurrent Neural Networks (2015), J. 
Chung et al. [pdf] * Fast and accurate deep network learning by exponential linear units (ELUS) (2015), D. Clevert et al. [pdf] * Pointer networks (2015), O. Vinyals et al. [pdf] * Visualizing and Understanding Recurrent Networks (2015), A. Karpathy et al. [pdf] * Attention-based models for speech recognition (2015), J. Chorowski et al. [pdf] * End-to-end memory networks (2015), S. Sukbaatar et al. [pdf] * Describing videos by exploiting temporal structure (2015), L. Yao et al. [pdf] * A neural conversational model (2015), O. Vinyals and Q. Le. [pdf] * Improving distributional similarity with lessons learned from word embeddings, O. Levy et al. [[pdf]] ( https://www.transacl.org/ojs/index.php/tacl/article/download/570/124 ) * Transition-Based Dependency Parsing with Stack Long Short-Term Memory (2015), C. Dyer et al. [pdf] * Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs (2015), M. Ballesteros et al. [pdf] * Finding function in form: Compositional character models for open vocabulary word representation (2015), W. Ling et al. [pdf] (~2014) * DeepPose: Human pose estimation via deep neural networks (2014), A. Toshev and C. Szegedy [pdf] * Learning a Deep Convolutional Network for Image Super-Resolution (2014, C. Dong et al. [pdf] * Recurrent models of visual attention (2014), V. Mnih et al. [pdf] * Empirical evaluation of gated recurrent neural networks on sequence modeling (2014), J. Chung et al. [pdf] * Addressing the rare word problem in neural machine translation (2014), M. Luong et al. [pdf] * On the properties of neural machine translation: Encoder-decoder approaches (2014), K. Cho et. al. * Recurrent neural network regularization (2014), W. Zaremba et al. [pdf] * Intriguing properties of neural networks (2014), C. Szegedy et al. [pdf] * Towards end-to-end speech recognition with recurrent neural networks (2014), A. Graves and N. Jaitly. [pdf] * Scalable object detection using deep neural networks (2014), D. Erhan et al. [pdf] * On the importance of initialization and momentum in deep learning (2013), I. Sutskever et al. [pdf] * Regularization of neural networks using dropconnect (2013), L. Wan et al. [pdf] * Learning Hierarchical Features for Scene Labeling (2013), C. Farabet et al. [pdf] * Linguistic Regularities in Continuous Space Word Representations (2013), T. Mikolov et al. [pdf] * Large scale distributed deep networks (2012), J. Dean et al. [pdf] * A Fast and Accurate Dependency Parser using Neural Networks. Chen and Manning. [pdf] ACKNOWLEDGEMENT Thank you for all your contributions. Please make sure to read the contributing guide before you make a pull request. LICENSE To the extent possible under law, Terry T. Um has waived all copyright and related or neighboring rights to this work. Jump to Line Go * Contact GitHub * API * Training * Shop * Blog * About * © 2017 GitHub , Inc. * Terms * Privacy * Security * Status * Help You can't perform that action at this time. You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.",A curated list of the most cited deep learning papers (since 2012). ,Awesome deep learning papers,Live,295 878,"Compose The Compose logo Articles Sign in Free 30-day trialMAKING OF A SMART BUSINESS CHATBOT: PART 3 Published Aug 24, 2017 Making of a Smart Business Chatbot: Part 3 janusgraph watson conversation Free 30 Day TrialChatbots are a great way to interact with your customers in real-time and gain insights into your users. 
In this third part of the series on building smart business chatbots, we’ll use a JanusGraph-backed knowledge base to give our chatbot from part 1 and part 2 some utility. We’ve reached the third part of our Building Smart Business Chatbots and now we’re going to use JanusGraph to give our bot the knowledge to go with the chat. We’ll use Watson Conversation to allow our users to search for articles that might match their interests and responds back in conversational form. Let’s get started... GRAPHING IT UP It always helps to have some data before we start coding things up, so let’s start by inputting some articles from the Compose blog into a new JanusGraph database. Follow the first few steps in our article on Markov Chains to spin up a JanusGraph Instance on Compose and get Gremlin up-and-running. Once you have those going, we’ll create a new database for our articles: gremlin> :> def graph = ConfiguredGraphFactory.create(""composeblog"") ==>standardjanusgraph[astyanax:[10.189.87.4, 10.189.87.3, 10.189.87.2]] gremlin> :> graph.tx().commit() Next, we’ll grab a few articles with various tags and topics from a few different authors. We’ll model them by using vertices for our authors, tags, and articles. We’ll use edges to represent the relationships between those vertices: Let’s go ahead and start building out our graph. We’ll fill in 10 articles from the blog across 3 different authors. First, let’s add the authors: gremlin> :> graph.tx().commit() ==>null gremlin> :> def john = graph.addVertex(T.label, ""person"", ""name"", ""John O'Connor"") ==>v[4112] gremlin> :> def abdullah = graph.addVertex(T.label, ""person"", ""name"", ""Abdullah Alger"") ==>v[8208] gremlin> :> def dj = graph.addVertex(T.label, ""person"", ""name"", ""DJ Walker-Morgan"") ==>v[4208] gremlin> :> graph.tx().commit() ==>null Next, we’ll add some tags from a sampling of articles. 
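Before doing that, a quick sanity check (this check is my addition, not part of the original walkthrough, and it assumes the session above is still live; if it has timed out, you can reopen the graph and bind a traversal source as shown):

gremlin> :> def g = ConfiguredGraphFactory.open(""composeblog"").traversal()
gremlin> :> g.V().hasLabel(""person"").values(""name"")
==>John O'Connor
==>Abdullah Alger
==>DJ Walker-Morgan

All three authors should come back, although the order may differ.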
We’ll use the following sampling of articles to give us a good starting point, and we’ll pull the tags directly from those articles: * Taking a Look at Robomongo and Studio 3T with Compose for MongoDB * Avoid Storing Data Inside ""Admin"" When Using MongoDB * Storing Network Addresses using PostgreSQL * Mastering PostgreSQL Tools: Full-Text Search and Phrase Search * How to Script Painless-ly in Elasticsearch * MQTT and STOMP for Compose RabbitMQ * Elasticsearch 5.4.2 comes to Compose * Compose PostgreSQL powers up to 9.6 * Introduction to Graph Databases * Easier Java connections to MongoDB at Compose * Graph 101: Magical Markov Chains * Building Secure Instant API's with RESTHeart and Compose * Compose Tips: Dates and Dating in MongoDB * 5-minute Signup Forms with Node-RED and Compose * Mongo Metrics: Calculating the Mode * Building Secure Distributed Javascript Microservices with RabbitMQ and SenecaJS Let’s go through each of these articles and extract the relevant tags: gremlin> :> def mongodb = graph.addVertex(T.label, ""tag"", ""name"", ""mongodb"") ==>v[8304] gremlin> :> def janusgraph = graph.addVertex(T.label, ""tag"", ""name"", ""janusgraph"") ==>v[4232] gremlin> :> def nodeRed = graph.addVertex(T.label, ""tag"", ""name"", ""node-red"") ==>v[4304] gremlin> :> def nodejs = graph.addVertex(T.label, ""tag"", ""name"", ""nodejs"") ==>v[8328] gremlin> :> def rabbitmq = graph.addVertex(T.label, ""tag"", ""name"", ""rabbitmq"") ==>v[4152] gremlin> :> def elasticsearch = graph.addVertex(T.label, ""tag"", ""name"", ""elasticsearch"") ==>v[4184] gremlin> :> def postgres = graph.addVertex(T.label, ""tag"", ""name"", ""postgres"") ==>v[8280] gremlin> :> graph.tx().commit() ==>null Now that we have our tags, we can input our articles along with their relationship between tags and authors. 
gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Taking a Look at Robomongo and Studio 3T with Compose for MongoDB"", “url”, “https://www.compose.com/articles/taking-a-look-at-robomongo-and-studio-3t-with-compose-for-mongodb/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Avoid Storing Data Inside ""Admin"" When Using MongoDB"", “url”, “https://www.compose.com/articles/avoid-storing-data-inside-admin-when-using-mongodb/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Storing Network Addresses using PostgreSQL"", “url”, “https://www.compose.com/articles/storing-network-addresses-using-postgresql/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Mastering PostgreSQL Tools: Full-Text Search and Phrase Search"", “url”, “https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""How to Script Painless-ly in Elasticsearch"", “url”, “https://www.compose.com/articles/how-to-script-painless-ly-in-elasticsearch/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""MQTT and STOMP for Compose RabbitMQ"", “url”, “https://www.compose.com/articles/mqtt-and-stomp-for-compose-rabbitmq”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Elasticsearch 5.4.2 comes to Compose"", “url”, “https://www.compose.com/articles/elasticsearch-5-4-2-comes-to-compose”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Compose PostgreSQL powers up to 9.6"", “url”, “https://www.compose.com/articles/compose-postgresql-powers-up-to-9-6/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Introduction to Graph Databases"", “url”, “https://www.compose.com/articles/introduction-to-graph-databases/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Easier Java connections to MongoDB at Compose"", “url”, “https://www.compose.com/articles/easier-java-connections-to-mongodb-at-compose-2/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Graph 101: Magical Markov Chains"", “url”, “https://www.compose.com/articles/graph-101-magical-markov-chains/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Building Secure Instant API's with RESTHeart and Compose"", “url”, “https://www.compose.com/articles/building-secure-instant-apis-with-restheart-and-compose/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Compose Tips: Dates and Dating in MongoDB"", “url”, “https://www.compose.com/articles/understanding-dates-in-compose-mongodb/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""5-minute Signup Forms with Node-RED and Compose"", “url”, “https://www.compose.com/articles/5-minute-signup-with-node-red-and-compose/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Mongo Metrics: Calculating the Mode"", “url”, “https://www.compose.com/articles/mongo-metrics-calculating-the-mode/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Building Secure Distributed Javascript Microservices with RabbitMQ and SenecaJS"", “url”, “https://www.compose.com/articles/building-secure-distributed-javascript-microservices-with-rabbitmq-and-senecajs/”) gremlin> :> graph.tx().commit() Finally, let's add edges between our articles, authors, and tags so our graph is complete. At this point, you are entering quite a bit of data and, if you're using JanusGraph on Compose, your session might have timed out. 
Rather than using variable names to add edges to vertices like we did in the previous article , you can access them directly through the traversal object: gremlin> : ==>v[4112] Where the number inside of g.V() is the ID of the vertex. If you're not sure what the ID is of the vertex you're looking for, you can use the valueMap() method to figure it out: gremlin> : ==>{name=[Abdullah Alger], id=8208, label=person} ==>{name=[John O'Connor], id=4112, label=person} ==>{name=[DJ Walker-Morgan], id=4208, label=person} First, we'll add edges between each of our articles and authors. Since we may have disconnected by now, we'll use the id of each article to add the edges. We can find that by using the following command: gremlin> : ==>{label=article, id=45168, name=[Building Secure Distributed Javascript Microservices with RabbitMQ and SenecaJS], url=[https://www.compose.com/articles/building-secure-distributed-javascript-microservices-with-rabbitmq-and-senecajs/]} ==>{label=article, id=12496, name=[MQTT and STOMP for Compose RabbitMQ], url=[https://www.compose.com/articles/mqtt-and-stomp-for-compose-rabbitmq]} ... Use the command above to find the id s of your articles, and remember to change the ID in the g.V() command below with the id of those article you want to add an edge to: gremlin> : ==>v[4112] gremlin> :> g.V(36976).next().addEdge(""author"", john) ==>e[he6-sj4-5jp-368][36976-author->4112] gremlin> :> g.V(41072).next().addEdge(""author"", john) ==>e[hse-vow-5jp-368][41072-author->4112] gremlin> :> g.V(4120).next().addEdge(""author"", john) ==>e[1z7-36g-5jp-368][4120-author->4112] gremlin> :> g.V(16472).next().addEdge(""author"", john) ==>e[7ij-cpk-5jp-368][16472-author->4112] gremlin> :> g.V(20568).next().addEdge(""author"", john) ==>e[7wr-fvc-5jp-368][20568-author->4112] gremlin> :> def abdullah = g.V(8208).next() ==>v[8208] gremlin> :> g.V(12496).next().addEdge(""author"", abdullah) ==>e[4re-9n4-5jp-6c0][12496-author->8208] gremlin> :> g.V(12400).next().addEdge(""author"", abdullah) ==>e[i6m-9kg-5jp-6c0][12400-author->8208] gremlin> :> g.V(24688).next().addEdge(""author"", abdullah) ==>e[iku-j1s-5jp-6c0][24688-author->8208] gremlin> :> g.V(16496).next().addEdge(""author"", abdullah) ==>e[iz2-cq8-5jp-6c0][16496-author->8208] gremlin> :> g.V(20592).next().addEdge(""author"", abdullah) ==>e[jda-fw0-5jp-6c0][20592-author->8208] gremlin> :> g.V(8400).next().addEdge(""author"", abdullah) ==>e[55m-6hc-5jp-6c0][8400-author->8208] gremlin> : ==>v[4208] gremlin> :> g.V(12496).next().addEdge(""author"", dj) ==>e[5ju-9n4-5jp-38w][12496-author->4208] gremlin> :> g.V(32880).next().addEdge(""author"", dj) ==>e[jri-pdc-5jp-38w][32880-author->4208] gremlin> :> g.V(28784).next().addEdge(""author"", dj) ==>e[k5q-m7k-5jp-38w][28784-author->4208] gremlin> :> g.V(12376).next().addEdge(""author"", dj) ==>e[8az-9js-5jp-38w][12376-author->4208] gremlin> :> g.V(12424).next().addEdge(""author"", dj) ==>e[4cx-9l4-5jp-38w][12424-author->4208] gremlin> :> graph.tx().commit() ==>null Now, we'll add an edge for each of our topics. 
gremlin> :> g.V(45168).next().addEdge(""topic"", rabbit) ==>e[odxce-yuo-28lx-37c][45168-topic->4152] gremlin> :> g.V(45168).next().addEdge(""topic"", nodejs) ==>e[odxqm-yuo-28lx-6fc][45168-topic->8328] gremlin> :> g.V(12496).next().addEdge(""topic"", rabbit) ==>e[odxcq-9n4-28lx-37c][12496-topic->4152] gremlin> :> g.V(12400).next().addEdge(""topic"", mongodb) ==>e[ody4u-9kg-28lx-6eo][12400-topic->8304] gremlin> :> g.V(36976).next().addEdge(""topic"", janus) ==>e[odyj2-sj4-28lx-39k][36976-topic->4232] gremlin> :> g.V(32880).next().addEdge(""topic"", mongodb) ==>e[odyxa-pdc-28lx-6eo][32880-topic->8304] gremlin> :> g.V(41072).next().addEdge(""topic"", mongodb) ==>e[odzbi-vow-28lx-6eo][41072-topic->8304] gremlin> :> g.V(24688).next().addEdge(""topic"", elastic) ==>e[odzpq-j1s-28lx-388][24688-topic->4184] gremlin> :> g.V(4120).next().addEdge(""topic"", mongodb) ==>e[odxc3-36g-28lx-6eo][4120-topic->8304] gremlin> :> g.V(28784).next().addEdge(""topic"", janus) ==>e[oe03y-m7k-28lx-39k][28784-topic->4232] gremlin> :> g.V(16496).next().addEdge(""topic"", postgres) ==>e[oe0i6-cq8-28lx-6e0][16496-topic->8280] gremlin> :> g.V(12376).next().addEdge(""topic"", postgres) ==>e[odxcb-9js-28lx-6e0][12376-topic->8280] gremlin> :> g.V(16472).next().addEdge(""topic"", mongodb) ==>e[odxqj-cpk-28lx-6eo][16472-topic->8304] gremlin> :> g.V(20568).next().addEdge(""topic"", nodered) ==>e[ody4r-fvc-28lx-3bk][20568-topic->4304] gremlin> :> g.V(20568).next().addEdge(""topic"", mongodb) ==>e[odyiz-fvc-28lx-6eo][20568-topic->8304] gremlin> :> g.V(12424).next().addEdge(""topic"", elastic) ==>e[odxch-9l4-28lx-388][12424-topic->4184] gremlin> :> g.V(20592).next().addEdge(""topic"", postgres) ==>e[oe0we-fw0-28lx-6e0][20592-topic->8280] gremlin> :> g.V(8400).next().addEdge(""topic"", mongodb) ==>e[odxqy-6hc-28lx-6eo][8400-topic->8304] gremlin> :> graph.tx().commit() ==>null If you're paying close attention, you'll notice that I actually doubled-up on some of those topics. One of the most useful things about graph databases is the ability to model relationships as you discover them, rather than having to plan out these relationships ahead of time (as you would with a relational database). We're able to connect multiple topics to the same article simply by adding another edge to the article node. Now that we have our graph put together, let's run a quick test by querying JanusGraph for all of the articles written by Abdullah: gremlin> :> g.V(abdullah).in(""author"").values(""name"") ==>Avoid Storing Data Inside 'Admin' When Using MongoDB ==>Taking a Look at Robomongo and Studio 3T with Compose for MongoDB ==>MQTT and STOMP for Compose RabbitMQ ==>Storing Network Addresses using PostgreSQL ==>Mastering PostgreSQL Tools: Full-Text Search and Phrase Search ==>How to Script Painless-ly in Elasticsearch And for fun, let's see all of the articles with a topic of mongodb : gremlin> :> def mongo = g.V().has(""name"", ""mongodb"").next() ==>v[8304] gremlin> :> g.V(mongo).in(""topic"").values(""name"") ==>Compose Tips: Dates and Dating in MongoDB ==>Avoid Storing Data Inside 'Admin' When Using MongoDB ==>Taking a Look at Robomongo and Studio 3T with Compose for MongoDB ==>Building Secure Instant API's with RESTHeart and Compose ==>5-minute Signup Forms with Node-RED and Compose ==>Easier Java connections to MongoDB at Compose ==>Mongo Metrics: Calculating the Mode That looks about right - we can now ask JanusGraph to find all of the articles written on a particular topic or by a particular author. 
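These traversals also compose, which is closer to what the chatbot will eventually need. As a sketch (mine, not from the original article, but using the same labels and property names we created above), here is a query for articles by a given author on a given topic:

gremlin> :> g.V().has(""name"", ""Abdullah Alger"").in(""author"").where(__.out(""topic"").has(""name"", ""postgres"")).values(""name"")

The traversal starts at the author vertex, walks back along the author edges to that author's articles, and keeps only the articles that also have a topic edge to the postgres tag, so it should come back with the PostgreSQL articles we attributed to that author above (the exact hits depend on how you wired your edges).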
Now, let's see how we can bring these together by connecting JanusGraph up with our Node-RED application. CONNECTING TO JANUSGRAPH FROM NODE-RED We've been building our chatbot with Node-RED hosted on Bluemix, and now it's time to connect our JanusGraph instance to it. The JanusGraph HTTP API can be used to execute gremlin queries using HTTP, so we'll try this out by using the HTTP Request node in Node-RED. JanusGraph exposes a single HTTP POST endpoint to execute Gremlin queries. The endpoint expects a JSON-formatted document with a single key (gremlin) that has the value of your Gremlin query: { ""gremlin"": ""YOUR_GREMLIN_QUERY_HERE"" } This API is stateless which means that, unlike using Gremlin from the command line, we won't be able to use variables across commands. We'll also need to open the graph each time we want to use it (remember, the graph and g we used previously won't be available to us. Connecting to the API is a two-step process: first, we'll need a session token we can use to authenticate our web calls. These tokens have a timeout of 60 minutes, so we'll need to refresh the tokens periodically. Once we have the token, we'll be able to send requests to JanusGraph with the token in the header of our call. GENERATING A SESSION TOKEN First, we'll need to generate the session token. Let's start by just using a simple inject node to test our session token web call. Drag an inject node, an http request node, and a debug node onto the canvas. Double-click the http request node and give it a name of JG Auth . Wire them all up so they look like the following: Then, double-click the JG Auth node to configure it with a method of GET and a URL using the connection string from the Gremlin using Token Authentication section of the Compose dashboard: Wire them up, click deploy , and click on the button next to the inject node. You should see something like this in the debug panel: {""token"": """"} That's the session token you can now use to make requests to your JanusGraph instance. Now, let's send a request using that token. Drag another inject node, http request , and debug node onto the canvas, and this time drag a function node onto the canvas as well. Double click each of them to name them, giving the http request node a name of JG Request and the function node a name of JG Query . Then, wire them up like the following: Double-click the JG Query function node so we can add the token to the msg.header object and the query to our msg.payload object. We'll also configure our msg.url and msg.method here so we don't have to open the JG Request node, and we'll hard-code the token for now: msg.headers = { ""Authorization"": ""Token